[[!meta title="Iteration planning: February 26 &ndash; March 12"]]
[[!meta date="Wed, 23 Feb 2022 18:52:11 +0200"]]
[[!tag meeting]]

[[!toc levels=2]]

# Assessment of the iteration that has ended

[previous iteration]: /blog/2022/01/07/planning
[obnam-benchmark-results]: https://gitlab.com/obnam/obnam-benchmark-results
[benchmark results page]: https://doc.obnam.org/obnam-benchmark-results/

The goal of the [previous iteration][] was:

> The goal for this iteration is to define an initial set of benchmarks
> for Obnam, and to run them, and to publish the results on
> doc.obnam.org. All of this should be made as automatic as possible.

This was completed. Running the benchmarks is still a manual step,
which Lars will do for every release going forward. The results will
be put into the [obnam-benchmark-results][] repository, which will
trigger CI to update the [benchmark results page][] with a summary of
the results.

This iteration was meant to also fix the following issues:

* [[!issue 174]] -- _Doesn't log performance metrics_
  - Lars created [[!mr 214]], but it's not merged yet. Alexander made
    good suggestions for tidying up the code, but Lars failed to make
    them work, and then got distracted by performance investigations.
    They can be worked on later.
* [[!issue 176]] -- _Doesn't report version with much detail_
  - Lars didn't work on this after all.
* [[!issue 180]] -- _Chunk metadata should be in AAD, not in headers?_
  - Lars didn't work on this after all.


The iteration ran over by a few weeks, mostly because the northern
hemisphere Darkness affected Lars's ability to be productive, and
partly because Lars got distracted by looking at improving Obnam
performance.

# Discussion

## Current development theme

The current theme of development for Obnam is performance, because
that is currently Lars's primary worry. The candidate themes, at least
for now, are performance, security, and convenience.

## Performance

Lars has been investigating where Obnam performance bottlenecks are,
by running benchmarks, and looking at profiling results from [cargo
flamegraph][]. For an Obnam run with a good number of files that
haven't changed, most of the time in Obnam goes into inserting rows
into an SQLite database for the new generation. This led Lars to do
some investigation into how fast he can make this happen.

Lars wrote a little program that creates an SQLite database and then
inserts a million rows into a table modelled after the Obnam `files`
table. The first, naive approach resulted in about 80,000 rows
inserted per second on his laptop, and nearly 120,000 on his
development server. After reading an [article by Jason Wyatt][], Lars
made the following changes:

* use a single transaction for all million inserts
* use the `rusqlite` prepared statement cache instead of preparing a
  new statement for each insert

The resulting speeds were (best speed of three runs, compiled in
release mode, on development server with NVMe drives):

program                       inserts/s
--------                    -----------
individual-insert                117509
individual-one-transaction       209512
individual-prepared.rs           970874

That's almost a million inserts per second. That'll do for now.
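
The two changes listed above can be illustrated with a minimal sketch.
This is not Lars's prototype, which was written in Rust with
`rusqlite`; it uses Python's standard `sqlite3` module as a stand-in.
The table schema, the column names, and the function names are all
hypothetical, since the post does not show the real `files` table.
Python's module caches compiled statements on its own, so the sketch
mainly shows the effect of the single transaction, not of the prepared
statement cache.

```python
import sqlite3
import time

# Hypothetical schema: a stand-in for the Obnam `files` table, whose
# real columns are not given in this post.
SCHEMA = "CREATE TABLE files (fileno INTEGER PRIMARY KEY, path BLOB, is_changed INTEGER)"
INSERT = "INSERT INTO files (fileno, path, is_changed) VALUES (?, ?, ?)"


def make_rows(n):
    """Generate n synthetic rows; the values are made up for illustration."""
    return ((i, f"/data/file-{i}".encode(), 0) for i in range(n))


def insert_naive(conn, n):
    """Commit after every insert, so each row gets its own transaction."""
    for row in make_rows(n):
        conn.execute(INSERT, row)
        conn.commit()


def insert_batched(conn, n):
    """Insert all rows in one transaction, reusing one parameterised statement."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(INSERT, make_rows(n))


def measure(insert, n, path=":memory:"):
    """Run one insert strategy on a fresh database; return (row count, rows/s)."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    start = time.perf_counter()
    insert(conn, n)
    elapsed = max(time.perf_counter() - start, 1e-9)
    count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    conn.close()
    return count, count / elapsed


if __name__ == "__main__":
    for insert in (insert_naive, insert_batched):
        count, rate = measure(insert, 100_000)
        print(f"{insert.__name__}: {count} rows, {rate:.0f} inserts/s")
```

With the default in-memory database the gap between the two strategies
is likely to be much smaller than on disk, because a commit does not
have to wait for storage. Passing a file name as `path` to a new,
empty location gives a closer analogue of the measurements above, but
the absolute numbers will not be comparable to the Rust results.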

Another approach might be to modify a copy of the previous generation,
but the logic gets trickier than with the approach of starting with an
empty database and inserting what we find in live data.

Lars also looked at what it would take to change the current Obnam
abstractions around SQLite to use the approach described above. He
feels the abstractions he originally wrote are messy and could do with
a cleaner design. He intends to work on that in the new iteration.

[cargo flamegraph]: https://crates.io/crates/flamegraph
[article by Jason Wyatt]: https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2

# Repository review

Lars didn't review any issues, merge requests, or CI pipelines this
time. He wants to work on database abstractions first.

# Goals

## Goal for 1.0 (not changed this iteration)

The goal for version 1.0 is for Obnam to be an utterly boring backup
solution for Linux command line users. It should just work, and be
performant, secure, and well-documented.

It is not a goal for version 1.0 to have been ported to other
operating systems, but if there are volunteers to do that, and to
commit to supporting their port, ports will be welcome.

Other user interfaces are likely to happen only after 1.0.

The server component will support multiple clients in a way that
doesn’t let them see each other’s data. It is not a goal for clients
to be able to share data, even if the clients trust each other.

## Goal for the next few iterations (not changed for this iteration)

The goal for the next few iterations is to have Obnam be performant.
This will include, at least, making the client use more concurrency so
that it can use more CPU cores to compute checksums for de-duplication.

## Goal for this iteration (new for this iteration)

The goal for this iteration is to tidy up the database abstraction
code in the Obnam client and to implement the performance improvements
that Lars prototyped.

# Commitments for this iteration

Lars will work on Obnam client database abstractions and performance.
The goal for these is for Obnam to be able to run `obnam backup` on a
live data set of a million files that haven't changed since the
previous backup in less than 30 seconds, on Lars's development server.
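
As a rough sanity check on this target, the numbers from this post can
be combined with a little arithmetic. The sketch below only divides
figures already stated here. It assumes one inserted row per unchanged
file, and it ignores everything else `obnam backup` does, such as
scanning the file system, which the prototype did not measure.

```python
# Figures taken from this post.
FILES = 1_000_000
TIME_BUDGET_S = 30
PROTOTYPE_INSERTS_PER_S = 970_874  # best prototype result in the Performance section

# Rate needed to meet the commitment, assuming one row per unchanged file.
required_rate = FILES / TIME_BUDGET_S

# Time the inserts alone would take at the prototype's best speed.
insert_time_s = FILES / PROTOTYPE_INSERTS_PER_S

print(f"required rate: {required_rate:.0f} files/s")              # 33333
print(f"insert time at prototype speed: {insert_time_s:.2f} s")   # 1.03
```

At the prototype's speed, the database inserts would use only about a
second of the 30 second budget, leaving the rest for the other work.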

This work is not captured in issues.

# Meeting participants

* Lars Wirzenius