[[!meta title="Repository format 6"]]

NOTE: This was originally written as the sole description of Obnam's
on-disk repository storage. At the time, there was no other repository
format, so this text does not separate the generic parts from those
specific to this format. This page needs to be updated to discuss only
the format-specific parts. For the generic parts, see [[ondisk]].

[[!toc startlevel=2 levels=2]]

Obnam is a backup tool for making **backups to hard disks,** either
local ones or over the network. It does not work with tape drives or
optical disks. The **backup server is stupid:** it does
not run any Obnam-specific code, only ssh (sftp).

The location where
backups are stored is called the **backup store**, and may be shared between
multiple co-operating clients. The backup store resides on a backup server. The
computers that own the data being backed up are the **clients**.

Performance is obviously a consideration. It has many aspects:

* run-time
* RAM
* disk space, both server and client
* disk bandwidth, both server and client
* network bandwidth, between server and client

To achieve its performance goals, 
Obnam uses the B-tree variant devised by Ohad Rodeh[1], also
used by btrfs. This document describes how it does that.


Characteristics of the B-tree
-----------------------------

The **Rodeh B-trees** are designed for shadowing, i.e., 
**copy-on-write updates**
of tree nodes. Nodes are never modified in place; instead, a new copy
is created to replace them. This allows an efficient way to copy a tree and to update
the copy, while keeping the original tree intact.
The two trees will share as many nodes as possible.
Obnam calls the collection of related trees a **forest.**

The trees use **fixed-length binary strings as keys.**
For each key, a **variable-length binary string is stored as the value.**
The tree consists of interior nodes and leaf
nodes; all values are stored in leaves only. Nodes have a maximum size;
in some applications a node would be a raw disk block, but for Obnam
each node is a separate file.
Value strings can be as big as a node (minus a small header), but no bigger.

Lookups and removals from the tree can be done using **ranges:** all keys
within a range are looked up, or removed from the tree. This is an important
optimization and opens up some interesting design possibilities.


File data (chunk) storage
-------------------------

File data is not stored directly in B-trees, but externally. Data
is broken up into **chunks** of suitable size (say, 64 KiB) 
and assigned identifiers,
which also act as filenames. Given a chunk id, its contents can be easily
retrieved.

**Chunks are shared** within and across clients. 
This allows clients to avoid
uploading data that another client has already uploaded,
or storing the same data twice. The same mechanism allows Obnam to
efficiently back up files that have moved, or that have been modified,
or both at the same time.
If a chunk of data already exists in the store, Obnam tries hard to avoid
having to back it up again.

To back up a chunk, Obnam chooses a random, unused 64-bit identifier,
and uploads a file of that name. The next chunk uses the next
identifier, until it hits one that has been used, at which point
it picks a new random one.
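
As a minimal illustration of this scheme, here is a Python sketch of
picking chunk identifiers; `store_has_chunk` is a hypothetical helper
that checks whether a chunk file with the given id already exists in
the store.

    import random

    MAX_CHUNK_ID = 2**64 - 1

    def pick_chunk_id(store_has_chunk, previous_id=None):
        # Prefer the identifier following the previous one; fall back to
        # fresh random identifiers until an unused one is found.
        if previous_id is not None and previous_id < MAX_CHUNK_ID:
            if not store_has_chunk(previous_id + 1):
                return previous_id + 1
        while True:
            candidate = random.randint(0, MAX_CHUNK_ID)
            if not store_has_chunk(candidate):
                return candidate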

Chunks are managed using reference counts. We will cover these later,
after discussing other details of the store. See
section "Chunks, part 2" below.

Overview of store forests
-------------------------

The backup store consists of several forests:

* client list: map client names to 64-bit identifiers
* chunk list: map chunk identifiers to checksums
* chunk checksums: map chunk checksum to identifiers
* per-client data: backup generations for each client

Locking
-------

Locking is done only for writes. 
Reads are always allowed, even while write locks exist.
This allows race conditions between readers and writers,
but thanks to copy-on-write updates
those are no different from files getting corrupted or
deleted on the server by hardware failures,
and can be treated the same.

Each forest is locked separately.
If more than one forest needs to be locked at the same time,
they are locked in an order sorted by the name of the forest,
using the C locale.
If a client fails to lock a particular forest,
it releases all the locks it holds,
and tries again.
This avoids deadlocks,
but allows starvation.
The most likely cause of starvation is too many clients sharing the
same store; that needs to be solved by increasing backup server
performance, or by having more stores.
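
A minimal sketch of this deadlock-avoidance scheme, assuming
hypothetical `acquire_lock` and `release_lock` helpers:

    import time

    def lock_forests(forest_names, acquire_lock, release_lock):
        # Sorting ASCII forest names matches C-locale byte order.
        names = sorted(forest_names)
        while True:
            held = []
            for name in names:
                if acquire_lock(name):
                    held.append(name)
                else:
                    # Failed to lock one forest: release everything and retry.
                    for locked in reversed(held):
                        release_lock(locked)
                    held = []
                    break
            else:
                return held
            time.sleep(1)   # back off a little before retrying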

Locks are implemented as files, which are created atomically.
Each lock file contains the name of the host that holds it
(which might not be a backup client),
the process id on that host, and the time the lock was created.
If the time is very old, another client may decide to break the lock.
If the backup store is encrypted, the contents of the
lock file are also encrypted.
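
A sketch of atomic lock file creation on a local filesystem, using
`O_CREAT|O_EXCL`; over sftp an equivalent atomic operation is needed.
The lock file contents shown are illustrative.

    import os
    import socket
    import time

    def acquire_lock(lock_path):
        try:
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except OSError:
            return False   # someone else already holds the lock
        content = '%s %d %d\n' % (socket.gethostname(), os.getpid(),
                                  int(time.time()))
        os.write(fd, content.encode('ascii'))
        os.close(fd)
        return True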

To reduce lock congestion, each client attempts to keep a lock for as
short a time as possible. For per-client data, this means keeping the lock
for the duration of the backup. For shared forests, updates can be
spooled: the shared forest is used read-only until the end of the backup
run, or until a checkpoint, and updated then, as quickly as possible.

Client list forest
----------------

The client list forest consists of a single tree.
Its key consists of a tuple:

* 128 bits of MD5 checksum of the name of the client
* 64 bits of client id (for hash collision avoidance)

The client name is typically its host name, but might be anything.
It is up to the sysadmins of the clients to co-operate so that each
client has a unique name.

It is unlikely that there will be checksum collisions for client names,
but it is easy to avoid having to worry about them. The client id is
chosen randomly.
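
A sketch of how a client list key could be encoded as a fixed-length
binary string; the exact byte layout here is an illustrative
assumption, not normative.

    import hashlib
    import struct

    def client_list_key(client_name, client_id):
        name_hash = hashlib.md5(client_name.encode('utf-8')).digest()  # 128 bits
        return name_hash + struct.pack('>Q', client_id)                # plus 64 bits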

Chunk list forest
-----------------

The chunk list forest consists of a single tree.
Its key consists of a 64-bit integer containing a chunk identifier.
The value is the MD5 checksum for the chunk.

This tree is needed so that it is possible to quickly find the
checksum of a chunk, given its identifier, when removing generations
from the per-client forest.

Chunk checksum forest
---------------------

The chunk checksum forest uses a tuple as a key:

* 128-bit MD5 checksum of the data in the chunk
* 64-bit chunk id
* 64-bit client id

The chunk id is used to handle checksum collisions.
While MD5 collisions are not likely in general use,
they are certain for people who research cryptographic hashes,
so Obnam needs to be able to handle them.
When Obnam has a chunk it wants to back up,
it does a range lookup from (md5, 0) through (md5, chunk_id_max),
and then checks if the chunk is identical to any of the chunks
already in the store.
This checking is expensive, but safety may be more important than
performance here. (Obnam may provide an option to disable the
data comparison check, and rely only on the hashes.)

The value is empty.
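
A sketch of the de-duplication lookup just described. The
`tree.lookup_range` and `fetch_chunk` helpers are hypothetical
stand-ins for the B-tree range lookup and chunk retrieval.

    import hashlib
    import struct

    MAX_64 = 2**64 - 1

    def find_duplicate_chunk(tree, fetch_chunk, data):
        md5 = hashlib.md5(data).digest()
        min_key = md5 + struct.pack('>QQ', 0, 0)
        max_key = md5 + struct.pack('>QQ', MAX_64, MAX_64)
        for key, _value in tree.lookup_range(min_key, max_key):
            chunk_id = struct.unpack('>Q', key[16:24])[0]
            if fetch_chunk(chunk_id) == data:   # expensive, but safe
                return chunk_id
        return None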

Per-client forest
---------------

The per-client forest has one tree per backup generation or checkpoint,
and represents a snapshot of the filesystem at a specific time,
or as close to a snapshot as Obnam can make it. Obnam does not 
freeze the filesystem while the backup runs, so it cannot guarantee
a true snapshot.

The forest uses a tuple as a key:

* 8-bit prefix
* 64-bit main key
* 8-bit subkey type
* 64-bit subkey

The prefix is one of the following:

* 0 for filesystem metadata
* 1 for chunk references
* 2 for generation metadata
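
A sketch of packing such a key into a fixed-length binary string,
following the layout above (the byte order chosen here is an
illustrative assumption):

    import struct

    # Prefixes, as listed above.
    FILESYSTEM_METADATA = 0
    CHUNK_REFERENCES = 1
    GENERATION_METADATA = 2

    def per_client_key(prefix, main_key, subkey_type, subkey):
        # 8-bit prefix, 64-bit main key, 8-bit subkey type, 64-bit subkey.
        return struct.pack('>BQBQ', prefix, main_key, subkey_type, subkey)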

**For prefix 0,** the main key is the file identifier.
See below for how file identifiers are generated.

The subkey type, subkey, and value are used as follows:

[[!table data="""
subkey type     | subkey        | value
0               | file-id       | full pathname to file
1               | counter 0...  | up to N chunk-ids
2               | 0             | number of chunk-ids stored
3               | 0...          | inode fields, user/group name, xattr, ...
"""]]

Subkey type 0 is used to handle hash collisions for the pathnames.
They are unlikely, but it is entirely unacceptable for Obnam to fail
if they happen. Each file has an identifier that is unique within the
generation: there is a one-to-one mapping between fully qualified pathnames
and file-ids within a generation (but not across generations). 

By default, the file-id is one half of
the 128-bit MD5 of the full pathname to the file. To check for collisions,
we see if the value for key (0, default-file-id, 0, file-id) is the full
pathname of the file we are interested in. If the key is missing, the
file-id is free; if the value is some other pathname, we generate a
new file-id by some deterministic approach, do the lookup again,
and repeat this until we find a free file-id.
Note that the default-file-id in the lookup stays the same; it is
always one half of the MD5 of the pathname.
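
A sketch of this lookup, with a hypothetical `tree.lookup` that takes a
tuple key (standing in for the packed binary key) and returns the
stored value or `None`. The "deterministic approach" shown, simply
incrementing the candidate id, is one possible choice, not necessarily
the one Obnam uses.

    import hashlib
    import struct

    def default_file_id(pathname):
        # One half (64 bits) of the MD5 of the full pathname.
        digest = hashlib.md5(pathname.encode('utf-8')).digest()
        return struct.unpack('>Q', digest[:8])[0]

    def find_file_id(tree, pathname):
        main_key = default_file_id(pathname)
        candidate = main_key
        while True:
            value = tree.lookup((0, main_key, 0, candidate))
            if value is None or value == pathname.encode('utf-8'):
                return candidate
            candidate = (candidate + 1) % 2**64   # deterministic next candidate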

Subkey type 1 is used to store a list of chunks that represent the contents
of the file. Since a file can be quite large, we store more than one chunk-id
per value, to reduce the overhead a bit. Benchmarking will determine the
suitable number of chunk-ids to put in each value, but probably the size
of the value should not be larger than about a quarter of a node.
To avoid having to do a range lookup to find the next counter, when
appending new chunk ids to an existing list, we store the number of items
at subkey type 2 (subkey 0).
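
A sketch of appending a chunk id to a file's chunk list, again with a
hypothetical tuple-keyed tree API and an illustrative number of chunk
ids per value:

    import struct

    CHUNK_IDS_PER_VALUE = 32   # to be chosen by benchmarking

    def append_chunk_id(tree, file_id, chunk_id):
        # The number of chunk ids already stored lives at (0, file_id, 2, 0).
        count_value = tree.lookup((0, file_id, 2, 0))
        count = struct.unpack('>Q', count_value)[0] if count_value else 0
        counter = count // CHUNK_IDS_PER_VALUE   # which value to append to
        key = (0, file_id, 1, counter)
        value = tree.lookup(key) or b''
        tree.insert(key, value + struct.pack('>Q', chunk_id))
        tree.insert((0, file_id, 2, 0), struct.pack('>Q', count + 1))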

Subkey type 3 is used for file metadata, such as inode data
(`st_mtime`, `st_size`, etc), the name of the owner and group for
the file, xattr fields, etc. To reduce space use, the most common
metadata is stored in subkey value 0, encoded in a special format.
Other subkey values are used for additional metadata, which also
allows for future expansion.

**For prefix 1,** the rest of the key is used like this:

* main key is the chunk-id
* subkey type is 0
* subkey is file-id for file that uses the chunk

For prefix 1, the value is always empty. Whenever Obnam backs up a 
file, it adds the key (1, chunk-id, 0, file-id). If the file existed
in the previous generation, but not in the new generation, it removes
the key.

**For prefix 2,** the main key is fixed as the hash of
the string "generation", and the subkey type is fixed at 0.
The subkey is one of the values in the following table.

[[!table data="""
subkey  |value
0       |generation id
1       |timestamp for the start of the generation
2       |timestamp for the end of the generation
3       |boolean (0 or 1) for whether the generation is a checkpoint
"""]]

Timestamps are expressed in UTC as seconds since the beginning of 1970
(i.e., the Unix epoch).

Chunks, part 2
--------------

When backing up, Obnam will do roughly this (see the sketch after the list):

* clone the tree for the previous generation
* traverse the filesystem tree, adding and removing files from the new tree
* when backing up a file's chunks, look up each chunk in the store, and add it
  if it is missing; after this, the id of the chunk is known
* add (1, chunk-id, 0, file-id) to the per-client forest,
  (chunk-id) to the chunk list forest,
  and (checksum, chunk-id, client-id) to the chunk checksum forest 
  for any new chunks it uses; the updates to the chunk list and checksum 
  forests can be batched to the end of the backup run
* remove (1, chunk-id, 0, file-id) from the per-client forest, 
  for every chunk for any file it removes from the new generation
  (no need to remove anything from the chunk checksum forest, since
  previous generations will still use the chunks)
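
A rough, simplified sketch of that flow in Python-like pseudocode; all
the helper names (`walk_filesystem`, `split_into_chunks`, the `store`
and `gen` objects, and the functions sketched in earlier sections) are
hypothetical.

    import hashlib

    def backup(store, client_id):
        gen = store.per_client.clone_previous_generation()
        new_chunks = []
        for pathname, metadata in walk_filesystem():
            file_id = find_file_id(gen, pathname)
            gen.store_metadata(file_id, pathname, metadata)
            for data in split_into_chunks(pathname):
                chunk_id = find_duplicate_chunk(store.chunk_checksums,
                                                store.fetch_chunk, data)
                if chunk_id is None:
                    chunk_id = store.upload_chunk(data)
                    new_chunks.append((hashlib.md5(data).digest(), chunk_id))
                gen.add_chunk_reference(chunk_id, file_id)   # key (1, chunk-id, 0, file-id)
        gen.remove_files_no_longer_present()
        # Updates to the shared forests are batched to the end of the run.
        for checksum, chunk_id in new_chunks:
            store.chunk_list.insert(chunk_id, checksum)
            store.chunk_checksums.insert((checksum, chunk_id, client_id), b'')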

When removing a generation, Obnam will do roughly this (see the sketch after the list):

* look up every chunk-id used in that generation
* look up each chunk-id in every other generation
* if not found in any other generation, look up the chunk's checksum
  from the chunk list forest, and remove
  (checksum, chunk-id, client-id) from the chunk checksum forest
* if there are no keys in the range (checksum, chunk-id, 0) through
  (checksum, chunk-id, client_id_max) in the chunk checksum forest, 
  then remove (chunk-id) from chunk list forest, and
  chunk file from disk
* remove the tree for the unwanted generation
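
A similarly rough sketch of generation removal, with hypothetical
helpers for the lookups described above:

    MAX_64 = 2**64 - 1

    def remove_generation(store, client_id, generation):
        for chunk_id in generation.chunk_ids():
            others = store.other_generations(generation)
            if any(g.uses_chunk(chunk_id) for g in others):
                continue   # still referenced by another generation
            checksum = store.chunk_list.lookup(chunk_id)
            store.chunk_checksums.remove((checksum, chunk_id, client_id))
            remaining = store.chunk_checksums.lookup_range(
                (checksum, chunk_id, 0), (checksum, chunk_id, MAX_64))
            if not remaining:
                # No client refers to the chunk any more.
                store.chunk_list.remove(chunk_id)
                store.remove_chunk_file(chunk_id)
        store.per_client.remove_tree(generation)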


Repository metadata
-------------------

There is a need to store some metadata about the repository.
This will be done in the `metadata` directory. Initially, the
only piece of information is the version of the on-disk format,
expressed as a single integer, stored in the file `metadata/format`.

Each version of Obnam will know which on-disk formats it supports.
With the `metadata/format` file it knows if it can handle a 
particular format.

Upgrades to newer formats will happen explicitly, not implicitly.
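
A small sketch of the format check, assuming `metadata/format` contains
a single decimal integer:

    import os

    SUPPORTED_FORMATS = [6]   # illustrative

    def check_repository_format(repo_path):
        with open(os.path.join(repo_path, 'metadata', 'format')) as f:
            version = int(f.read().strip())
        if version not in SUPPORTED_FORMATS:
            raise Exception('unsupported repository format: %d' % version)
        return version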


Encryption
----------

Obnam may optionally encrypt data in the backup store.
The design for this is unclear, but will need to solve at least
the following problems:

* each client should have access only to its own data, but still
  allow sharing of data across clients
* the server admin should be able to disable access for each client,
  or anyone who's compromised the client's encryption key
* the server admin should be able to do some operations on any client,
  such as expire generations, remove entire clients and their corresponding
  chunks, verify correctness of the whole data store ("fsck"), or
  restore data after the client has been entirely lost


On checksum algorithms
----------------------

This design uses MD5 checksums for everything.
It does not require the checksum algorithm to be cryptographically
secure; it is content with a hash that is unlikely to result in
collisions, and cares about the probability of collisions only for
performance, not correctness.
The design handles collisions, and does not assume two pieces of data
are identical just because the hashes are the same.
A shorter or faster checksum algorithm may be used instead,
or a cryptographically stronger one.
MD5 is not recommended for applications that rely on the security of
the hash, but this design does not rely on that.

However, it may be quite expensive to verify that data chunks with
the same checksum are the same. It may require reading back the chunk
from the server. This may not be acceptable to all users. Some users
may be willing to live with the chance of a checksum collision.
Other users might not want to pay the price for de-duplication.
Thus, it may be best for Obnam to provide a tri-state option:
full verification, rely on checksums, and no de-duplication. Further,
it may be a good idea to allow the user to specify the checksum
algorithm.


Comments?
---------

Edit this page directly, or mail me at <mailto:liw@liw.fi>. Thanks.


Author
------

Obnam is being developed by [Lars Wirzenius](http://liw.fi).
Feedback and patches are most welcome.


References
----------

1. Rodeh, Ohad: "B-trees, Shadowing, and Clones". IBM Haifa Research Labs.
   <http://portal.acm.org/citation.cfm?id=1326544>
   <http://www.cs.tau.ac.il/~ohadrode/papers/btree_TOS.pdf>


Changes
-------

* 2010-05-04: Modified the way hash collisions for filenames are
  handled, based on suggestion from Richard Braakman: store secondary
  hash and filename with primary hash.
* 2010-07-11: Minor tweaks.
* 2010-11-11: Change tense to current, add some notes and minor fixes.
* 2010-11-11: Add design for removal of chunks, some considerations for
  encryption, more details on how keys are used by various forests in
  the store, locking, and other clarifications.
* 2010-11-11: Refer to hosts as clients, for clarity.
* 2010-11-12: Typo fixes and checksum verification tri-state option
  suggestion from Richard Braakman.
* 2010-11-18: Add generation metadata to per-client forest.
* 2010-11-29: Add chunk list forest.