format-green-albatross.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254

[[!meta title="Repository format 'Green Albatross'"]]

This page describes the repository format called 'Green Albatross'. It
is a preliminary format, intended to improve Obnam performance over
[[format-6]]. Current development status is PONDERING;
implementatation has started, but everything on this
page may and probably will change. Until declared stable, the on-disk
data structured WILL change without warning. Only use this format to
help with its development.


Introduction
============

For background and the big picture, please read the [[ondisk]] page.
This page only discusses details specific to this format.

This repository format takes the following approach:

* Don't use the [Larch](http://liw.fi/larch/) B-tree implementation.
  It's intricate code and insufficiently fast. Implement any data
  structures directly.
* With few, specific exceptions, repository files are never
  updated. Everything is done copy-on-write. This enables caching. The
  exceptions are root nodes of DAGs, so that it's easy to know where
  the DAG starts.
* Data is stored in objects of various types. Objects may be small,
  and to avoid having a file per object, objects are collected into
  bags. This includes chunks. Each bag is stored in its own file.
  Objects are identified by the bag identifier, plus an index into the
  bag.
* Objects are stored in a suitable serialisation encoding.
    - This is not Python pickles, since Obnam can't assume those are
      stable over the lifetime of a backup repository.
    - JSON is not used, since JSON is not suitable for storing binary
      data, such as filenames, without adding an encoding layer on top
      of JSON.
    - Previously this was specified as YAML, but YAML is not fast to
      parse. So it's not YAML.
    - Instead, a simple, custom binary encoding will be used. This
      will encode ints, booleans, binary strings, or lists or dicts
      (dict keys being lists). A quick prototype shows this to be easy
      (worked the first time), and fairly fast even without
      optimisation.
* Encryption is done exactly like in [[format-6]].


Bags
====

A bag is used to store a number of blobs. Bags are identified by a
random 64-bit integer. This is used as the filename of the bag. The
bags are stored in a 3-level directory structure, using the top three
octets of the bag id as the directory names. Thus, a bag whose id in
hexadecimal is 0x1234567890abcdef would be stored as
`12/34/56/1234567890abcdef.bag`.

A bag is implemented as a Python `dict` object:

    {
        'bag-id': 'cafef00d',
        'blobs': [...],
    }

The `items` field contains the blobs. Each blob may be an arbitrary
byte string (for chunks), or an encoded Python object.


Object identifiers
==================

Object identifiers are a pair consisting of the bag id and an index
into the bag. Since the identifiers are used frequently, it is
practical to store them as a unit rather than as a pair. Further, they
will be visible to the user (and, especially, the developer), so the
following syntax is used:

    BAGID.INDEX

For example, the first and third objects stored in the bag with id
0x1234567890abcdef would be:

    1234567890abcdef.0
    1234567890abcdef.2

Note the use of hexadecimal for the bag id (so all bag identifiers are
of the same length), and indexing in decimal, starting from zero.

We will keep bags effectively immutable so that an object id does not
need to change. This means that a bag may contain unused objects. If
it turns out that that's wasting too much data, we can "pack" bags by
replacing the unused blobs with empty values (Python's None) to save
space. This mutates a bag, but only in ways that (correct) users won't
notice.

Client list
===========

The client list is stored as `client-list/data.bag` in the repository,
and each item in the bag has the following structure:

    {
        'client-name': 'foo',
        'client-id': 123,
        'encryption-key': None,
    }

Chunks
======

Chunks are stored in bags. The chunk data is just a binary blob.


Chunk indexes
=============

Chunk indexes map a checksum (using the user's chosen algorithm) to a
list of chunk ids, and a chunk id to a list of client ids. The root
object of the indexes is:

    {
        'checksums': {
            'algorithm': 'SHA-1',
            'root-id': '1234567890abcdef.1',
        },
        'used-by-tree': '1234567890abcdef.0,,
    }

Checksum to chunk ids
---------------------

The mapping from a checksum value to a list of chunk ids is done using
a lookup tree that is vaguely similar to a B-tree. The tree contains
index nodes and leaf nodes. Leaf nodes store the actual mappings:

    {
        'e242ed3bffccdf271b7fbaf34ed72d089537b42f': [
            '1234567890abcdef.3',
            '1234567890abcdef.4',
        ],
        'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15': [
            '1234567890abcdef.5',
        ],
    }

In other words, a leaf node is a `dict` mapping a checksum to a list
of chunk ids whose content has that checksum. Note that the list may
be longer than one item.

An index node looks like this:

    [
        {
            'first-checksum': 'e242ed3bffccdf271b7fbaf34ed72d089537b42f',
            'last-checksum': 'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15',
            'leaf-id': '1234567890abcdef.2',
            'index-id': None,
        },
    ]

In other words:

* The index node is a list of mappings, where each mapping corresponds
  to an object on the next level in the lookup tree.
* `first-checksum` is the smallest checksum in the sub-tree being
  referenced.
* `last-checksum` is the largest checksum.
* `leaf-id` is the object id of the leaf node, assuming the next level
  is leaf nodes.
* `index-id` is the object id of the index node, assuming the next
  level is index nodes.

The lookup tree is created in a copy-on-write manner. No node is ever
overwritten, but it may be deleted after it is no longer referenced.
The tree is not kept in balance, to keep the code maintaining as
simple as possible.

When a new checksum is inserted into the lookup tree, it is added to
an existing leaf node by creating a new leaf node that is a copy of
the old one, and adding the new checksum to the new leaf. A big leaf
node is split in half. Any index nodes on the path to an updated leaf
node get updated.


Chunk id to client ids
----------------------

This tree is similar in structure as the checksum tree, but index nodes
look like this:

    [
        {
            'first-client-id': '1234567890abcdef.50',
            'last-client-id': '1234567890abcdef.51',
            'leaf-id': '1234567890abcdef.49',
        },
    ]

Leaf nodes look like this:

    {
        '1234567890abcdef.32': [ 123, 456 ],
    }

Leaf nodes map a chunk id to a list of ids of clients that use the
chunk.


Per-client data
===============

The data stored for each client separately starts with a CLIENT
object:

    {
        'client-name': 'foo',
        'generations': ['1234567890abcdef.77'],
    }

Each generation is a GEN object:

    {
        'started': '2015-04-06T17.35:41',
        'ended': '2015-04-06T17.35:42',
        'checkpoint': False,
        'root-dir': '1234567890abcdef.77',
    }

Each directory in the live data is stored in a DIR object. The object
stores the metadata for the directory, plus the basenames and metadata
for each file in the directory (anything thing that isn't a
subdirectory), plus the basename and DIR object id of each
subdirectory.

    {
        'metadata': {
            'st_mode': 0777,
            'st_uid': 0,
            'username': 'root',
        },
        'files':
            'README':
                'metadata':
                    'st_mode': 0644,
                    'st_uid': 0,
                    'username': 'root'
        'subdirs':
            '.git': '1234567890abcdef.127',
    }

Again, nothing is updated in-place, everything is updated
copy-on-write. When a node is updated, its parent all the way to the
root directory also get updated.