diff options
Diffstat (limited to 'format-green-albatross.mdwn')
-rw-r--r-- | format-green-albatross.mdwn | 257 |
1 files changed, 0 insertions, 257 deletions
diff --git a/format-green-albatross.mdwn b/format-green-albatross.mdwn deleted file mode 100644 index b2ef215..0000000 --- a/format-green-albatross.mdwn +++ /dev/null @@ -1,257 +0,0 @@ -[[!meta title="Repository format 'Green Albatross'"]] - -This page describes the repository format called 'Green Albatross'. It -is a preliminary format, intended to improve Obnam performance over -[[format-6]]. Current development status is PONDERING; -implementatation has started, but everything on this -page may and probably will change. Until declared stable, the on-disk -data structured WILL change without warning. Only use this format to -help with its development. - - -Introduction -============ - -For background and the big picture, please read the [[ondisk]] page. -This page only discusses details specific to this format. - -This repository format takes the following approach: - -* Don't use the [Larch](http://liw.fi/larch/) B-tree implementation. - It's intricate code and insufficiently fast. Implement any data - structures directly. -* With few, specific exceptions, repository files are never - updated. Everything is done copy-on-write. This enables caching. The - exceptions are root nodes of DAGs, so that it's easy to know where - the DAG starts. -* Data is stored in objects of various types. Objects may be small, - and to avoid having a file per object, objects are collected into - bags. This includes chunks. Each bag is stored in its own file. - Objects are identified by the bag identifier, plus an index into the - bag. -* Objects are stored in a suitable serialisation encoding. - - This is not Python pickles, since Obnam can't assume those are - stable over the lifetime of a backup repository. - - JSON is not used, since JSON is not suitable for storing binary - data, such as filenames, without adding an encoding layer on top - of JSON. - - Previously this was specified as YAML, but YAML is not fast to - parse. So it's not YAML. - - Instead, a simple, custom binary encoding will be used. This - will encode ints, booleans, binary strings, or lists or dicts - (dict keys being lists). A quick prototype shows this to be easy - (worked the first time), and fairly fast even without - optimisation. -* Encryption is done exactly like in [[format-6]]. - - -Bags -==== - -A bag is used to store a number of blobs. Bags are identified by a -random 64-bit integer. This is used as the filename of the bag. The -bags are stored in a 3-level directory structure, using the top three -octets of the bag id as the directory names. Thus, a bag whose id in -hexadecimal is 0x1234567890abcdef would be stored as -`12/34/56/1234567890abcdef.bag`. - -A bag is implemented as a Python `dict` object: - - { - 'bag-id': 'cafef00d', - 'blobs': [...], - } - -The `items` field contains the blobs. Each blob may be an arbitrary -byte string (for chunks), or an encoded Python object. - - -Object identifiers -================== - -Object identifiers are a pair consisting of the bag id and an index -into the bag. Since the identifiers are used frequently, it is -practical to store them as a unit rather than as a pair. Further, they -will be visible to the user (and, especially, the developer), so the -following syntax is used: - - BAGID.INDEX - -For example, the first and third objects stored in the bag with id -0x1234567890abcdef would be: - - 1234567890abcdef.0 - 1234567890abcdef.2 - -Note the use of hexadecimal for the bag id (so all bag identifiers are -of the same length), and indexing in decimal, starting from zero. - -We will keep bags effectively immutable so that an object id does not -need to change. This means that a bag may contain unused objects. If -it turns out that that's wasting too much data, we can "pack" bags by -replacing the unused blobs with empty values (Python's None) to save -space. This mutates a bag, but only in ways that (correct) users won't -notice. - - -Client list -=========== - -The client list is stored as `client-list/data.bag` in the repository, -and each item in the bag has the following structure: - - { - 'client-name': 'foo', - 'client-id': 123, - 'encryption-key': None, - } - - -Chunks -====== - -Chunks are stored in bags. The chunk data is just a binary blob. - - -Chunk indexes -============= - -Chunk indexes map a checksum (using the user's chosen algorithm) to a -list of chunk ids, and a chunk id to a list of client ids. The root -object of the indexes is: - - { - 'checksums': { - 'algorithm': 'SHA-1', - 'root-id': '1234567890abcdef.1', - }, - 'used-by-tree': '1234567890abcdef.0,, - } - - -Checksum to chunk ids ---------------------- - -The mapping from a checksum value to a list of chunk ids is done using -a lookup tree that is vaguely similar to a B-tree. The tree contains -index nodes and leaf nodes. Leaf nodes store the actual mappings: - - { - 'e242ed3bffccdf271b7fbaf34ed72d089537b42f': [ - '1234567890abcdef.3', - '1234567890abcdef.4', - ], - 'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15': [ - '1234567890abcdef.5', - ], - } - -In other words, a leaf node is a `dict` mapping a checksum to a list -of chunk ids whose content has that checksum. Note that the list may -be longer than one item. - -An index node looks like this: - - [ - { - 'first-checksum': 'e242ed3bffccdf271b7fbaf34ed72d089537b42f', - 'last-checksum': 'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15', - 'leaf-id': '1234567890abcdef.2', - 'index-id': None, - }, - ] - -In other words: - -* The index node is a list of mappings, where each mapping corresponds - to an object on the next level in the lookup tree. -* `first-checksum` is the smallest checksum in the sub-tree being - referenced. -* `last-checksum` is the largest checksum. -* `leaf-id` is the object id of the leaf node, assuming the next level - is leaf nodes. -* `index-id` is the object id of the index node, assuming the next - level is index nodes. - -The lookup tree is created in a copy-on-write manner. No node is ever -overwritten, but it may be deleted after it is no longer referenced. -The tree is not kept in balance, to keep the code maintaining as -simple as possible. - -When a new checksum is inserted into the lookup tree, it is added to -an existing leaf node by creating a new leaf node that is a copy of -the old one, and adding the new checksum to the new leaf. A big leaf -node is split in half. Any index nodes on the path to an updated leaf -node get updated. - - -Chunk id to client ids ----------------------- - -This tree is similar in structure as the checksum tree, but index nodes -look like this: - - [ - { - 'first-client-id': '1234567890abcdef.50', - 'last-client-id': '1234567890abcdef.51', - 'leaf-id': '1234567890abcdef.49', - }, - ] - -Leaf nodes look like this: - - { - '1234567890abcdef.32': [ 123, 456 ], - } - -Leaf nodes map a chunk id to a list of ids of clients that use the -chunk. - - - -Per-client data -=============== - -The data stored for each client separately starts with a CLIENT -object: - - { - 'client-name': 'foo', - 'generations': ['1234567890abcdef.77'], - } - -Each generation is a GEN object: - - { - 'started': '2015-04-06T17.35:41', - 'ended': '2015-04-06T17.35:42', - 'checkpoint': False, - 'root-dir': '1234567890abcdef.77', - } - -Each directory in the live data is stored in a DIR object. The object -stores the metadata for the directory, plus the basenames and metadata -for each file in the directory (anything thing that isn't a -subdirectory), plus the basename and DIR object id of each -subdirectory. - - { - 'metadata': { - 'st_mode': 0777, - 'st_uid': 0, - 'username': 'root', - }, - 'files': - 'README': - 'metadata': - 'st_mode': 0644, - 'st_uid': 0, - 'username': 'root' - 'subdirs': - '.git': '1234567890abcdef.127', - } - -Again, nothing is updated in-place, everything is updated -copy-on-write. When a node is updated, its parent all the way to the -root directory also get updated. |