summaryrefslogtreecommitdiff
path: root/format-green-albatross.mdwn
diff options
context:
space:
mode:
Diffstat (limited to 'format-green-albatross.mdwn')
-rw-r--r--format-green-albatross.mdwn257
1 files changed, 0 insertions, 257 deletions
diff --git a/format-green-albatross.mdwn b/format-green-albatross.mdwn
deleted file mode 100644
index b2ef215..0000000
--- a/format-green-albatross.mdwn
+++ /dev/null
@@ -1,257 +0,0 @@
-[[!meta title="Repository format 'Green Albatross'"]]
-
-This page describes the repository format called 'Green Albatross'. It
-is a preliminary format, intended to improve Obnam performance over
-[[format-6]]. Current development status is PONDERING;
-implementatation has started, but everything on this
-page may and probably will change. Until declared stable, the on-disk
-data structured WILL change without warning. Only use this format to
-help with its development.
-
-
-Introduction
-============
-
-For background and the big picture, please read the [[ondisk]] page.
-This page only discusses details specific to this format.
-
-This repository format takes the following approach:
-
-* Don't use the [Larch](http://liw.fi/larch/) B-tree implementation.
- It's intricate code and insufficiently fast. Implement any data
- structures directly.
-* With few, specific exceptions, repository files are never
- updated. Everything is done copy-on-write. This enables caching. The
- exceptions are root nodes of DAGs, so that it's easy to know where
- the DAG starts.
-* Data is stored in objects of various types. Objects may be small,
- and to avoid having a file per object, objects are collected into
- bags. This includes chunks. Each bag is stored in its own file.
- Objects are identified by the bag identifier, plus an index into the
- bag.
-* Objects are stored in a suitable serialisation encoding.
- - This is not Python pickles, since Obnam can't assume those are
- stable over the lifetime of a backup repository.
- - JSON is not used, since JSON is not suitable for storing binary
- data, such as filenames, without adding an encoding layer on top
- of JSON.
- - Previously this was specified as YAML, but YAML is not fast to
- parse. So it's not YAML.
- - Instead, a simple, custom binary encoding will be used. This
- will encode ints, booleans, binary strings, or lists or dicts
- (dict keys being lists). A quick prototype shows this to be easy
- (worked the first time), and fairly fast even without
- optimisation.
-* Encryption is done exactly like in [[format-6]].
-
-
-Bags
-====
-
-A bag is used to store a number of blobs. Bags are identified by a
-random 64-bit integer. This is used as the filename of the bag. The
-bags are stored in a 3-level directory structure, using the top three
-octets of the bag id as the directory names. Thus, a bag whose id in
-hexadecimal is 0x1234567890abcdef would be stored as
-`12/34/56/1234567890abcdef.bag`.
-
-A bag is implemented as a Python `dict` object:
-
- {
- 'bag-id': 'cafef00d',
- 'blobs': [...],
- }
-
-The `items` field contains the blobs. Each blob may be an arbitrary
-byte string (for chunks), or an encoded Python object.
-
-
-Object identifiers
-==================
-
-Object identifiers are a pair consisting of the bag id and an index
-into the bag. Since the identifiers are used frequently, it is
-practical to store them as a unit rather than as a pair. Further, they
-will be visible to the user (and, especially, the developer), so the
-following syntax is used:
-
- BAGID.INDEX
-
-For example, the first and third objects stored in the bag with id
-0x1234567890abcdef would be:
-
- 1234567890abcdef.0
- 1234567890abcdef.2
-
-Note the use of hexadecimal for the bag id (so all bag identifiers are
-of the same length), and indexing in decimal, starting from zero.
-
-We will keep bags effectively immutable so that an object id does not
-need to change. This means that a bag may contain unused objects. If
-it turns out that that's wasting too much data, we can "pack" bags by
-replacing the unused blobs with empty values (Python's None) to save
-space. This mutates a bag, but only in ways that (correct) users won't
-notice.
-
-
-Client list
-===========
-
-The client list is stored as `client-list/data.bag` in the repository,
-and each item in the bag has the following structure:
-
- {
- 'client-name': 'foo',
- 'client-id': 123,
- 'encryption-key': None,
- }
-
-
-Chunks
-======
-
-Chunks are stored in bags. The chunk data is just a binary blob.
-
-
-Chunk indexes
-=============
-
-Chunk indexes map a checksum (using the user's chosen algorithm) to a
-list of chunk ids, and a chunk id to a list of client ids. The root
-object of the indexes is:
-
- {
- 'checksums': {
- 'algorithm': 'SHA-1',
- 'root-id': '1234567890abcdef.1',
- },
- 'used-by-tree': '1234567890abcdef.0,,
- }
-
-
-Checksum to chunk ids
----------------------
-
-The mapping from a checksum value to a list of chunk ids is done using
-a lookup tree that is vaguely similar to a B-tree. The tree contains
-index nodes and leaf nodes. Leaf nodes store the actual mappings:
-
- {
- 'e242ed3bffccdf271b7fbaf34ed72d089537b42f': [
- '1234567890abcdef.3',
- '1234567890abcdef.4',
- ],
- 'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15': [
- '1234567890abcdef.5',
- ],
- }
-
-In other words, a leaf node is a `dict` mapping a checksum to a list
-of chunk ids whose content has that checksum. Note that the list may
-be longer than one item.
-
-An index node looks like this:
-
- [
- {
- 'first-checksum': 'e242ed3bffccdf271b7fbaf34ed72d089537b42f',
- 'last-checksum': 'f1d2d2f924e986ac86fdf7b36c94bcdf32beec15',
- 'leaf-id': '1234567890abcdef.2',
- 'index-id': None,
- },
- ]
-
-In other words:
-
-* The index node is a list of mappings, where each mapping corresponds
- to an object on the next level in the lookup tree.
-* `first-checksum` is the smallest checksum in the sub-tree being
- referenced.
-* `last-checksum` is the largest checksum.
-* `leaf-id` is the object id of the leaf node, assuming the next level
- is leaf nodes.
-* `index-id` is the object id of the index node, assuming the next
- level is index nodes.
-
-The lookup tree is created in a copy-on-write manner. No node is ever
-overwritten, but it may be deleted after it is no longer referenced.
-The tree is not kept in balance, to keep the code maintaining as
-simple as possible.
-
-When a new checksum is inserted into the lookup tree, it is added to
-an existing leaf node by creating a new leaf node that is a copy of
-the old one, and adding the new checksum to the new leaf. A big leaf
-node is split in half. Any index nodes on the path to an updated leaf
-node get updated.
-
-
-Chunk id to client ids
-----------------------
-
-This tree is similar in structure as the checksum tree, but index nodes
-look like this:
-
- [
- {
- 'first-client-id': '1234567890abcdef.50',
- 'last-client-id': '1234567890abcdef.51',
- 'leaf-id': '1234567890abcdef.49',
- },
- ]
-
-Leaf nodes look like this:
-
- {
- '1234567890abcdef.32': [ 123, 456 ],
- }
-
-Leaf nodes map a chunk id to a list of ids of clients that use the
-chunk.
-
-
-
-Per-client data
-===============
-
-The data stored for each client separately starts with a CLIENT
-object:
-
- {
- 'client-name': 'foo',
- 'generations': ['1234567890abcdef.77'],
- }
-
-Each generation is a GEN object:
-
- {
- 'started': '2015-04-06T17.35:41',
- 'ended': '2015-04-06T17.35:42',
- 'checkpoint': False,
- 'root-dir': '1234567890abcdef.77',
- }
-
-Each directory in the live data is stored in a DIR object. The object
-stores the metadata for the directory, plus the basenames and metadata
-for each file in the directory (anything thing that isn't a
-subdirectory), plus the basename and DIR object id of each
-subdirectory.
-
- {
- 'metadata': {
- 'st_mode': 0777,
- 'st_uid': 0,
- 'username': 'root',
- },
- 'files':
- 'README':
- 'metadata':
- 'st_mode': 0644,
- 'st_uid': 0,
- 'username': 'root'
- 'subdirs':
- '.git': '1234567890abcdef.127',
- }
-
-Again, nothing is updated in-place, everything is updated
-copy-on-write. When a node is updated, its parent all the way to the
-root directory also get updated.