[[!meta title="Repository format 6"]]

NOTE: This was originally written as the sole description of Obnam's
on-disk repository storage. At the time, there was no other repository
format, so this text does not separate what is generic from what is
specific. This page needs to be updated to discuss only the specific
parts. For the generic parts, see [[ondisk]].

[[!toc startlevel=2 levels=2]]

Obnam is a backup tool for making **backups to hard disks,** either
local ones or over the network. It does not work with tape drives or
optical disks. The **backup server is stupid:** it does not run any
Obnam-specific code, only ssh (sftp).

The location where backups are stored is called the **backup store**,
and it may be shared between multiple co-operating clients. The backup
store resides on a backup server. The computers that own the data being
backed up are the **clients**.

Performance is obviously a consideration. It has many aspects:

* run-time
* RAM
* disk space, both server and client
* disk bandwidth, both server and client
* network bandwidth, between server and client

To achieve its performance goals, Obnam uses the B-tree variant devised
by Ohad Rodeh[1], also used by btrfs. This document describes how it
does that.


Characteristics of the B-tree
-----------------------------

The **Rodeh B-trees** are designed for shadowing, i.e.,
**copy-on-write updates** of tree nodes. Nodes are never modified in
place; instead, a new copy is created to replace them. This allows an
efficient way to copy a tree and to update the copy, while keeping the
original tree intact. The two trees will share as many nodes as
possible. Obnam calls the collection of related trees a **forest.**

The trees use **fixed-length binary strings as keys.**
For each key, a **variable-length binary string is stored as the value.**
The tree consists of interior nodes and leaf nodes; all values are
stored in leaves only. There is a maximum size for nodes, which in some
applications would be a raw disk block, but which for Obnam is a
separate file. Value strings can be as big as a node (minus a small
header), but no bigger.

Lookups and removals from the tree can be done using **ranges:** all
keys within a range are looked up, or removed from the tree. This is an
important optimization and opens up some interesting design
possibilities.
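
To make the interface concrete, here is a minimal sketch of range
lookup and removal over fixed-length binary keys, with a sorted list
standing in for the real B-tree; `TinyTree` and its method names are
illustrative, not Obnam's actual API:

    # A toy stand-in for the B-tree interface: a sorted list of keys
    # with a parallel list of values. Real nodes live in files; this
    # only illustrates the key/value and range semantics.
    import bisect

    class TinyTree(object):

        def __init__(self):
            self.keys = []    # sorted, fixed-length binary strings
            self.values = []  # parallel list of variable-length values

        def insert(self, key, value):
            i = bisect.bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key:
                self.values[i] = value
            else:
                self.keys.insert(i, key)
                self.values.insert(i, value)

        def lookup_range(self, minkey, maxkey):
            # All (key, value) pairs with minkey <= key <= maxkey.
            lo = bisect.bisect_left(self.keys, minkey)
            hi = bisect.bisect_right(self.keys, maxkey)
            return list(zip(self.keys[lo:hi], self.values[lo:hi]))

        def remove_range(self, minkey, maxkey):
            # Remove every key within the range, in one operation.
            lo = bisect.bisect_left(self.keys, minkey)
            hi = bisect.bisect_right(self.keys, maxkey)
            del self.keys[lo:hi]
            del self.values[lo:hi]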


File data (chunk) storage
-------------------------

File data is not stored directly in B-trees, but externally. Data is
broken up into **chunks** of a suitable size (say, 64 KiB) and assigned
identifiers, which also act as filenames. Given a chunk id, its
contents can easily be retrieved.

**Chunks are shared** within and across clients. This allows clients to
avoid uploading data that another client has already uploaded, and to
avoid storing the same data twice. The same mechanism allows Obnam to
efficiently back up files that have moved, or that have been modified,
or both at the same time. If a chunk of data already exists in the
store, Obnam tries hard to avoid having to back it up again.

To back up a chunk, Obnam chooses a random, unused 64-bit identifier
and uploads a file of that name. The next chunk uses the next
identifier, until it hits one that has been used, at which point it
picks a new random one.
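
The allocation strategy might look like the following sketch;
`chunk_exists` stands in for an sftp existence check and is an
assumption, not Obnam's real code:

    import random

    MAX_CHUNK_ID = 2**64 - 1

    def pick_chunk_id(prev_id, chunk_exists):
        # Continue sequentially from the previous id; on the first
        # call, or when the candidate is already taken, fall back to a
        # fresh random id and try again.
        candidate = None if prev_id is None else (prev_id + 1) % (MAX_CHUNK_ID + 1)
        while candidate is None or chunk_exists(candidate):
            candidate = random.randint(0, MAX_CHUNK_ID)
        return candidate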

Chunks are managed using reference counts. We will cover these later,
after discussing other details of the store; see the section "Chunks,
part 2" below.

Overview of store forests
-------------------------

The backup store consists of several forests:

* client list: maps client names to 64-bit identifiers
* chunk list: maps chunk identifiers to checksums
* chunk checksums: maps chunk checksums to identifiers
* per-client data: backup generations for each client

Locking
-------

Locking is done only for writes. Reads are always allowed, even while
write locks exist. This permits race conditions between readers and
writers, but thanks to copy-on-write updates these are no different
from files being corrupted or deleted on the server by hardware
failure, and they can be treated the same way.

Each forest is locked separately. If more than one forest needs to be
locked at the same time, they are locked in an order sorted by forest
name, using the C locale. If a client fails to lock a particular
forest, it releases all the locks it holds and tries again. This avoids
deadlocks, but allows starvation. The most likely reason for starvation
is too many clients sharing the same store; that needs to be solved by
increasing backup server performance, or by having more stores.
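
The all-or-nothing acquisition might look like this sketch, where
`try_lock` and `unlock` are assumed primitives (the next paragraph
describes how a lock itself might be taken):

    import time

    def lock_forests(forest_names, try_lock, unlock, delay=1.0):
        # Lock in sorted order; for ASCII forest names a plain
        # byte-wise sort matches the C locale.
        names = sorted(forest_names)
        while True:
            held = []
            for name in names:
                if try_lock(name):
                    held.append(name)
                else:
                    # Failed: release everything and start over, so two
                    # clients can never deadlock waiting on each other.
                    for other in reversed(held):
                        unlock(other)
                    time.sleep(delay)
                    break
            else:
                return  # all locks acquired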

Locks are implemented as files, which are created atomically. Each lock
file contains the name of the host that holds it (which might not be a
backup client), the process id on that host, and the time the lock was
created. If the time is very old, another client may decide to break
the lock. If the backup store is encrypted, the contents of the lock
file are encrypted as well.
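
A sketch of atomic lock creation, with a local filesystem standing in
for the sftp server and an illustrative field layout:

    import os
    import socket
    import time

    def try_lock(lockfile):
        # O_CREAT | O_EXCL makes creation fail if the file already
        # exists, so exactly one client can win the race.
        try:
            fd = os.open(lockfile, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except OSError:
            return False
        fields = '%s\n%d\n%d\n' % (socket.gethostname(), os.getpid(),
                                   int(time.time()))
        os.write(fd, fields.encode('ascii'))
        os.close(fd)
        return True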

To reduce lock contention, each client attempts to hold a lock for as
short a time as possible. For per-client data, this means keeping the
lock for the duration of the backup. For shared forests, updates can be
spooled: the shared forest is used read-only until the end of the
backup run, or until a checkpoint, and is then updated as quickly as
possible.

Client list forest
------------------

The client list forest consists of a single tree. Its key consists of a
tuple:

* 128 bits: MD5 checksum of the name of the client
* 64 bits: client id (for hash collision avoidance)

The client name is typically the client's host name, but may be
anything. It is up to the sysadmins of the clients to co-operate so
that each client has a unique name.

Checksum collisions between client names are unlikely, but it is easy
not to have to worry about them at all: the client id is chosen
randomly, and tells colliding clients apart.
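
Packing the key might look like this; the helper name is illustrative:

    import hashlib
    import struct

    def client_list_key(client_name, client_id):
        # 16 bytes of MD5 over the client name, then the 64-bit client
        # id, big-endian, for a fixed-length 24-byte key.
        md5 = hashlib.md5(client_name.encode('utf-8')).digest()
        return md5 + struct.pack('>Q', client_id)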

Chunk list forest
-----------------

The chunk list forest consists of a single tree. Its key is a 64-bit
integer containing a chunk identifier. The value is the MD5 checksum of
the chunk.

This tree is needed so that, when removing generations from the
per-client forest, the checksum of a chunk can be found quickly given
its identifier.

Chunk checksum forest
---------------------

The chunk checksum forest uses a tuple as a key:

* 128-bit MD5 checksum of the data in the chunk
* 64-bit chunk id
* 64-bit client id

The chunk id is used to handle checksum collisions. While MD5
collisions are unlikely in general use, they are certain for people who
research cryptographic hashes, so Obnam needs to be able to handle
them. When Obnam has a chunk it wants to back up, it does a range
lookup from (md5, 0) through (md5, chunk_id_max), and then checks
whether the chunk is identical to any of the chunks already in the
store. This checking is expensive, but safety may be more important
than performance here. (Obnam may provide an option to disable the data
comparison check and rely only on the hashes.)

The value is empty.
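
The de-duplication lookup described above might look like this sketch;
the tree is assumed to offer `lookup_range()`, and `fetch_chunk` is an
assumed helper that reads a chunk's contents by id:

    import hashlib
    import struct

    MAX64 = 2**64 - 1

    def find_identical_chunk(tree, fetch_chunk, data):
        md5 = hashlib.md5(data).digest()
        # Range over every (chunk id, client id) under this checksum.
        minkey = md5 + struct.pack('>QQ', 0, 0)
        maxkey = md5 + struct.pack('>QQ', MAX64, MAX64)
        for key, _ in tree.lookup_range(minkey, maxkey):
            chunk_id = struct.unpack('>Q', key[16:24])[0]
            # The checksum matches; compare the data byte for byte
            # before trusting it, since collisions are possible.
            if fetch_chunk(chunk_id) == data:
                return chunk_id
        return None  # no identical chunk in the store yet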

Per-client forest
-----------------

The per-client forest has one tree per backup generation or checkpoint,
each representing a snapshot of the filesystem at a specific time, or
as close to a snapshot as Obnam can make it. Obnam does not freeze the
filesystem while the backup runs, so it cannot guarantee a true
snapshot.

The forest uses a tuple as a key:

* 8-bit prefix
* 64-bit main key
* 8-bit subkey type
* 64-bit subkey

The prefix is one of the following:

* 0 for filesystem metadata
* 1 for chunk references
* 2 for generation metadata
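
Packed big-endian, so that byte-wise key comparison matches numeric
order, the key might be built like this (the constant names are
illustrative):

    import struct

    # Illustrative names for the three prefixes listed above.
    FILE_METADATA, CHUNK_REFS, GENERATION_META = 0, 1, 2

    def per_client_key(prefix, main_key, subkey_type, subkey):
        # 1 + 8 + 1 + 8 = 18 bytes, big-endian throughout.
        return struct.pack('>BQBQ', prefix, main_key, subkey_type, subkey)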

**For prefix 0,** the main key is the file identifier. See below for
how file identifiers are generated.

The subkey type, subkey, and value are used as follows:

[[!table data="""
subkey type | subkey | value
0 | file-id | full pathname to file
1 | counter 0... | up to N chunk-ids
2 | 0 | number of chunk-ids stored
3 | 0... | inode fields, user/group name, xattr, ...
"""]]

Subkey type 0 is used to handle hash collisions between pathnames. They
are unlikely, but it is entirely unacceptable for Obnam to fail if they
happen. Each file has an identifier that is unique within the
generation: there is a one-to-one mapping between fully qualified
pathnames and file-ids within a generation (but not across
generations).

By default, the file-id is one half of the 128-bit MD5 of the full
pathname to the file. To check for collisions, we see if the value for
key (0, default-file-id, 0, file-id) is the full pathname of the file
we are interested in. If it is not, we generate a new file-id by some
deterministic approach, do the lookup again, and repeat until we find a
free file-id. Note that the default-file-id in the lookup stays the
same: it is always the half of the MD5 of the pathname.
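
A sketch of that loop; `tree.lookup` is assumed to return `None` for a
missing key, and the re-hashing step is one possible deterministic
approach, not necessarily the one Obnam will use:

    import hashlib
    import struct

    def default_file_id(pathname):
        # One half of the 128-bit MD5 of the full pathname.
        md5 = hashlib.md5(pathname.encode('utf-8')).digest()
        return struct.unpack('>Q', md5[:8])[0]

    def find_file_id(tree, pathname):
        main_key = default_file_id(pathname)
        candidate = main_key
        while True:
            key = struct.pack('>BQBQ', 0, main_key, 0, candidate)
            stored = tree.lookup(key)
            if stored is None or stored == pathname.encode('utf-8'):
                return candidate  # free, or already assigned to us
            # Collision: derive the next candidate deterministically
            # by hashing the previous one.
            candidate = struct.unpack(
                '>Q',
                hashlib.md5(struct.pack('>Q', candidate)).digest()[:8])[0]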

Subkey type 1 is used to store the list of chunks that represent the
contents of the file. Since a file can be quite large, we store more
than one chunk-id per value, to reduce overhead a bit. Benchmarking
will determine the suitable number of chunk-ids to put in each value,
but the size of the value should probably not be larger than about a
quarter of a node. To avoid having to do a range lookup to find the
next counter when appending new chunk ids to an existing list, we store
the number of items at subkey type 2 (subkey 0).
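
Appending might then work like this sketch (`N` is the per-value
chunk-id count to be settled by benchmarking; all names are
illustrative):

    import struct

    N = 32  # chunk-ids per value; the real number comes from benchmarking

    def append_chunk_id(tree, file_id, chunk_id):
        # Read the current count from subkey type 2, subkey 0.
        count_key = struct.pack('>BQBQ', 0, file_id, 2, 0)
        raw = tree.lookup(count_key)
        count = struct.unpack('>Q', raw)[0] if raw is not None else 0
        # The new id goes into value number count // N.
        list_key = struct.pack('>BQBQ', 0, file_id, 1, count // N)
        value = tree.lookup(list_key) or b''
        tree.insert(list_key, value + struct.pack('>Q', chunk_id))
        tree.insert(count_key, struct.pack('>Q', count + 1))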

Subkey type 3 is used for file metadata, such as inode data
(`st_mtime`, `st_size`, etc.), the name of the owner and group of the
file, xattr fields, and so on. To reduce space use, the most common
metadata is stored at subkey 0, encoded in a special format. Other
subkeys are used for additional metadata, which also allows for future
expansion.

**For prefix 1,** the rest of the key is used like this:

* the main key is the chunk-id
* the subkey type is 0
* the subkey is the file-id of a file that uses the chunk

For prefix 1, the value is always empty. Whenever Obnam backs up a
file, it adds the key (1, chunk-id, 0, file-id). If the file existed
in the previous generation, but not in the new generation, it removes
the key.

**For prefix 2,** the main key is fixed as the hash of the string
"generation", and the subkey type is fixed at 0. The subkey is one of
the values in the following table.

[[!table data="""
subkey |value
0 |generation id
1 |timestamp for the start of the generation
2 |timestamp for the end of the generation
3 |boolean (0 or 1) for whether the generation is a checkpoint
"""]]

Timestamps are expressed in UTC as seconds since the beginning of 1970
(i.e., the Unix epoch).

Chunks, part 2
--------------

When backing up, Obnam will do roughly this (a sketch in code follows
the list):

* clone the tree for the previous generation
* traverse the filesystem tree, adding and removing files from the new tree
* when backing up a file's chunks, look up each chunk in the store, and add it
  if it is missing; after this, the id of the chunk is known
* add (1, chunk-id, 0, file-id) to the per-client forest,
  (chunk-id) to the chunk list forest,
  and (checksum, chunk-id, client-id) to the chunk checksum forest
  for any new chunks it uses; the updates to the chunk list and checksum
  forests can be batched to the end of the backup run
* remove (1, chunk-id, 0, file-id) from the per-client forest,
  for every chunk of any file it removes from the new generation
  (there is no need to remove anything from the chunk checksum forest,
  since previous generations will still use the chunks)
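
Compressed into code, the flow might look like this; every helper
named here is an assumption standing in for real Obnam machinery:

    def backup(forests, filesystem, prev_gen_id):
        # Clone the previous generation's tree; copy-on-write makes
        # this cheap, and the two trees share unchanged nodes.
        gen = forests.per_client.clone_tree(prev_gen_id)
        spooled = []  # shared-forest updates, batched to the end
        for path, chunks in filesystem.changed_files():
            file_id = gen.add_file(path)
            for data in chunks:
                chunk_id = forests.find_identical_chunk(data)
                if chunk_id is None:
                    chunk_id = forests.upload_chunk(data)
                    spooled.append((data, chunk_id))
                gen.add_chunk_ref(chunk_id, file_id)
        for path in filesystem.removed_files():
            for chunk_id in gen.chunk_ids_of(path):
                gen.remove_chunk_ref(chunk_id, path)
            gen.remove_file(path)
        # Update the chunk list and chunk checksum forests in one
        # batch, keeping their write locks as short as possible.
        forests.update_shared_forests(spooled)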

When removing a generation, Obnam will do roughly this (again, a
sketch follows the list):

* look up every chunk-id used in that generation
* look up each chunk-id in every other generation
* if a chunk-id is not found in any other generation, remove
  (checksum, chunk-id, client-id) from the chunk checksum forest,
  looking the checksum up in the chunk list forest
* if there are no keys left in the range (checksum, chunk-id, 0) through
  (checksum, chunk-id, client_id_max) in the chunk checksum forest,
  then remove (chunk-id) from the chunk list forest, and the
  chunk file from disk
* remove the tree for the unwanted generation
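
As code, under the same assumptions about helpers as above:

    def remove_generation(forests, client_id, gen):
        other_gens = [g for g in forests.per_client.generations()
                      if g is not gen]
        for chunk_id in gen.chunk_ids():
            if any(g.uses_chunk(chunk_id) for g in other_gens):
                continue  # still referenced by another generation
            checksum = forests.chunk_list.lookup(chunk_id)
            forests.chunk_checksums.remove(checksum, chunk_id, client_id)
            if not forests.chunk_checksums.used_by_any_client(checksum,
                                                              chunk_id):
                # No client uses this chunk any more: drop it entirely.
                forests.chunk_list.remove(chunk_id)
                forests.delete_chunk_file(chunk_id)
        forests.per_client.remove_tree(gen)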


Repository metadata
-------------------

Some metadata about the repository needs to be stored. This is done in
the `metadata` directory. Initially, the only piece of information is
the version of the on-disk format, expressed as a single integer,
stored in the file `metadata/format`.

Each version of Obnam knows which on-disk formats it supports. With the
`metadata/format` file it knows whether it can handle a particular
repository.
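
A sketch of the check; the set of supported formats is illustrative:

    import os

    SUPPORTED_FORMATS = set([6])  # illustrative; each release has its own set

    def check_repository_format(repo_dir):
        with open(os.path.join(repo_dir, 'metadata', 'format')) as f:
            version = int(f.read().strip())
        if version not in SUPPORTED_FORMATS:
            raise SystemExit('unsupported repository format %d' % version)
        return version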

Upgrades to newer formats will happen explicitly, not implicitly.


Encryption
----------

Obnam may optionally encrypt data in the backup store. The design for
this is unclear, but will need to solve at least the following
problems:

* each client should have access only to its own data, but still
  allow sharing of data across clients
* the server admin should be able to disable access for each client,
  or for anyone who has compromised a client's encryption key
* the server admin should be able to do some operations on any client,
  such as expire generations, remove entire clients and their
  corresponding chunks, verify the correctness of the whole data store
  ("fsck"), or restore data after the client has been entirely lost


On checksum algorithms
----------------------

This design uses MD5 checksums for everything. It does not require the
checksum algorithm to be cryptographically secure: it is content with a
hash that is unlikely to result in collisions, and cares about the
probability of collisions only for performance, not correctness. The
design handles collisions, and does not assume two pieces of data are
identical just because their hashes are the same. A shorter or faster
checksum algorithm, or a cryptographically stronger one, may be used
instead. MD5 is not recommended for applications that rely on the
security of the hash, but this design does not rely on that.

However, it may be quite expensive to verify that data chunks with the
same checksum are in fact the same, since it may require reading the
chunk back from the server. This may not be acceptable to all users.
Some users may be willing to live with the chance of a checksum
collision; others might not want to pay the price of de-duplication at
all. Thus, it may be best for Obnam to provide a tri-state option: full
verification, rely on checksums, or no de-duplication. Further, it may
be a good idea to allow the user to specify the checksum algorithm.



Comments?
---------

Edit this page directly, or mail me at <mailto:liw@liw.fi>. Thanks.


Author
------

Obnam is being developed by [Lars Wirzenius](http://liw.fi). Feedback
and patches are most welcome.


References
----------

1. Rodeh, Ohad: "B-trees, Shadowing, and Clones". IBM Haifa Research Labs.
   <http://portal.acm.org/citation.cfm?id=1326544>
   <http://www.cs.tau.ac.il/~ohadrode/papers/btree_TOS.pdf>


Changes
-------

* 2010-05-04: Modified the way hash collisions for filenames are
  handled, based on suggestion from Richard Braakman: store secondary
  hash and filename with primary hash.
* 2010-07-11: Minor tweaks.
* 2010-11-11: Change tense to current, add some notes and minor fixes.
* 2010-11-11: Add design for removal of chunks, some considerations for
  encryption, more details on how keys are used by various forests in
  the store, locking, and other clarifications.
* 2010-11-11: Refer to hosts as clients, for clarity.
* 2010-11-12: Typo fixes and checksum verification tri-state option
  suggestion from Richard Braakman.
* 2010-11-18: Add generation metadata to per-client forest.
* 2010-11-29: Add chunk list forest.