[[!meta title="Repository format 6"]]

NOTE: This was originally written as the sole description of Obnam's
on-disk repository storage. At the time, there was no other repository
format, so this text does not separate what is generic from what is
specific to this format. This page needs to be updated to discuss the
specific parts only. For the generic parts, see [[ondisk]].

[[!toc startlevel=2 levels=2]]

Obnam is a backup tool for making **backups to hard disks,** either
local ones or over the network. It does not work with tape drives
or optical disks. The **backup server is stupid:** it does not run
any Obnam-specific code, only ssh (sftp).

The location where backups are stored is called the **backup store**,
and may be shared between multiple co-operating clients. The backup
store resides on a backup server. The computers that own the data
being backed up are the **clients**.

Performance is obviously a consideration. It has many aspects:

* run time
* RAM
* disk space, both server and client
* disk bandwidth, both server and client
* network bandwidth, between server and client

To achieve its performance goals, Obnam uses the B-tree variant
devised by Ohad Rodeh[1], also used by btrfs. This document describes
how it does that.


Characteristics of the B-tree
-----------------------------

The **Rodeh B-trees** are designed for shadowing, i.e.,
**copy-on-write updates** of tree nodes. Nodes are never modified in
place; instead, a new copy is created to replace them. This makes it
efficient to copy a tree and update the copy, while keeping the
original tree intact. The two trees will share as many nodes as
possible. Obnam calls a collection of related trees a **forest.**

The trees use **fixed-length binary strings as keys.** For each key, a
**variable-length binary string is stored as the value.** The tree
consists of interior nodes and leaf nodes; all values are stored in
leaves only. There is a maximum size for nodes, which in some
applications would be a raw disk block; in Obnam, each node is a
separate file. Value strings can be as big as a node (minus a small
header), but no bigger.

Lookups and removals from the tree can be done using **ranges:** all
keys within a range are looked up, or removed from the tree. This is
an important optimization and opens up some interesting design
possibilities.


File data (chunk) storage
-------------------------

File data is not stored directly in the B-trees, but externally. Data
is broken up into **chunks** of a suitable size (say, 64 KiB) and
assigned identifiers, which also act as filenames. Given a chunk id,
its contents can be easily retrieved.

**Chunks are shared** within and across clients. This allows clients
to avoid uploading data that another client has already uploaded, and
to avoid storing the same data twice. The same mechanism allows Obnam
to efficiently back up files that have moved, or that have been
modified, or both at the same time. If a chunk of data already exists
in the store, Obnam tries hard to avoid having to back it up again.

To back up a chunk, Obnam chooses a random, unused 64-bit identifier,
and uploads a file of that name. The next chunk uses the next
identifier, until it hits one that has already been used, at which
point it picks a new random one.
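The following is a minimal sketch of this allocation scheme;
`chunk_exists` is a hypothetical helper standing in for the real check
(over sftp) of whether a chunk file with that name already exists.

    import random

    MAX_CHUNK_ID = 2**64 - 1

    def pick_chunk_id(chunk_exists, previous_id=None):
        # Try the identifier after the previously used one first; if it
        # is already taken, fall back to picking random identifiers
        # until an unused one is found.
        if previous_id is not None:
            candidate = (previous_id + 1) & MAX_CHUNK_ID
            if not chunk_exists(candidate):
                return candidate
        while True:
            candidate = random.randint(0, MAX_CHUNK_ID)
            if not chunk_exists(candidate):
                return candidate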
Chunks are managed using reference counts. We will cover these later,
after discussing other details of the store; see the section
"Chunks, part 2" below.


Overview of store forests
-------------------------

The backup store consists of several forests:

* client list: maps client names to 64-bit identifiers
* chunk list: maps chunk identifiers to checksums
* chunk checksums: maps chunk checksums to identifiers
* per-client data: backup generations for each client


Locking
-------

Locking is done only for writes. Reads are always allowed, even while
write locks exist. This allows race conditions between readers and
writers, but thanks to copy-on-write updates these are no different
from files getting corrupted or deleted on the server by hardware
failures, and can be treated the same way.

Each forest is locked separately. If more than one forest needs to be
locked at the same time, they are locked in order sorted by forest
name, using the C locale. If a client fails to lock a particular
forest, it releases all the locks it holds, and tries again. This
avoids deadlocks, but allows starvation. The most likely cause of
starvation is too many clients sharing the same store, which needs to
be solved by increasing backup server performance, or by having more
stores.

Locks are implemented as files, which are created atomically. Each
lock file contains the name of the host that holds it (which might not
be a backup client), the process id on that host, and the time the
lock was created. If that time is very old, another client may decide
to break the lock. If the backup store is encrypted, the contents of
the lock file are encrypted as well.

To reduce lock contention, each client attempts to hold a lock for as
short a time as possible. For per-client data, this means keeping the
lock for the duration of the backup. For shared forests, updates can
be spooled: the shared forest is used read-only until the end of the
backup run, or until a checkpoint, and then updated as quickly as
possible.
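A minimal sketch of this locking protocol on a local filesystem
follows; over sftp the equivalent is an exclusive-create open. The
lock file layout shown (hostname, pid, and timestamp on separate
lines) and the helper names are assumptions, not the actual format.

    import os
    import socket
    import time

    def try_lock(lock_path):
        # Create the lock file atomically; O_EXCL makes creation fail
        # if the file already exists, i.e. someone else holds the lock.
        data = '%s\n%d\n%d\n' % (
            socket.gethostname(), os.getpid(), int(time.time()))
        try:
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except OSError:
            return False
        os.write(fd, data.encode('utf-8'))
        os.close(fd)
        return True

    def lock_forests(store_dir, forest_names):
        # Lock forests in sorted name order to avoid deadlocks. If any
        # lock cannot be taken, release everything; the caller retries.
        held = []
        for name in sorted(forest_names):
            path = os.path.join(store_dir, name, 'lock')
            if try_lock(path):
                held.append(path)
            else:
                for p in held:
                    os.remove(p)
                return False
        return True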
Client list forest
------------------

The client list forest consists of a single tree. Its key consists of
a tuple:

* 128 bits of MD5 checksum of the name of the client
* 64 bits of client id (for hash collision avoidance)

The client name is typically its host name, but might be anything. It
is up to the sysadmins of the clients to co-operate so that each
client has a unique name.

Checksum collisions for client names are unlikely, but this way there
is no need to worry about them at all. The client id is chosen
randomly.


Chunk list forest
-----------------

The chunk list forest consists of a single tree. Its key is a 64-bit
integer containing a chunk identifier. The value is the MD5 checksum
of the chunk.

This tree is needed so that the checksum of a chunk can be found
quickly, given its identifier, when removing generations from the
per-client forest.


Chunk checksum forest
---------------------

The chunk checksum forest uses a tuple as a key:

* 128-bit MD5 checksum of the data in the chunk
* 64-bit chunk id
* 64-bit client id

The chunk id is used to handle checksum collisions. While MD5
collisions are unlikely in general use, they are a certainty for
people who research cryptographic hashes, so Obnam needs to be able to
handle them.

When Obnam has a chunk it wants to back up, it does a range lookup
from (md5, 0) through (md5, chunk_id_max), and then checks whether the
chunk is identical to any of the chunks already in the store. This
check is expensive, but safety may be more important than performance
here. (Obnam may provide an option to disable the data comparison
check and rely only on the hashes.)

The value is empty.
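A minimal sketch of this de-duplication lookup follows;
`checksum_tree.lookup_range` and `get_chunk_data` are hypothetical
stand-ins for the real B-tree and chunk I/O interfaces, and the exact
byte layout of the key is an assumption.

    import hashlib
    import struct

    MAX_ID = 2**64 - 1

    def find_identical_chunk(checksum_tree, get_chunk_data, data):
        # Range lookup over all (md5, chunk-id, client-id) keys that
        # share this chunk's MD5 checksum.
        md5 = hashlib.md5(data).digest()
        lo = md5 + struct.pack('!QQ', 0, 0)
        hi = md5 + struct.pack('!QQ', MAX_ID, MAX_ID)
        for key, _ in checksum_tree.lookup_range(lo, hi):
            chunk_id = struct.unpack('!Q', key[16:24])[0]
            # The expensive but safe step: compare the actual data
            # instead of trusting the checksum alone.
            if get_chunk_data(chunk_id) == data:
                return chunk_id
        return None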
Per-client forest
-----------------

The per-client forest has one tree per backup generation or
checkpoint, and represents a snapshot of the filesystem at a specific
time, or as close to a snapshot as Obnam can make it. Obnam does not
freeze the filesystem while the backup runs, so it cannot guarantee a
true snapshot.

The forest uses a tuple as a key:

* 8-bit prefix
* 64-bit main key
* 8-bit subkey type
* 64-bit subkey

The prefix is one of the following:

* 0 for filesystem metadata
* 1 for chunk references
* 2 for generation metadata

**For prefix 0,** the main key is the file identifier. See below for
how file identifiers are generated.

The subkey type, subkey, and value are used as follows:

[[!table data="""
subkey type | subkey | value
0 | file-id | full pathname to file
1 | counter 0... | up to N chunk-ids
2 | 0 | number of chunk-ids stored
3 | 0... | inode fields, user/group name, xattr, ...
"""]]

Subkey type 0 is used to handle hash collisions for pathnames. They
are unlikely, but it is entirely unacceptable for Obnam to fail if
they happen. Each file has an identifier that is unique within the
generation: there is a one-to-one mapping between fully qualified
pathnames and file-ids within a generation (but not across
generations).

By default, the file-id is one half of the 128-bit MD5 of the full
pathname to the file. To check for collisions, we see if the value for
the key (0, default-file-id, 0, file-id) is the full pathname of the
file we are interested in. If it is not, we generate a new file-id by
some deterministic approach, do the lookup again, and repeat this
until we find a free file-id. Note that the default-file-id in the
lookup stays the same: it is always half of the MD5 of the pathname.

Subkey type 1 is used to store the list of chunks that represent the
contents of the file. Since a file can be quite large, we store more
than one chunk-id per value, to reduce the overhead a bit.
Benchmarking will determine the suitable number of chunk-ids to put in
each value, but the size of the value should probably not be larger
than about a quarter of a node. To avoid having to do a range lookup
to find the next counter when appending new chunk ids to an existing
list, we store the number of items at subkey type 2 (subkey 0).

Subkey type 3 is used for file metadata, such as inode data
(`st_mtime`, `st_size`, etc), the name of the owner and group of the
file, xattr fields, etc. To reduce space use, the most common metadata
is stored in subkey value 0, encoded in a special format. Other subkey
values are used for additional metadata, which also allows for future
expansion.

**For prefix 1,** the rest of the key is used like this:

* main key is the chunk-id
* subkey type is 0
* subkey is the file-id of a file that uses the chunk

For prefix 1, the value is always empty. Whenever Obnam backs up a
file, it adds the key (1, chunk-id, 0, file-id). If the file existed
in the previous generation, but not in the new generation, it removes
the key.

**For prefix 2,** the main key is fixed as the hash of the string
"generation", and the subkey type is fixed at 0. The subkey is one of
the values in the following table.

[[!table data="""
subkey | value
0 | generation id
1 | timestamp for the start of the generation
2 | timestamp for the end of the generation
3 | boolean (0 or 1) for whether the generation is a checkpoint
"""]]

Timestamps are expressed in UTC as seconds since the beginning of 1970
(i.e., the Unix epoch).


Chunks, part 2
--------------

When backing up, Obnam will do roughly this (a sketch follows the
list):

* clone the tree for the previous generation
* traverse the filesystem tree, adding and removing files from the new
  tree
* when backing up a file's chunks, look up each chunk in the store,
  and add it if it is missing; after this, the id of the chunk is
  known
* add (1, chunk-id, 0, file-id) to the per-client forest, (chunk-id)
  to the chunk list forest, and (checksum, chunk-id, client-id) to the
  chunk checksum forest for any new chunks it uses; the updates to the
  chunk list and checksum forests can be batched to the end of the
  backup run
* remove (1, chunk-id, 0, file-id) from the per-client forest for
  every chunk of any file it removes from the new generation (there is
  no need to remove anything from the chunk checksum forest, since
  previous generations still use the chunks)
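A rough sketch of that loop follows; all the interface names
(`clone_latest_tree`, `find_identical`, `fs.walk`, and so on) are
hypothetical, and the key byte layout is an assumption.

    import hashlib
    import struct

    def pack_key(prefix, main_key, subkey_type, subkey):
        # Per-client forest key: 8-bit prefix, 64-bit main key, 8-bit
        # subkey type, 64-bit subkey (byte layout assumed).
        return struct.pack('!BQBQ', prefix, main_key, subkey_type, subkey)

    def backup_run(fs, per_client, chunk_list, chunk_checksums,
                   chunk_store, client_id):
        gen = per_client.clone_latest_tree()    # copy-on-write clone
        new_chunks = []                         # spooled shared updates
        for path, metadata, chunks in fs.walk():
            file_id = gen.add_file(path, metadata)
            for data in chunks:
                chunk_id = chunk_store.find_identical(data)
                if chunk_id is None:            # not in the store yet
                    chunk_id = chunk_store.upload(data)
                    new_chunks.append(
                        (hashlib.md5(data).digest(), chunk_id))
                gen.insert(pack_key(1, chunk_id, 0, file_id), b'')
        for file_id in gen.files_gone_since_previous_generation(fs):
            for chunk_id in gen.chunk_ids(file_id):
                gen.remove(pack_key(1, chunk_id, 0, file_id))
            gen.remove_file(file_id)
        # The shared forests were used read-only above; apply the
        # spooled updates at the end of the run to keep write locks
        # short.
        for checksum, chunk_id in new_chunks:
            chunk_list.insert(chunk_id, checksum)
            chunk_checksums.insert((checksum, chunk_id, client_id), b'')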
When removing a generation, Obnam will do roughly this:

* look up every chunk-id used in that generation
* look up each chunk-id in every other generation
* if a chunk-id is not found in any other generation, remove
  (checksum, chunk-id, client-id) from the chunk checksum forest; the
  checksum is looked up from the chunk list forest
* if there are no keys in the range (checksum, chunk-id, 0) through
  (checksum, chunk-id, client_id_max) in the chunk checksum forest,
  then remove (chunk-id) from the chunk list forest, and the chunk
  file from disk
* remove the tree for the unwanted generation


Repository metadata
-------------------

There is a need to store some metadata about the repository. This is
done in the `metadata` directory. Initially, the only piece of
information is the version of the on-disk format, expressed as a
single integer, stored in the file `metadata/format`.

Each version of Obnam knows which on-disk formats it supports. With
the `metadata/format` file it knows whether it can handle a particular
format.

Upgrades to newer formats will happen explicitly, not implicitly.


Encryption
----------

Obnam may optionally encrypt data in the backup store. The design for
this is unclear, but will need to solve at least the following
problems:

* each client should have access only to its own data, but still allow
  sharing of data across clients
* the server admin should be able to disable access for each client,
  or for anyone who has compromised the client's encryption key
* the server admin should be able to do some operations on any client,
  such as expire generations, remove entire clients and their
  corresponding chunks, verify the correctness of the whole data store
  ("fsck"), or restore data after the client has been entirely lost


On checksum algorithms
----------------------

This design uses MD5 checksums for everything. It does not require the
checksum algorithm to be cryptographically secure; it is content with
a hash that is unlikely to result in collisions, and only cares about
the probability of collisions for performance, not correctness. The
design handles collisions, and does not assume two pieces of data are
identical just because their hashes are the same. A shorter or faster
checksum algorithm may be used instead, or a cryptographically
stronger one. MD5 is not recommended for applications that rely on the
security of the hash, but this design does not rely on that.

However, it may be quite expensive to verify that data chunks with the
same checksum are the same. It may require reading the chunk back from
the server. This may not be acceptable to all users. Some users may be
willing to live with the chance of a checksum collision. Other users
might not want to pay the price of de-duplication. Thus, it may be
best for Obnam to provide a tri-state option: full verification, rely
on checksums, and no de-duplication. Further, it may be a good idea to
allow the user to specify the checksum algorithm.


Comments?
---------

Edit this page directly, or mail me at <mailto:liw@liw.fi>. Thanks.


Author
------

Obnam is being developed by [Lars Wirzenius](http://liw.fi). Feedback
and patches are most welcome.


References
----------

1. Rodeh, Ohad: "B-trees, Shadowing, and Clones". IBM Haifa Research
   Labs.
   <http://portal.acm.org/citation.cfm?id=1326544>
   <http://www.cs.tau.ac.il/~ohadrode/papers/btree_TOS.pdf>


Changes
-------

* 2010-05-04: Modified the way hash collisions for filenames are
  handled, based on a suggestion from Richard Braakman: store
  secondary hash and filename with primary hash.
* 2010-07-11: Minor tweaks.
* 2010-11-11: Change tense to current, add some notes and minor fixes.
* 2010-11-11: Add design for removal of chunks, some considerations
  for encryption, more details on how keys are used by various forests
  in the store, locking, and other clarifications.
* 2010-11-11: Refer to hosts as clients, for clarity.
* 2010-11-12: Typo fixes and checksum verification tri-state option
  suggestion from Richard Braakman.
* 2010-11-18: Add generation metadata to per-client forest.
* 2010-11-29: Add chunk list forest.