[[!meta title="Repository format 6"]]

NOTE: This was originally written as the sole description of Obnam's on-disk repository storage. At the time, there was no other repository format, and so this text doesn't separate that which is generic from that which is specific. This page needs to be updated to discuss the specific parts only. For the generic parts, see [[ondisk]].

[[!toc startlevel=2 levels=2]]

Obnam is a backup tool for making **backups to hard disks,** either local ones or over the network. It does not work with tape drives or optical disks.

The **backup server is stupid:** it does not run any Obnam-specific code, only ssh (sftp). The location where backups are stored is called the **backup store**, and may be shared between multiple co-operating clients. The backup store resides on a backup server. The computers that own the data being backed up are the **clients**.

Performance is obviously a consideration. It has many aspects:

* run-time
* RAM
* disk space, both server and client
* disk bandwidth, both server and client
* network bandwidth, between server and client

To achieve its performance goals, Obnam uses the B-tree variant devised by Ohad Rodeh[1], also used by btrfs. This document describes how it does that.

Characteristics of the B-tree
-----------------------------

The **Rodeh B-trees** are designed for shadowing, i.e., **copy-on-write updates** of tree nodes. Nodes are not modified; instead, a new copy is created to replace the old one. This allows an efficient way to copy a tree and to update the copy, while keeping the original tree intact. The two trees will share as many nodes as possible. Obnam calls the collection of related trees a **forest.**

The trees use **fixed-length binary strings as keys.** For each key, a **variable-length binary string is stored as the value.** The tree consists of interior nodes and leaf nodes; all values are stored in leaves only. There is a maximum size for nodes, which in some applications would be a raw disk block, but is a separate file for Obnam. Value strings can be as big as a node (minus a small header), but no bigger.

Lookups and removals from the tree can be done using **ranges:** all keys within a range are looked up, or removed from the tree. This is an important optimization and opens up some interesting design possibilities.

File data (chunk) storage
-------------------------

File data is not stored directly in B-trees, but externally. Data is broken up into **chunks** of suitable size (say, 64 KiB) and assigned identifiers, which also act as filenames. Given a chunk id, its contents can be easily retrieved.

**Chunks are shared** within and across clients. This allows clients to avoid uploading data that another client has already uploaded, or storing the same data twice. The same mechanism allows Obnam to efficiently back up files that have moved, or that have been modified, or both at the same time. If a chunk of data already exists in the store, Obnam tries hard to avoid having to back it up again.

To back up a chunk, Obnam chooses a random, unused 64-bit identifier, and uploads a file of that name. The next chunk uses the next identifier, and so on, until one is hit that is already in use, at which point a new random identifier is picked.

Chunks are managed using reference counts. We will cover these later, after discussing other details of the store. See section "Chunks, part 2" below.

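Purely as an illustration, the identifier selection described above might look roughly like this in Python (Obnam's implementation language). The `chunk_exists()` callback is a hypothetical stand-in for checking whether a chunk file with a given id already exists in the store; it is not part of Obnam's actual API.

    import random

    CHUNK_ID_MAX = 2**64 - 1

    def pick_chunk_id(chunk_exists, previous_id=None):
        # Prefer the identifier right after the previously used one, so
        # that consecutive chunks usually get consecutive ids.
        if previous_id is not None and previous_id < CHUNK_ID_MAX:
            candidate = previous_id + 1
            if not chunk_exists(candidate):
                return candidate
        # Otherwise, or if that id is already taken, pick random ids
        # until an unused one is found.
        while True:
            candidate = random.randint(0, CHUNK_ID_MAX)
            if not chunk_exists(candidate):
                return candidate
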
Overview of store forests
-------------------------

The backup store consists of several forests:

* client list: map client names to 64-bit identifiers
* chunk list: map chunk identifiers to checksums
* chunk checksums: map chunk checksums to identifiers
* per-client data: backup generations for each client

Locking
-------

Locking is done only for writes. Reads are always allowed, even while write locks exist. This allows race conditions between readers and writers, but thanks to copy-on-write updates those are no different from files getting corrupted or deleted on the server by hardware failures, and can be treated the same.

Each forest is locked separately. If more than one forest needs to be locked at the same time, they are locked in an order sorted by the name of the forest, using the C locale. If a client fails to lock a particular forest, it releases all the locks it holds, and tries again. This avoids deadlocks, but allows starvation. The most likely reason for starvation is too many clients sharing the same store, and that needs to be solved by increasing backup server performance, or by having more stores.

Locks are implemented as files, which are created atomically. Each lock file contains the name of the host that holds it (which might not be a backup client), the process id on that host, and the time the lock was created. If the time is very old, another client may decide to break the lock. If the backup store is encrypted, the contents of the lock file are encrypted as well.

To reduce lock congestion, each client attempts to keep a lock for as short a time as possible. For per-client data, this means keeping the lock for the duration of the backup. For shared forests, updates can be spooled: the shared forest is used read-only until the end of the backup run, or until a checkpoint, and updated then, as quickly as possible.

Client list forest
------------------

The client list forest consists of a single tree. Its key consists of a tuple:

* 128 bits of MD5 checksum of the name of the client
* 64 bits of client id (for hash collision avoidance)

The client name is typically its host name, but might be anything. It is up to the sysadmins of the clients to co-operate so that each client has a unique name. Checksum collisions for client names are unlikely, but handling them this way means they never need to be worried about. The client id is chosen randomly.

Chunk list forest
-----------------

The chunk list forest consists of a single tree. Its key consists of a 64-bit integer containing a chunk identifier. The value is the MD5 checksum for the chunk. This tree is needed so that it is possible to quickly find the checksum of a chunk, given its identifier, when removing generations from the per-client forest.

Chunk checksum forest
---------------------

The chunk checksum forest uses a tuple as a key:

* 128-bit MD5 checksum of the data in the chunk
* 64-bit chunk id
* 64-bit client id

The chunk id is used to handle checksum collisions. While MD5 collisions are not likely in general use, they are certain for people who research cryptographic hashes, so Obnam needs to be able to handle them. When Obnam has a chunk it wants to back up, it does a range lookup from (md5, 0) through (md5, chunk_id_max), and then checks whether the chunk is identical to any of the chunks already in the store. This checking is expensive, but safety may be more important than performance here. (Obnam may provide an option to disable the data comparison check, and rely only on the hashes.)

The value is empty.

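To make the range lookup concrete, here is a minimal sketch in Python. The `checksum_tree.lookup_range()` and `read_chunk()` calls are hypothetical stand-ins for the B-tree range lookup and for reading a chunk file back from the store; they are not Obnam's actual API.

    import hashlib

    CHUNK_ID_MAX = 2**64 - 1
    CLIENT_ID_MAX = 2**64 - 1

    def find_existing_chunk(checksum_tree, read_chunk, data):
        # Look up every (md5, chunk-id, client-id) key whose checksum
        # matches this chunk, then compare contents to rule out a
        # checksum collision.
        md5 = hashlib.md5(data).digest()
        for _md5, chunk_id, _client_id in checksum_tree.lookup_range(
                (md5, 0, 0), (md5, CHUNK_ID_MAX, CLIENT_ID_MAX)):
            if read_chunk(chunk_id) == data:
                return chunk_id
        return None  # no identical chunk in the store; upload a new one
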
Per-client forest
-----------------

The per-client forest has one tree per backup generation or checkpoint. Each tree represents a snapshot of the filesystem at a specific time, or as close to a snapshot as Obnam can make it. Obnam does not freeze the filesystem while the backup runs, so it cannot guarantee a true snapshot.

The forest uses a tuple as a key:

* 8-bit prefix
* 64-bit main key
* 8-bit subkey type
* 64-bit subkey

The prefix is one of the following:

* 0 for filesystem metadata
* 1 for chunk references
* 2 for generation metadata

**For prefix 0,** the main key is the file identifier. See below for how they are generated. The subkey type, subkey, and value are used as follows:

[[!table data="""
subkey type | subkey       | value
0           | file-id      | full pathname to file
1           | counter 0... | up to N chunk-ids
2           | 0            | number of chunk-ids stored
3           | 0...         | inode fields, user/group name, xattr, ...
"""]]

Subkey type 0 is used to handle hash collisions for the pathnames. They are unlikely, but it is entirely unacceptable for Obnam to fail if they happen. Each file has an identifier that is unique within the generation: there is a one-to-one mapping between fully qualified pathnames and file-ids within a generation (but not across generations). By default, the file-id is one half of the 128-bit MD5 of the full pathname to the file. To check for collisions, we see if the value for key (0, default-file-id, 0, file-id) is the full pathname for the file we are interested in. If it is not, we generate a new file-id by some deterministic approach, do the lookup again, and repeat this until we find a free file-id. Note that the default-file-id in the lookup stays the same: it will always be the half of the MD5 of the pathname.

Subkey type 1 is used to store a list of chunks that represent the contents of the file. Since a file can be quite large, we store more than one chunk-id per value, to reduce the overhead a bit. Benchmarking will determine the suitable number of chunk-ids to put in each value, but probably the size of the value should not be larger than about a quarter of a node. To avoid having to do a range lookup to find the next counter when appending new chunk ids to an existing list, we store the number of items at subkey type 2 (subkey 0).

Subkey type 3 is used for file metadata, such as inode data (`st_mtime`, `st_size`, etc), the name of the owner and group for the file, xattr fields, etc. To reduce space use, the most common metadata is stored in subkey value 0, encoded in a special format. Other subkey values are used for additional metadata, which also allows for future expansion.

**For prefix 1,** the rest of the key is used like this:

* main key is the chunk-id
* subkey type is 0
* subkey is the file-id for a file that uses the chunk

For prefix 1, the value is always empty. Whenever Obnam backs up a file, it adds the key (1, chunk-id, 0, file-id). If the file existed in the previous generation, but not in the new generation, it removes the key.

**For prefix 2,** the main key is fixed as the hash of the string "generation", and the subkey type is fixed at 0. The subkey is one of the values in the following table.

[[!table data="""
subkey | value
0      | generation id
1      | timestamp for the start of the generation
2      | timestamp for the end of the generation
3      | boolean (0 or 1) for whether the generation is a checkpoint
"""]]

Timestamps are expressed in UTC as seconds since the beginning of 1970 (i.e., the Unix epoch).

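The byte-level layout of the per-client key is not specified in this document. Purely as an illustration of how such a fixed-length key could be built, here is a sketch; the big-endian layout is an assumption for the example, not necessarily what Obnam actually uses.

    import struct

    def pack_key(prefix, main_key, subkey_type, subkey):
        # 8-bit prefix, 64-bit main key, 8-bit subkey type, 64-bit subkey.
        # Big-endian packing makes the lexicographic order of the packed
        # keys match the numeric order of the fields, which is what the
        # B-tree range lookups rely on.
        return struct.pack('>BQBQ', prefix, main_key, subkey_type, subkey)

    # For example, a chunk reference key (1, chunk-id, 0, file-id):
    key = pack_key(1, 0x1234, 0, 0xABCD)
    assert len(key) == 18
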
Chunks, part 2
--------------

When backing up, Obnam will do roughly this:

* clone the tree for the previous generation
* traverse the filesystem tree, adding and removing files from the new tree
* when backing up a file's chunks, look up each chunk in the store, and add it if it is missing; after this, the id of the chunk is known
* add (1, chunk-id, 0, file-id) to the per-client forest, (chunk-id) to the chunk list forest, and (checksum, chunk-id, client-id) to the chunk checksum forest for any new chunks it uses; the updates to the chunk list and checksum forests can be batched to the end of the backup run
* remove (1, chunk-id, 0, file-id) from the per-client forest, for every chunk of any file it removes from the new generation (there is no need to remove anything from the chunk checksum forest, since previous generations will still use the chunks)

When removing a generation, Obnam will do roughly this:

* look up every chunk-id used in that generation
* look up each chunk-id in every other generation
* if the chunk-id is not found in any other generation, remove (checksum, chunk-id, client-id) from the chunk checksum forest (the checksum is looked up from the chunk list forest)
* if there are no keys in the range (checksum, chunk-id, 0) through (checksum, chunk-id, client_id_max) in the chunk checksum forest, then remove (chunk-id) from the chunk list forest, and the chunk file from disk
* remove the tree for the unwanted generation

Repository metadata
-------------------

There is a need to store some metadata about the repository. This will be done in the `metadata` directory. Initially, the only piece of information is the version of the on-disk format, expressed as a single integer, stored in the file `metadata/format`.

Each version of Obnam will know which on-disk formats it supports. With the `metadata/format` file it knows if it can handle a particular format. Upgrades to newer formats will happen explicitly, not implicitly.

Encryption
----------

Obnam may optionally encrypt data in the backup store. The design for this is unclear, but will need to solve at least the following problems:

* each client should have access only to its own data, but still allow sharing of data across clients
* the server admin should be able to disable access for each client, or for anyone who has compromised the client's encryption key
* the server admin should be able to do some operations on any client, such as expire generations, remove entire clients and their corresponding chunks, verify correctness of the whole data store ("fsck"), or restore data after the client has been entirely lost

On checksum algorithms
----------------------

This design uses MD5 checksums for everything. It does not require the checksum algorithm to be cryptographically secure; it is content with a hash that is unlikely to result in collisions, and cares about the probability of collisions only for performance, not correctness. The design handles collisions, and does not assume two pieces of data are identical just because their hashes are the same.

A shorter or faster checksum algorithm may be used instead. Or a cryptographically stronger one. MD5 is not recommended for applications that rely on the security of the hash, but this design does not rely on that.

However, it may be quite expensive to verify that data chunks with the same checksum are the same. It may require reading back the chunk from the server. This may not be acceptable to all users. Some users may be willing to live with the chance of a checksum collision. Other users might not want to pay the price for de-duplication at all. Thus, it may be best for Obnam to provide a tri-state option: full verification, rely on checksums, or no de-duplication. Further, it may be a good idea to allow the user to specify the checksum algorithm.

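As a sketch of how such a tri-state option might behave, here is one possible decision function. The mode names, the `matching_chunk_ids` input, and the `read_chunk()` helper are all invented for this example; they are assumptions, not part of the actual design.

    # Hypothetical mode names for the tri-state option.
    MODE_VERIFY = 'verify'      # full verification: compare chunk contents
    MODE_CHECKSUM = 'checksum'  # rely on checksums alone
    MODE_NO_DEDUP = 'none'      # no de-duplication at all

    def reusable_chunk(mode, data, matching_chunk_ids, read_chunk):
        # matching_chunk_ids are ids of stored chunks whose checksum
        # equals that of data; read_chunk reads a chunk back from the
        # store. Returns the id of a reusable chunk, or None.
        if mode == MODE_NO_DEDUP or not matching_chunk_ids:
            return None
        if mode == MODE_CHECKSUM:
            # Trust the checksum: any matching chunk will do.
            return matching_chunk_ids[0]
        for chunk_id in matching_chunk_ids:  # MODE_VERIFY
            if read_chunk(chunk_id) == data:
                return chunk_id
        return None
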
Comments?
---------

Edit this page directly, or mail me at . Thanks.

Author
------

Obnam is being developed by [Lars Wirzenius](http://liw.fi). Feedback and patches are most welcome.

References
----------

1. Rodeh, Ohad: "B-trees, Shadowing, and Clones". IBM Haifa Research Labs.

Changes
-------

* 2010-05-04: Modified the way hash collisions for filenames are handled, based on a suggestion from Richard Braakman: store the secondary hash and filename with the primary hash.
* 2010-07-11: Minor tweaks.
* 2010-11-11: Change tense to current, add some notes and minor fixes.
* 2010-11-11: Add design for removal of chunks, some considerations for encryption, more details on how keys are used by various forests in the store, locking, and other clarifications.
* 2010-11-11: Refer to hosts as clients, for clarity.
* 2010-11-12: Typo fixes and checksum verification tri-state option suggestion from Richard Braakman.
* 2010-11-18: Add generation metadata to per-client forest.
* 2010-11-29: Add chunk list forest.