[[!meta title="Obnam on-disk data structures"]]

Introduction
============

This page gives a high-level picture of the Obnam repository data
structures and of the internal abstraction for handling them. It then
links to more detailed descriptions of each repository format.

[[!toc levels=2]]

Constraints and assumptions
===========================

The repository design is influenced by several constraints and
assumptions, some of which are described here.

Disk-like storage
-----------------

Storage is assumed to be disk-like, meaning any piece of data can be
accessed at about the same speed, rather than tape-like, where you can
in practice only append data and seeks are impractical.

Shared storage
--------------

Multiple backup clients may share storage, for lower cost or easier
administration.

Dumb storage
------------

The most important assumption is that the repository storage is dumb:
it can't do any processing of data. We must assume the repository
provides only the following operations:

* PUT some data in storage under a given filename, which may include a
  hierarchical directory path. This is **atomic**, meaning the data is
  either completely available, or does not exist at all, under the
  given name. Stored data may be replaced completely, but it can't be
  updated only partly. PUT may optionally fail if the name is already
  in use.
* GET named data from storage.
* LIST all data in storage in a given directory.
* DELETE named data from storage.

We can't, for example, request that the storage compute a checksum of
some data. In particular, we can't assume Obnam itself is running on
the machine providing the storage.

Storage filesystem
------------------

The storage may be provided by any existing filesystem, including
VFAT, or it might not be provided by a real filesystem at all. This
means we can't use neat tricks like hard links to implement the
repository.

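
The four operations listed under "Dumb storage" can be sketched as a
small Python class. This is only an illustration backed by a local
directory, not Obnam's actual VFS code: the class name `DumbStorage`,
its method names, and the `overwrite` parameter are all hypothetical.
Atomic PUT is approximated by writing to a temporary file and renaming
it into place.

```python
import os
import tempfile


class DumbStorage:
    """Hypothetical stand-in for dumb storage, backed by a local directory."""

    def __init__(self, root):
        self.root = root

    def _path(self, name):
        # Names use "/" as the separator, whatever the local platform uses.
        return os.path.join(self.root, *name.split("/"))

    def put(self, name, data, overwrite=True):
        """Store data under name; readers see all of it or none of it."""
        path = self._path(name)
        dirname = os.path.dirname(path)
        os.makedirs(dirname, exist_ok=True)
        # Simplification: this check is not race-free between clients.
        if not overwrite and os.path.exists(path):
            raise FileExistsError(name)
        # Write to a temporary file first, then rename it into place.
        fd, tmp = tempfile.mkstemp(dir=dirname)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        except BaseException:
            if os.path.exists(tmp):
                os.remove(tmp)
            raise

    def get(self, name):
        """Return the complete data stored under name."""
        with open(self._path(name), "rb") as f:
            return f.read()

    def list_dir(self, dirname):
        """Return the names stored directly in dirname."""
        return sorted(os.listdir(self._path(dirname)))

    def delete(self, name):
        """Remove the data stored under name."""
        os.remove(self._path(name))
```

Note what the class does not offer: there is no partial update, no
server-side checksum, and no hard link operation. Anything beyond these
four calls has to be done by the client.
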

Round trip time
---------------

The storage may be a local filesystem, but it may also be accessed
over the network using a protocol such as SFTP. This means every
storage access potentially carries a large time overhead. Minimising
the number of separate accesses is necessary for good performance.

Security and privacy
--------------------

The repository design must assume an attacker has at least read-only
access to the repository. This means the design should avoid leaking
information via filenames and similar metadata. Some data leak is
unavoidable: for example, an attacker can always keep track of which
files in the repository were changed when.

Repository structure
====================

The repository is divided into four kinds of areas:

* A list of clients.
* Chunks of file content.
* Indexes to find chunks efficiently.
* Per-client data, such as what files each client has.

These areas are mostly independent of each other. They refer to
objects using identifiers: clients and chunks have identifiers that
are random 64-bit integers, to avoid data leaks.

Client list
-----------

Each backup client needs to add itself to the client list. The client
list maps the client name to the client identifier, and also lists the
client's encryption key.

Chunks of file content
----------------------

Files vary a lot in size, so Obnam breaks file content into suitably
small pieces, called chunks. Chunks can be re-used between files and
clients, for de-duplication. Chunks may be of variable size. Chunks
are accessed using their identifier only.

Chunk indexes
-------------

For de-duplication, it is necessary to know if a given piece of file
content data is already stored in the repository. Chunk indexes
provide a mapping from a checksum value to the list of identifiers of
the chunks whose content has that checksum.

Additionally, when removing backup generations (`obnam forget`), it is
necessary to know which clients are using a given chunk.
Chunk indexes also provide a mapping from a chunk identifier to a list
of client identifiers.

Per-client data
---------------

Each client has its own files, and manages its own backup history. For
each client the repository has its own area, where the client stores:

* each backup generation
* all the names and metadata (`stat`(2) results, etc.) for each file
* possibly some other data that is only relevant for that client

Feature implementation
======================

Some of the headline features of Obnam need to be implementable using
the repository design. This section describes how that happens.

De-duplication
--------------

De-duplication is implemented by the backup process reading a file and
splitting the content up into chunks, using whatever chunking method
it chooses. It then looks up each chunk's checksum in the chunk
indexes to see if the chunk content is already in the repository. If
there are chunks with the same checksum, the backup process can simply
re-use them, on the assumption that the checksum is strong enough and
that there are no collisions. Alternatively, it can download each of
the candidate chunks from the repository and compare the data bit by
bit, to verify a match. The latter is quite expensive in time and
bandwidth, but necessary for those who can't rely on checksums, such
as those researching checksum collisions.

Encryption
----------

Storage is dumb, and so it doesn't encrypt data itself, or at least
Obnam can't assume it does. Further, storage may be controlled by
someone else, such as an online storage provider, and while they may
be trusted to store data, they can't be trusted not to leak it. Thus,
Obnam encrypts data before putting it in storage (assuming the user
enables encryption).

Encryption is done by combining symmetric and asymmetric (public key)
encryption. For each area (client list, chunks, chunk indexes,
per-client), a random symmetric encryption key is created, and all
other files in the area are encrypted using that key.
The symmetric key itself is encrypted using a list of public keys,
which are provided by the user. The public keys are stored in a file
in the area, and that file is encrypted using the symmetric key. This
way, to access the list of public keys, you must be able to access the
symmetric key, and you can only do that if you hold the private key
matching one of the public keys, or if you can break the encryption.

Internal implementation details
===============================

Internally, in the Obnam code base, all repository formats are
accessed using the same interface. This is so that Obnam can support
any number of repository formats without excessive code complexity. In
the code, this is in the `obnamlib/repo_interface.py` file. For
details, [see that file](http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/tree/obnamlib/repo_interface.py).

Additionally, all live data and repository storage is accessed using a
filesystem abstraction, called the VFS interface. This interface is
implemented separately for local filesystems and for SFTP servers,
which also enables Obnam to access live data over SFTP. In the source
tree, see `obnamlib/vfs.py` for the interface. Additional storage
providers can be added by implementing the interface for them.

Specific repository formats
===========================

Obnam implements several repository formats. Each format implements
the abstract features described above in some concrete way. Depending
on the quality of the implementation, the resulting format can work
better or worse.

* [[Format **6**|format-6]] was introduced prior to version 1.0, and
  is currently the main format. It is intended for real use.
* Format [[Green Albatross|format-green-albatross]] is an experimental
  format, intended to eventually replace format 6.
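
To make the de-duplication flow from the "Feature implementation"
section concrete, here is a minimal in-memory sketch of the two chunk
index mappings (checksum to chunk identifiers, and chunk identifier to
client identifiers). It does not reflect any real repository format.
The names `ChunkStore`, `put_chunk`, `backup_file`, and `verify` are
hypothetical, SHA-256 is an assumed checksum, and fixed-size chunking
is chosen only for brevity.

```python
import hashlib
import secrets


class ChunkStore:
    """Hypothetical in-memory model of the chunk and chunk index areas."""

    def __init__(self):
        self.chunks = {}       # chunk id -> chunk content
        self.by_checksum = {}  # checksum -> list of chunk ids
        self.used_by = {}      # chunk id -> set of client ids

    def put_chunk(self, client_id, data, verify=False):
        """Store data for a client, re-using an existing chunk if possible."""
        checksum = hashlib.sha256(data).hexdigest()  # assumed algorithm
        for chunk_id in self.by_checksum.get(checksum, []):
            # With verify=True, trust the match only after comparing content.
            if not verify or self.chunks[chunk_id] == data:
                self.used_by[chunk_id].add(client_id)
                return chunk_id
        # No usable match: store a new chunk under a random 64-bit id.
        chunk_id = secrets.randbits(64)
        while chunk_id in self.chunks:
            chunk_id = secrets.randbits(64)
        self.chunks[chunk_id] = data
        self.by_checksum.setdefault(checksum, []).append(chunk_id)
        self.used_by[chunk_id] = {client_id}
        return chunk_id


def backup_file(store, client_id, content, chunk_size=4):
    """Split content into fixed-size chunks and return their chunk ids."""
    return [
        store.put_chunk(client_id, content[i:i + chunk_size])
        for i in range(0, len(content), chunk_size)
    ]
```

Backing up `b"aaaabbbbaaaa"` with the default chunk size stores only
two distinct chunks, because the first and third pieces share a
checksum. The `used_by` mapping is what an operation like
`obnam forget` would need to consult before deleting a chunk.
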