Diffstat (limited to 'ondisk.mdwn')
-rw-r--r-- | ondisk.mdwn | 212 |
1 files changed, 0 insertions, 212 deletions
diff --git a/ondisk.mdwn b/ondisk.mdwn
deleted file mode 100644
index 1e1afca..0000000
--- a/ondisk.mdwn
+++ /dev/null
@@ -1,212 +0,0 @@
-[[!meta title="Obnam on-disk data structures"]]
-
-Introduction
-============
-
-This page gives a high abstraction level picture of what the Obnam
-repository data structures are like, and the internal abstraction for
-handling them. It then links to more detailed descriptions of each
-different repository format.
-
-[[!toc levels=2]]
-
-
-Constraints and assumptions
-===========
-
-The repository design is influenced by several constraints and
-assumptions, some of which are described here.
-
-Disk-like storage
------------------
-
-Storage is assumed to be disk-like, meaning any piece of data can be
-accessed at about the same speed, rather than tape-like, where you'd
-have to basically only append data, and where seeks are not practical.
-
-Shared storage
---------------
-
-Multiple backup clients may share storage, for lower cost or easier
-administration.
-
-Dumb storage
-------------
-
-The most important assumption is that the repository storage is
-dumb: it can't do any processing of data. Basically we must assume the
-repository provides only the following operations:
-
-* PUT some data in storage under a given filename, including a
-  hierarchical directory structure. This is **atomic**, meaning the
-  data is either completely available, or does not exist at all, under
-  the given name. Stored data may be replaced completely, but it can't
-  be updated only partly. PUT may optionally fail, if the name is
-  already in use.
-* GET named data from storage.
-* LIST all data in storage in a given directory.
-* DELETE named data from storage.
-
-We can't, for example, request that the storage compute a checksum of
-some data. We especially can't assume Obnam itself is running on the
-machine providing the storage.
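
The four operations listed under "Dumb storage" can be sketched as a small in-memory class. This is only an illustration of the assumed contract, not Obnam's actual code: the class name `DumbStorage`, its method names, and the `overwrite` parameter are all hypothetical.

```python
import posixpath


class DumbStorage:
    """Hypothetical in-memory stand-in for storage offering only PUT, GET, LIST, DELETE."""

    def __init__(self):
        self._files = {}  # full filename -> bytes

    def put(self, name, data, overwrite=True):
        # Whole-object write: the name maps to all of the data or to nothing.
        # With overwrite=False, PUT fails if the name is already in use.
        if not overwrite and name in self._files:
            raise FileExistsError(name)
        self._files[name] = bytes(data)

    def get(self, name):
        # Return the complete data stored under the given name.
        return self._files[name]

    def list_dir(self, dirname):
        # Return names stored directly in dirname (subdirectories are not descended).
        return sorted(n for n in self._files if posixpath.dirname(n) == dirname)

    def delete(self, name):
        # Remove the named data; raises KeyError if it does not exist.
        del self._files[name]
```

Note what is absent: there is no method for partial updates and none for computing anything on the storage side, which is the point of the "dumb storage" assumption.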
-
-Storage filesystem
-------------------
-
-The storage may be provided by any existing filesystem, including
-VFAT, or it might not be provided by a real filesystem at all. We
-can't use neat tricks like hard links to implement the repository.
-
-Round trip time
----------------
-
-The storage may be a local filesystem, but it may also be accessed over
-the network using some protocol such as SFTP. This means every storage
-access potentially carries a large time overhead. Minimising the
-number of separate accesses is necessary for good performance.
-
-Security and privacy
---------
-
-The repository design must assume an attacker has at least read-only
-access to the repository. This means the design should avoid leaking
-information via filenames, or other such things. Some data leak is
-unavoidable: it is, for example, unavoidable that an attacker can keep
-track of which files were changed when.
-
-
-Repository structure
-====================
-
-The repository is divided into four kinds of areas:
-
-* A list of clients.
-* Chunks of file content.
-* Indexes to find chunks efficiently.
-* Per-client data, such as what files each client has.
-
-These areas are mostly independent of each other. They refer to
-objects using identifiers: clients and chunks have identifiers that
-are random 64-bit integers, to avoid data leaks.
-
-
-Client list
------------
-
-Each backup client needs to add itself to the client list. The client
-list maps the client name to the client identifier, and also lists the
-client's encryption key.
-
-
-Chunks of file content
-----------------------
-
-Files vary in size a lot, and thus Obnam breaks file content into
-suitably small pieces, called chunks. Chunks can be re-used between
-files and clients, for de-duplication. Chunks may be of variable size.
-Chunks are accessed using their identifier only.
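
A minimal sketch of the chunk idea described under "Chunks of file content": content is split into pieces and each piece is stored under a random 64-bit identifier that reveals nothing about the content. The fixed `chunk_size`, the function names, and the use of a plain dict as the chunk store are assumptions made for illustration; the text itself allows chunks of variable size and does not say how Obnam picks boundaries.

```python
import secrets


def split_into_chunks(data, chunk_size=4096):
    """Split bytes into pieces of at most chunk_size bytes (fixed-size, for simplicity)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def new_identifier():
    """Return a random 64-bit integer, unrelated to the content it will name."""
    return secrets.randbits(64)


def store_chunks(data, chunk_store, chunk_size=4096):
    """Store each chunk under a fresh identifier; return the identifiers in order."""
    ids = []
    for chunk in split_into_chunks(data, chunk_size):
        chunk_id = new_identifier()
        while chunk_id in chunk_store:  # collision is extremely unlikely, but cheap to check
            chunk_id = new_identifier()
        chunk_store[chunk_id] = chunk
        ids.append(chunk_id)
    return ids
```

A file's content is then represented by the ordered list of identifiers, and restoring it means fetching each chunk by identifier and concatenating the results.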
-
-
-Chunk indexes
--------------
-
-For de-duplication, it is necessary to know if a given piece of file
-content data is already stored in the repository. Chunk indexes
-provide a mapping from the value of a checksum algorithm to a list of
-chunk identifiers whose content has that checksum.
-
-Additionally, when removing backup generations (`obnam forget`), it is
-necessary to know which clients are using a given chunk. Chunk indexes
-also provide a mapping from a chunk identifier to a list of client
-identifiers.
-
-
-Per-client data
----------------
-
-Each client has its own files, and manages its own backup history. For
-each client the repository has its own area, where the client stores:
-
-* each backup generation
-* all the names and metadata (`stat`(2) results, etc) for each file
-* possibly some other data that is only relevant for that client
-
-
-Feature implementation
-======================
-
-Some of the headline features of Obnam need to be implementable using
-the repository design. This section describes how that happens.
-
-
-De-duplication
---------------
-
-De-duplication is implemented by the backup process reading a file and
-splitting the content up in chunks, using whatever chunking method it
-chooses. It then looks up the chunk in the chunk indexes to see if the
-chunk content is already in the repository.
-
-If there are chunks with the same checksum, the backup process can
-then decide to re-use the chunks, on the assumption that the
-checksum is strong enough and that there are no collisions.
-Alternatively, it can download each of the chunks from the repository
-and compare the data bit by bit, to verify a match. The latter is
-quite expensive in time and bandwidth, but necessary for those who
-can't rely on checksums, such as those researching checksum
-collisions.
-
-Encryption
-----------
-
-Storage is dumb, and so it doesn't encrypt itself, or at least Obnam
-can't assume it does. Further, storage may be controlled by someone
-else, such as an online storage provider, and while they may be
-trusted to store data, they can't be trusted to not leak data. Thus,
-Obnam encrypts data before putting it in storage (assuming the user
-enables encryption).
-
-Encryption is done by combining symmetric and asymmetric (public key)
-encryption. For each area (client list, chunks, chunk indexes,
-per-client), a random symmetric encryption key is created, and all
-other files in the area are encrypted using that key. The symmetric
-key itself is encrypted using a list of public keys, which are
-provided by the user. The public keys are stored in a file in the
-area, and that file is encrypted using the symmetric key.
-
-This way, to access the public keys, you must be able to access the
-symmetric key, and you can only do that if you hold the private key
-matching one of the public keys. Or if you can break the encryption.
-
-
-Internal implementation details
-===============================
-
-Internally, in the Obnam code base, all repository formats are
-accessed using the same interface. This is so that Obnam can support
-any number of repository formats without excessive code complexity. In
-the code, this is in the `obnamlib/repo_interface.py` file. For
-details,
-[see that file](http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/tree/obnamlib/repo_interface.py).
-
-Additionally, all live data and repository storage is accessed using a
-filesystem abstraction, called the VFS interface. This interface is
-implemented separately for local filesystems and for SFTP servers, and
-this enables Obnam to access live data over SFTP as well. In the
-source tree, see `obnamlib/vfs.py` for the interface. Additional
-storage providers can be added by implementing the interface for them.
-
-
-Specific repository formats
-===========================
-
-Obnam implements several repository formats. Each format implements
-the abstract features described above in some concrete way. Depending
-on the quality of the implementation, the resulting format can work
-better or worse.
-
-* [[Format **6**|format-6]] was introduced prior to version 1.0, and
-  is currently the main format. It is intended for real use.
-* Format [[Green Albatross|format-green-albatross]] is an experimental
-  format, intended to eventually replace format 6.
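
The two mappings described under "Chunk indexes", and the lookup described under "De-duplication", can be sketched as follows. This is a hedged illustration only: the `ChunkIndex` class, the `backup_chunk` function, the `verify` flag, the dict used as a chunk store, and the choice of SHA-256 as the checksum are all assumptions, since the text does not name a checksum algorithm or an API.

```python
import hashlib
import secrets


class ChunkIndex:
    """Hypothetical index holding both mappings the text describes."""

    def __init__(self):
        self.by_checksum = {}  # checksum -> list of chunk identifiers
        self.used_by = {}      # chunk identifier -> set of client identifiers

    def find(self, checksum):
        # Return the chunk identifiers whose content has this checksum.
        return list(self.by_checksum.get(checksum, []))

    def add(self, checksum, chunk_id, client_id):
        # Record the chunk under its checksum and note that the client uses it.
        ids = self.by_checksum.setdefault(checksum, [])
        if chunk_id not in ids:
            ids.append(chunk_id)
        self.used_by.setdefault(chunk_id, set()).add(client_id)


def backup_chunk(data, client_id, index, chunk_store, verify=False):
    """Re-use a stored chunk with the same checksum, or store a new one."""
    checksum = hashlib.sha256(data).hexdigest()  # assumed checksum algorithm
    for chunk_id in index.find(checksum):
        # verify=False trusts the checksum; verify=True compares the actual bytes,
        # which on real storage would cost a download per candidate chunk.
        if not verify or chunk_store[chunk_id] == data:
            index.add(checksum, chunk_id, client_id)
            return chunk_id
    chunk_id = secrets.randbits(64)  # random 64-bit identifier
    chunk_store[chunk_id] = data
    index.add(checksum, chunk_id, client_id)
    return chunk_id
```

The `used_by` mapping is what a forget operation would consult: a chunk can only be deleted once no client identifier remains in its set.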