Diffstat (limited to 'ondisk.mdwn')
-rw-r--r-- ondisk.mdwn 212
1 files changed, 0 insertions, 212 deletions
diff --git a/ondisk.mdwn b/ondisk.mdwn
deleted file mode 100644
index 1e1afca..0000000
--- a/ondisk.mdwn
+++ /dev/null
@@ -1,212 +0,0 @@
-[[!meta title="Obnam on-disk data structures"]]
-
-Introduction
-============
-
-This page gives a high-level picture of the Obnam repository data
-structures and of the internal abstraction for handling them. It
-then links to more detailed descriptions of each repository
-format.
-
-[[!toc levels=2]]
-
-
-Constraints and assumptions
-===========================
-
-The repository design is influenced by several constraints and
-assumptions, some of which are described here.
-
-Disk-like storage
------------------
-
-Storage is assumed to be disk-like, meaning any piece of data can be
-accessed at about the same speed. It is not tape-like, where data can
-basically only be appended and seeks are not practical.
-
-Shared storage
---------------
-
-Multiple backup clients may share storage, for lower cost or easier
-administration.
-
-Dumb storage
-------------
-
-The most important assumption is that the repository storage is
-dumb: it can't do any processing of data. Basically we must assume the
-repository provides only the following operations:
-
-* PUT some data in storage under a given filename, including a
- hierarchical directory structure. This is **atomic**, meaning the
- data is either completely available, or does not exist at all, under
- the given name. Stored data may be replaced completely, but it can't
- be updated only partly. PUT may optionally fail, if the name is
- already in use.
-* GET named data from storage.
-* LIST all data in storage in a given directory.
-* DELETE named data from storage.
-
-We can't, for example, request that the storage compute a checksum of
-some data. We especially can't assume Obnam itself is running on the
-machine providing the storage.
-
-Storage filesystem
-------------------
-
-The storage may be provided by any existing filesystem, including
-VFAT, or it might not be provided by a real filesystem at all. We
-can't use neat tricks like hard links to implement the repository.
-
-Round trip time
----------------
-
-The storage may be a local filesystem, but it may also be accessed over
-the network using some protocol such as SFTP. This means every storage
-access potentially carries a large time overhead. Minimising the
-number of separate accesses is necessary for good performance.
-
-Security and privacy
---------------------
-
-The repository design must assume an attacker has at least read-only
-access to the repository. This means the design should avoid leaking
-information via filenames, or other such things. Some data leakage is
-unavoidable: it is, for example, unavoidable that an attacker can keep
-track of which files were changed when.
-
-
-Repository structure
-====================
-
-The repository is divided into four kinds of areas:
-
-* A list of clients.
-* Chunks of file content.
-* Indexes to find chunks efficiently.
-* Per-client data, such as what files each client has.
-
-These areas are mostly independent of each other. They refer to
-objects using identifiers: clients and chunks have identifiers that
-are random 64-bit integers, to avoid data leaks.
-
-
-Client list
------------
-
-Each backup client needs to add itself to the client list. The client
-list maps the client name to the client identifier, and also lists the
-client's encryption key.
-
-
-Chunks of file content
-----------------------
-
-Files vary in size a lot, and thus Obnam breaks file content into
-suitably small pieces, called chunks. Chunks can be re-used between
-files and clients, for de-duplication. Chunks may be of variable size.
-Chunks are accessed using their identifier only.
-
-
-Chunk indexes
--------------
-
-For de-duplication, it is necessary to know if a given piece of file
-content data is already stored in the repository. Chunk indexes
-provide a mapping from the value of a checksum algorithm to a list of
-chunk identifiers whose content has that checksum.
-
-Additionally, when removing backup generations (`obnam forget`), it is
-necessary to know which clients are using a given chunk. Chunk indexes
-also provide a mapping from a chunk identifier to a list of client
-identifiers.
-
-
-Per-client data
----------------
-
-Each client has its own files, and manages its own backup history. For
-each client the repository has its own area, where the client stores:
-
-* each backup generation
-* all the names and metadata (`stat`(2) results, etc.) for each file
-* possibly some other data that is only relevant for that client
-
-
-Feature implementation
-======================
-
-Some of the headline features of Obnam need to be implementable using
-the repository design. This section describes how that happens.
-
-
-De-duplication
---------------
-
-De-duplication is implemented by the backup process reading a file and
-splitting the content into chunks, using whatever chunking method it
-chooses. It then looks up the checksum of each chunk in the chunk
-indexes to see if the chunk content is already in the repository.
-
-If there are chunks with the same checksum, the backup process can
-either re-use the chunks, on the assumption that the checksum is
-strong enough and that there are no collisions, or it can download
-each of the chunks from the repository and compare the data
-bit by bit, to verify a match. The latter is
-quite expensive in time and bandwidth, but necessary for those who
-can't rely on checksums, such as those researching checksum
-collisions.
-
-Encryption
-----------
-
-Storage is dumb, and so it doesn't encrypt data itself, or at least Obnam
-can't assume it does. Further, storage may be controlled by someone
-else, such as an online storage provider, and while they may be
-trusted to store data, they can't be trusted to not leak data. Thus,
-Obnam encrypts data before putting it in storage (assuming the user
-enables encryption).
-
-Encryption is done by combining symmetric and asymmetric (public key)
-encryption. For each area (client list, chunks, chunk indexes,
-per-client), a random symmetric encryption key is created, and all
-other files in the area are encrypted using that key. The symmetric
-key itself is encrypted using a list of public keys, which are
-provided by the user. The public keys are stored in a file in the
-area, and that file is encrypted using the symmetric key.
-
-This way, to access the public keys, you must be able to access the
-symmetric key, and you can only do that if you have the private key
-matching one of the public keys. Or if you can break the encryption.
-
-
-Internal implementation details
-===============================
-
-Internally, in the Obnam code base, all repository formats are
-accessed using the same interface. This is so that Obnam can support
-any number of repository formats without excessive code complexity. In
-the code, this is in the `obnamlib/repo_interface.py` file. For
-details,
-[see that file](http://git.liw.fi/cgi-bin/cgit/cgit.cgi/obnam/tree/obnamlib/repo_interface.py).
-
-Additionally, all live data and repository storage is accessed using a
-filesystem abstraction, called the VFS interface. This interface is
-implemented separately for local filesystems and for SFTP servers, and
-this enables Obnam to access live data over SFTP as well. In the
-source tree, see `obnamlib/vfs.py` for the interface. Additional
-storage providers can be added by implementing the interface for them.
-
-
-Specific repository formats
-===========================
-
-Obnam implements several repository formats. Each format implements
-the abstract features described above in some concrete way. Depending
-on the quality of the implementation, the resulting format can work
-better or worse.
-
-* [[Format **6**|format-6]] was introduced prior to version 1.0, and
- is currently the main format. It is intended for real use.
-* Format [[Green Albatross|format-green-albatross]] is an experimental
- format, intended to eventually replace format 6.