From e1c4683b73ec2a207321377636f4ed722d0674dc Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 18 Sep 2020 19:59:41 +0300
Subject: doc(subplot): flesh out the subplot

Add a lot of text. This is initially meant for an MVP.
---
 obnam.md | 299 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 294 insertions(+), 5 deletions(-)

diff --git a/obnam.md b/obnam.md
index 051bfc0..7caf0aa 100644
--- a/obnam.md
+++ b/obnam.md
@@ -1,13 +1,245 @@
-# Acceptance criteria
+# Introduction
+
+Obnam2 is a project to develop a backup system.
+
+In 2004 I started a project to develop a backup program for myself,
+which in 2006 I named Obnam. In 2017 I retired the project, because it
+was no longer fun. The project had some long-standing architectural
+issues related to performance that had become entrenched and were hard
+to fix without breaking backwards compatibility.
+
+In 2020, with Obnam2, I'm starting over from scratch. The new software
+is not, and will not become, compatible with Obnam1 in any way. I aim
+for the new software to be more reliable and faster than Obnam1,
+without sacrificing security or ease of use, while being maintainable
+in the long run.
+
+Part of that maintainability is going to be achieved by using Rust as
+the programming language (strong, static type system) rather than
+Python (dynamic, comparatively weak type system). Another part is
+aiming more strongly for simplicity and elegance. Obnam1 used an
+elegant but not very simple copy-on-write B-tree structure; Obnam2
+will at least initially use [SQLite][].
+
+[SQLite]: https://sqlite.org/index.html
+
+## Glossary
+
+This document uses some specific terminology related to backups. Here
+is a glossary of such terms.
+
+* **chunk** is a relatively small amount of live data, or metadata
+  about live data, as chosen by the client
+* **client** is the computer system where the live data lives, and
+  also the part of Obnam2 running on that computer
+* **generation** is a snapshot of live data
+* **live data** is the data that gets backed up
+* **repository** is where the backups get stored
+* **server** is the computer system where the repository resides, and
+  also the part of Obnam2 running on that computer
+
+
+# Requirements
+
+The following high-level requirements are not meant to be verifiable
+in an automated way:
+
+* _Not done:_ **Easy to install:** available as a Debian package in an
+  APT repository.
+* _Not done:_ **Easy to configure:** only need to configure things
+  that are inherently specific to a client and for which sensible
+  defaults are impossible.
+* _Not done:_ **Easy to run:** a single command line that's always the
+  same works for making a backup.
+* _Not done:_ **Detects corruption:** if a file in the repository is
+  modified or deleted, the software notices it automatically.
+* _Not done:_ **Corrects any 1-bit error:** if a file in the
+  repository is changed by one bit, the software automatically
+  corrects it.
+* _Not done:_ **Repository is encrypted:** all data stored in the
+  repository is encrypted with a key only the client has.
+* _Not done:_ **Fast backups and restores:** when a client and server
+  both have sufficient CPU, RAM, and disk bandwidth, the software
+  makes or restores a backup over gigabit Ethernet using at least
+  50% of the network bandwidth.
+* _Not done:_ **Snapshots:** Each backup generation is an independent
+  snapshot: it can be deleted without affecting any other generation.
+* _Not done:_ **Deduplication:** Identical chunks of data are stored
+  only once in the backup repository.
+* _Not done:_ **Compressed:** Data stored in the backup repository is
+  compressed.
+* _Not done:_ **Large numbers of live data files:** The system must
+  handle ten million files in the live data.
+* _Not done:_ **Live data in the terabyte range:** The system must
+  handle a terabyte of live data.
+* _Not done:_ **Many clients:** The system must handle a thousand
+  total clients, and one hundred clients using the server
+  concurrently.
+* _Not done:_ **Shared repository:** The system should allow people
+  who don't trust each other to share a repository without fearing
+  that their own data, or even its existence, leaks to anyone.
+* _Not done:_ **Shared backups:** People who do trust each other
+  should be able to share backed-up data in the repository.
+
+The detailed, automatically verified acceptance criteria are
+documented in the ["Acceptance criteria"](#acceptance) chapter.
+
+
+## Requirements for a minimum viable product
+
+The first milestone for Obnam2 – the minimum viable product –
+does not try to fulfil all the requirements above. Instead, the
+following semi-subset is the goal:
+
+* _Not done:_ **Can back up my own data and restore it:** This is the
+  minimum functionality for a backup program.
+* _Not done:_ **Fast backups and restores:** a backup or restore of 10
+  GiB of live data, between two VMs on my big home server, takes less
+  than 200 seconds.
+* _Not done:_ **Snapshots:** Each backup generation is an independent
+  snapshot: it can be deleted without affecting any other generation.
+* _Not done:_ **Deduplication:** Identical files are stored only once
+  in the backup repository.
+* _Not done:_ **Single client:** Only a single client per server is
+  supported.
+* _Not done:_ **No authentication:** The client does not authenticate
+  itself to the server.
+* _Not done:_ **No encryption:** The client sends data to the server
+  in cleartext.
+
+This document currently only documents the detailed acceptance
+criteria for the MVP. When the MVP is finished, this document will
+start documenting more.
+
+
+# Architecture
+
+For the minimum viable product, Obnam2 will be split into a server and
+one or more clients. The server handles storage of chunks and access
+control to them. The clients make and restore backups. The
+communication between the clients and the server is via HTTP.
+
+~~~dot
+digraph "arch" {
+  live1 -> client1;
+  live2 -> client2;
+  live3 -> client3;
+  live4 -> client4;
+  live5 -> client5;
+  client1 -> server [label="HTTP"];
+  client2 -> server;
+  client3 -> server;
+  client4 -> server;
+  client5 -> server;
+  server -> disk;
+  live1 [shape=cylinder]
+  live2 [shape=cylinder]
+  live3 [shape=cylinder]
+  live4 [shape=cylinder]
+  live5 [shape=cylinder]
+  disk [shape=cylinder]
+}
+~~~
+
+The server side is not very smart. It handles storage of chunks and
+their metadata only. The client is smarter:
+
+* it scans live data for files to back up
+* it splits those files into chunks, and stores the chunks on the
+  server (see the sketch after this list)
+* it constructs an SQLite database file, with all filenames, file
+  metadata, and the chunks associated with each live data file
+* it stores the database on the server, as chunks
+* it stores a chunk specially marked as a generation on the server
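+
+To illustrate the splitting step, here is a minimal sketch of how the
+client might cut a file into chunks. It is only a sketch: this
+document doesn't specify a chunking strategy, so the code assumes
+fixed-size chunks, and the names in it are made up for the example.
+
+~~~rust
+use std::fs::File;
+use std::io::{BufReader, Read};
+
+// An assumed chunk size; the real size is an open design question.
+const CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB
+
+fn split_into_chunks(path: &str) -> std::io::Result<Vec<Vec<u8>>> {
+    let mut reader = BufReader::new(File::open(path)?);
+    let mut chunks = Vec::new();
+    loop {
+        let mut buf = vec![0; CHUNK_SIZE];
+        let mut filled = 0;
+        // Fill the buffer until it is full or the file ends.
+        while filled < CHUNK_SIZE {
+            let n = reader.read(&mut buf[filled..])?;
+            if n == 0 {
+                break;
+            }
+            filled += n;
+        }
+        if filled == 0 {
+            break;
+        }
+        buf.truncate(filled);
+        chunks.push(buf);
+    }
+    Ok(chunks)
+}
+~~~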
+
+The generation chunk contains a list of the chunks for the SQLite
+database.
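+
+As an illustration, the per-generation SQLite database might have a
+schema along the following lines. This is a sketch, not a design
+decision: the table and column names are made up, and the example
+assumes the `rusqlite` crate.
+
+~~~rust
+use rusqlite::{Connection, Result};
+
+// Create the tables for one generation's database. The client builds
+// this file locally, then uploads it to the server as chunks.
+fn create_generation_db(path: &str) -> Result<Connection> {
+    let conn = Connection::open(path)?;
+    conn.execute_batch(
+        "CREATE TABLE files (fileid INTEGER PRIMARY KEY,
+                             filename TEXT,
+                             metadata TEXT);
+         CREATE TABLE file_chunks (fileid INTEGER,
+                                   seq INTEGER,
+                                   chunk_id TEXT);",
+    )?;
+    Ok(conn)
+}
+~~~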
+
+When the client needs to restore data:
+
+* it gets a list of generation chunks from the server
+* it lets the user choose a generation
+* it downloads the generation chunk, then the associated SQLite
+  database, and then all the backed-up files listed in the database
+
+This is the simplest architecture I can think of for the MVP.
+
+## Chunk server API
+
+The chunk server has the following API:
+
+* `POST /chunks` – store a new chunk (and its metadata) on the
+  server, return its randomly chosen identifier
+* `GET /chunks/<id>` – retrieve a chunk (and its metadata) from
+  the server, given a chunk identifier
+* `GET /chunks?sha256=xyzzy` – find chunks on the server whose
+  metadata indicates their content has a given SHA256 checksum
+* `GET /chunks?generation=true` – find generation chunks
+
+When creating or retrieving a chunk, its metadata is carried in a
+`Chunk-Meta` header as a JSON object. The following keys are allowed:
+
+* `sha256` – the SHA256 checksum of the chunk contents, as
+  determined by the client
+  - this must be set for every chunk, including generation chunks
+  - note that the server doesn't verify this in any way
+* `generation` – set to `true` if the chunk represents a
+  generation
+  - may also be set to `false` or `null`, or be missing entirely
+* `ended` – the timestamp of when the backup generation ended
+  - note that the server doesn't process this in any way; the content
+    is entirely up to the client
+  - may be set to the empty string, `null`, or be missing entirely
+
+HTTP status codes are used to indicate if a request succeeded or not,
+using the customary meanings.
+
+When creating a chunk, the chunk's metadata is sent in the
+`Chunk-Meta` header, and its contents in the request body. The new
+chunk gets a randomly assigned identifier, and if the request is
+successful, the response is a JSON object with the identifier:
+
+~~~json
+{
+    "chunk_id": "fe20734b-edb3-432f-83c3-d35fe15969dd"
+}
+~~~
+
+The identifier is a [UUID4][], but the client should not assume that.
+
+[UUID4]: https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)
+
+When a chunk is retrieved, the chunk metadata is returned in the
+`Chunk-Meta` header, and the contents in the response body.
+
+Note that it is not possible to update a chunk or its metadata.
+
+When searching for chunks, the matching chunks' identifiers and
+metadata are returned in a JSON object:
+
+~~~json
+{
+    "fe20734b-edb3-432f-83c3-d35fe15969dd": {
+        "sha256": "09ca7e4eaa6e8ae9c7d261167129184883644d07dfba7cbfbc4c8a2e08360d5b",
+        "generation": null,
+        "ended": null
+    }
+}
+~~~
+
+There can be any number of chunks in the response.
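+
+To make the API concrete, here is a sketch of how a client might use
+it. This is not part of the design: it assumes the `reqwest` crate
+(blocking API, `json` feature) and the `serde_json` crate, the
+function names are made up, and error handling is minimal.
+
+~~~rust
+use reqwest::blocking::Client;
+use serde_json::Value;
+
+// Store a chunk; return the server-assigned chunk identifier.
+fn store_chunk(server: &str, meta: &str, data: Vec<u8>) -> reqwest::Result<String> {
+    let resp: Value = Client::new()
+        .post(format!("{}/chunks", server))
+        .header("Chunk-Meta", meta)
+        .body(data)
+        .send()?
+        .json()?;
+    Ok(resp["chunk_id"].as_str().unwrap_or("").to_string())
+}
+
+// Retrieve a chunk's contents, given its identifier.
+fn get_chunk(server: &str, chunk_id: &str) -> reqwest::Result<Vec<u8>> {
+    let resp = Client::new()
+        .get(format!("{}/chunks/{}", server, chunk_id))
+        .send()?;
+    Ok(resp.bytes()?.to_vec())
+}
+~~~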
+
+
+# Acceptance criteria {#acceptance}
 
 [Subplot]: https://subplot.liw.fi/
 
 This chapter documents detailed acceptance criteria and how they are
-verified as scenarios for the [Subplot][] tool
+verified as scenarios for the [Subplot][] tool. At this time, only
+criteria for the minimum viable product are included.
 
 ## Chunk server
 
-These scenarios verify that the chunk server works.
+These scenarios verify that the chunk server works on its own. The
+scenarios start a fresh, empty chunk server, perform some operations
+on it, verify the results, and finally terminate the server.
 
 ### Chunk management
 
@@ -16,7 +248,8 @@ retrieved, with its metadata, and then deleted.
 The chunk server has an API with just one endpoint, `/chunks`, and
 accepts the POST, GET, and DELETE operations on it.
 
-To create a chunk, we use POST.
+To create a chunk, we use POST. We remember the identifier so we can
+retrieve the chunk later.
 
 ~~~scenario
 given a chunk server
@@ -27,7 +260,7 @@ and content-type is application/json
 and the JSON body has a field chunk_id, henceforth ID
 ~~~
 
-To retrieve a chunk, we use GET, giving the chunk id in the path.
+To retrieve a chunk, we use GET.
 
 ~~~scenario
 when I GET /chunks/<ID>
@@ -37,6 +270,59 @@ and chunk-meta is {"sha256":"abc","generation":null,"ended":null}
 and the body matches file data.dat
 ~~~
 
+TODO: fetch non-existent chunk
+
+TODO: delete chunk
+
+
+## Smoke test
+
+This scenario verifies that a small amount of data in simple files in
+one directory can be backed up and restored, and that the restored
+files and their metadata are identical to the originals. This is the
+simplest possible, but still useful, requirement for a backup system.
+
+## Backups and restores
+
+These scenarios verify that every kind of file system object can be
+backed up and restored.
+
+### All kinds of files and metadata
+
+This scenario verifies that all kinds of files (regular, hard link,
+symbolic link, directory, etc) and metadata can be backed up and
+restored.
+
+### Duplicate files are stored once
+
+This scenario verifies that if the live data has two copies of the
+same file, it is stored only once.
+
+### Snapshots are independent
+
+This scenario verifies that generation snapshots are independent of
+each other, by making three backup generations, deleting the middle
+one, and restoring the other two.
+
+
+## Performance
+
+These scenarios verify that system performance is at an expected
+level, at least in simple cases. To keep the implementation of the
+scenarios manageable, communication is over `localhost`, not between
+hosts. A more thorough benchmark suite will need to be implemented
+separately.
+
+### Can back up 10 GiB in 200 seconds
+
+This scenario verifies that the system can back up data at an
+acceptable speed.
+
+### Can restore 10 GiB in 200 seconds
+
+This scenario verifies that the system can restore backed-up data at
+an acceptable speed.
+
@@ -46,10 +332,13 @@
 ---
 title: "Obnam2—a backup system"
 author: Lars Wirzenius
+documentclass: report
 bindings:
 - subplot/obnam.yaml
 functions:
 - subplot/obnam.py
 - subplot/runcmd.py
 - subplot/daemon.py
+classes:
+- json
 ...