Diffstat (limited to 'obnam.md')
-rw-r--r--  obnam.md  237
1 files changed, 184 insertions, 53 deletions
diff --git a/obnam.md b/obnam.md
index 396bc44..c4122c2 100644
--- a/obnam.md
+++ b/obnam.md
@@ -1,26 +1,3 @@
----
-title: "Obnam2—a backup system"
-author: Lars Wirzenius
-documentclass: report
-bindings:
- - subplot/server.yaml
- - subplot/client.yaml
- - subplot/data.yaml
- - lib/files.yaml
- - lib/runcmd.yaml
-impls:
- python:
- - subplot/server.py
- - subplot/client.py
- - subplot/data.py
- - lib/daemon.py
- - lib/files.py
- - lib/runcmd.py
-classes:
- - json
- - sql
-...
-
# Abstract
Obnam is a backup system, consisting of a not very smart server for
@@ -204,6 +181,23 @@ This is mitigated in two ways:
[CACHEDIR.TAG]: https://bford.info/cachedir/
+## Attacker can read backups via chunk server HTTP API
+
+This threat arises from the fact that the chunk server HTTP API
+currently has no authentication. This allows an attacker who can
+access the API to copy the backups and break their encryption at
+leisure.
+
+The mitigation is to add access control for the API.
+
+A simple approach is to have the chunk server admin create an
+**access token** that the client must provide with each API request.
+The token can be stored in the client configuration by `obnam init`.
+
+This would be the simplest possible access control approach. More
+nuanced approaches will be added later.
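+
+As a rough illustration of the token check, here is a sketch in Rust.
+The header name, helper functions, and token handling are assumptions
+for illustration, not the actual Obnam API or code.
+
+~~~rust
+use std::collections::HashMap;
+
+// Illustrative only: a real implementation would also compare tokens in
+// constant time to avoid timing attacks.
+const AUTH_HEADER: &str = "Authorization";
+
+// Client side: attach the configured access token to an outgoing request.
+fn add_token(headers: &mut HashMap<String, String>, token: &str) {
+    headers.insert(AUTH_HEADER.to_string(), format!("Bearer {}", token));
+}
+
+// Server side: reject any request whose token does not match the one the
+// chunk server admin created.
+fn is_authorized(headers: &HashMap<String, String>, expected: &str) -> bool {
+    headers
+        .get(AUTH_HEADER)
+        .map(|value| value == &format!("Bearer {}", expected))
+        .unwrap_or(false)
+}
+
+fn main() {
+    let mut headers = HashMap::new();
+    add_token(&mut headers, "example-token");
+    assert!(is_authorized(&headers, "example-token"));
+    assert!(!is_authorized(&headers, "wrong-token"));
+}
+~~~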
+
+
# Software architecture
## Effects of requirements
@@ -587,6 +581,153 @@ using configuration management, and if necessary, backups can be
triggered on each host by having the server reach out and run the
Obnam client.
+## On splitting file data into chunks
+
+A backup program needs to split the data it backs up into chunks. This
+can be done in various ways.
+
+### A complete file as a single chunk
+
+This is a very simple approach, where the whole file is considered to
+be a chunk, regardless of its size.
+
+Using complete files is often impractical, since they need to be
+stored and transferred as a unit. If a file is enormous, transferring
+it completely can be a challenge: if there's a one-bit error in the
+transfer, the whole thing needs to be transferred again.
+
+There is no de-duplication except possibly of entire files.
+
+### Fixed size chunks
+
+Split a file into chunks of a fixed size. For example, if the chunk
+size is 1 MiB, a 1 GiB file is 1024 chunks. All chunks are the same
+size, except that the last chunk of a file is shorter if the file size
+is not a multiple of the chunk size.
+
+Fixed size chunks are very easy to implement and make de-duplication
+of partial files possible. However, that de-duplication tends to work
+only for the beginnings of files: inserting data into a file tends to
+result in the chunks after the insertion no longer matching.
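+
+As a minimal sketch (not Obnam's actual code), fixed size chunking in
+Rust can lean on the standard library's `chunks` method, which yields a
+shorter final slice when the input length is not a multiple of the
+chunk size:
+
+~~~rust
+// Split data into fixed size chunks; the final chunk may be shorter.
+fn split_fixed(data: &[u8], chunk_size: usize) -> Vec<&[u8]> {
+    data.chunks(chunk_size).collect()
+}
+
+fn main() {
+    let data = vec![0u8; 10 * 1024 + 17];
+    let chunks = split_fixed(&data, 1024);
+    assert_eq!(chunks.len(), 11);
+    assert_eq!(chunks.last().unwrap().len(), 17);
+}
+~~~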
+
+### Splitting based on a formula using content
+
+A rolling checksum function is computed on a sliding window of bytes
+from the input file. The window has a fixed size. The function is
+extremely efficient to compute when bytes are moved into or out of the
+window. When the value of the function, the checksum, matches a
+certain bit pattern, that position is considered a chunk boundary.
+Such a pattern might, for example, be that the lowest N bits are zero.
+Any data that is pushed out of the sliding window also forms a chunk.
+
+The code that splits data into chunks may set minimum and maximum
+chunk sizes, whether a chunk comes from a checksum match or from
+overflowed bytes. This prevents pathological input data, where the
+checksum has the boundary bit pattern after every byte, from resulting
+in each input byte becoming its own chunk.
+
+This finds chunks efficiently even when new data is inserted into the
+input data, with some caveats.
+
+Example: assume a sliding window of four bytes, and a 17-byte input
+file that contains four copies of the same 4-byte sequence, with one
+extra byte (0xff) between the second and third copies.
+
++------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
+|data | a| b| c| d| a| b| c| d|ff| a| b| c| d| a| b| c| d|
++------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
+|offset|01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|
++------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
+
+Bytes 1-4 are the same as bytes 5-8, 10-13, and 14-17. When we compute
+the checksum for each input byte, we get the results below, for a
+moving sum function.
+
++--------+------+----------+-----------------+
+| offset | byte | checksum | chunk boundary? |
++--------+------+----------+-----------------+
+| 0 | a | a | no |
++--------+------+----------+-----------------+
+| 1 | b | a+b | no |
++--------+------+----------+-----------------+
+| 2 | c | a+b+c | no |
++--------+------+----------+-----------------+
+| 3 | d | a+b+c+ | yes |
++--------+------+----------+-----------------+
+| 4 | a | a | no |
++--------+------+----------+-----------------+
+| 5 | b | a+b | no |
++--------+------+----------+-----------------+
+| 6 | c | a+b+c | no |
++--------+------+----------+-----------------+
+| 7 | d | a+b+c+ | yes |
++--------+------+----------+-----------------+
+| 9 | ff | ff | no |
++--------+------+----------+-----------------+
+| 10 | a | ff+a | no |
++--------+------+----------+-----------------+
+| 11 | b | ff+a+b | no |
++--------+------+----------+-----------------+
+| 12 | c | ff+a+b+c | no |
++--------+------+----------+-----------------+
+| 13 | d | a+b+c+d | yes |
++--------+------+----------+-----------------+
+| 14 | a | a | no |
++--------+------+----------+-----------------+
+| 15 | b | a+b | no |
++--------+------+----------+-----------------+
+| 16 | c | a+b+c | no |
++--------+------+----------+-----------------+
+| 17 | d | a+b+c+ | yes |
++--------+------+----------+-----------------+
+
+Note that in this example, the byte at offset 9 (0xff) slides out of
+the window when the byte at offset 13 slides in, and this results in
+the bytes at offsets 10-13 being recognized as a chunk by the checksum
+function. This example is carefully constructed for that happy
+coincidence. In a more realistic scenario, a chunk boundary might not
+be found in the data after the inserted bytes until the sliding window
+has been filled once.
+
+By choosing a suitable window size and checksum value pattern for
+chunk boundaries, the chunk splitting can find smaller or larger
+chunks, balancing the possibility of more detailed de-duplication
+against the overhead of storing many chunks.
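+
+The sketch below shows the idea in Rust, using a plain moving sum over
+a sliding window as the rolling checksum. The window size, boundary
+bit pattern, and minimum and maximum chunk sizes are made-up values
+for illustration; this is not Obnam's implementation.
+
+~~~rust
+use std::collections::VecDeque;
+
+// Illustrative values only; a real chunker tunes these carefully.
+const WINDOW: usize = 48;
+const MASK: u32 = (1 << 12) - 1; // boundary when the lowest 12 bits are zero
+const MIN_CHUNK: usize = 2 * 1024;
+const MAX_CHUNK: usize = 64 * 1024;
+
+// Split data into chunks using a moving sum over a fixed-size sliding
+// window, declaring a boundary when the sum matches the bit pattern, and
+// enforcing minimum and maximum chunk sizes.
+fn split_rolling(data: &[u8]) -> Vec<&[u8]> {
+    let mut chunks = Vec::new();
+    let mut start = 0;
+    let mut sum: u32 = 0;
+    let mut window: VecDeque<u8> = VecDeque::with_capacity(WINDOW);
+
+    for (i, &byte) in data.iter().enumerate() {
+        // Slide the window: add the new byte, drop the oldest one.
+        sum = sum.wrapping_add(byte as u32);
+        window.push_back(byte);
+        if window.len() > WINDOW {
+            sum = sum.wrapping_sub(window.pop_front().unwrap() as u32);
+        }
+
+        let len = i + 1 - start;
+        let at_boundary = (sum & MASK) == 0 && len >= MIN_CHUNK;
+        if at_boundary || len >= MAX_CHUNK {
+            chunks.push(&data[start..=i]);
+            start = i + 1;
+            sum = 0;
+            window.clear();
+        }
+    }
+    // Whatever is left after the last boundary forms the final chunk.
+    if start < data.len() {
+        chunks.push(&data[start..]);
+    }
+    chunks
+}
+
+fn main() {
+    let data: Vec<u8> = (0..200_000u32).map(|i| (i * 31 % 251) as u8).collect();
+    let chunks = split_rolling(&data);
+    let total: usize = chunks.iter().map(|c| c.len()).sum();
+    assert_eq!(total, data.len());
+    println!("split into {} chunks", chunks.len());
+}
+~~~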
+
+### Varying splitting method based on file type
+
+The data may contain files of different types, and this can be used to
+vary the way data is split into chunks. For example, compressed video
+files may use one way of chunking, while software source code may use
+another.
+
+For example:
+
+* emails in mbox or Maildir formats could be split into headers, body,
+ and attachments, and each of those further into chunks
+* SQL dumps of databases tend to contain very large numbers of rows
+ with the same structure
+* video files are often split into frames; possibly those can be used
+ for intelligent chunking?
+* uncompressed tar files have a definite structure (header, followed
+ by file data) that can probably be used for splitting into chunks
+* ZIP files compress each file separately, which could be used to
+ split them into chunks: this way, two ZIP files with the same file
+ inside them might share the compressed file as a chunk
+* disk images often contain long sequences of zeroes, which could be
+ used for splitting into chunks
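+
+A rough sketch of how such a per-file-type choice could look in Rust;
+the file types and strategy names below are made up for illustration
+only, not Obnam's design:
+
+~~~rust
+use std::path::Path;
+
+#[derive(Debug, PartialEq)]
+enum ChunkingStrategy {
+    WholeFile,
+    FixedSize,
+    RollingChecksum,
+}
+
+// Pick a chunking method based on the file name extension. The mapping is
+// arbitrary; the point is only that the method can vary per file type.
+fn choose_strategy(path: &Path) -> ChunkingStrategy {
+    match path.extension().and_then(|e| e.to_str()) {
+        Some("mkv") | Some("mp4") => ChunkingStrategy::WholeFile,
+        Some("img") | Some("iso") => ChunkingStrategy::FixedSize,
+        _ => ChunkingStrategy::RollingChecksum,
+    }
+}
+
+fn main() {
+    assert_eq!(choose_strategy(Path::new("movie.mkv")),
+               ChunkingStrategy::WholeFile);
+    assert_eq!(choose_strategy(Path::new("notes.txt")),
+               ChunkingStrategy::RollingChecksum);
+}
+~~~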
+
+### Next actions
+
+Obnam currently splits data using fixed size chunks. This can and will
+be improved, and the changes will only affect the client. Help is
+welcome.
+
+### Thanks
+
+Thank you to Daniel Silverstone who explained some of the mathematics
+about this to me.
+
+
# File metadata
Files in a file system contain data and have metadata: data about the
@@ -1118,21 +1259,6 @@ and content-type is application/json
and the JSON body matches {"<ID>":{"label":"0abc"}}
~~~
-Finally, we must be able to delete it. After that, we must not be able
-to retrieve it, or find it using metadata.
-
-~~~scenario
-when I DELETE /v1/chunks/<ID>
-then HTTP status code is 200
-
-when I GET /v1/chunks/<ID>
-then HTTP status code is 404
-
-when I GET /v1/chunks?label=0abc
-then HTTP status code is 200
-and content-type is application/json
-and the JSON body matches {}
-~~~
## Retrieve a chunk that does not exist
@@ -1157,17 +1283,6 @@ and content-type is application/json
and the JSON body matches {}
~~~
-## Delete chunk that does not exist
-
-We must get the right error when deleting a chunk that doesn't exist.
-
-~~~scenario
-given a working Obnam system
-when I try to DELETE /v1/chunks/any.random.string
-then HTTP status code is 404
-~~~
-
-
## Persistent across restarts
Chunk storage, and the index of chunk metadata for searches, needs to
@@ -1333,7 +1448,7 @@ and a client config based on ca-required.yaml
and a file live/data.dat containing some random data
when I try to run obnam backup
then command fails
-then stderr contains "self signed certificate"
+then stderr matches regex self.signed certificate
~~~
~~~{#ca-required.yaml .file .yaml .numberLines}
@@ -1662,8 +1777,12 @@ given a manifest of the directory live restored in rest in rest.yaml
then manifests live.yaml and rest.yaml match
~~~
-## Unreadable file
+## FIXME: Unreadable file
+FIXME: This scenario has been disabled temporarily, as my current CI
+system runs things as `root`, which makes this scenario fail.
+
+~~~~~~~~
This scenario verifies that Obnam will back up all files of live data,
even if one of them is unreadable. By inference, we assume this means
other errors on individual files also won't end the backup
@@ -1682,9 +1801,14 @@ when I invoke obnam restore <GEN> rest
then file live/data.dat is restored to rest
then file live/bad.dat is not restored to rest
~~~
+~~~~~~~~
+
+## FIXME: Unreadable directory
-## Unreadable directory
+FIXME: This scenario has been disabled temporarily, as my current CI
+system runs things as `root`, which makes this scenario fail.
+~~~~~~~~
This scenario verifies that Obnam will skip a file in a directory it
can't read. Obnam should warn about that, but not give an error.
@@ -1700,9 +1824,14 @@ when I invoke obnam restore <GEN> rest
then file live/unreadable is restored to rest
then file live/unreadable/data.dat is not restored to rest
~~~
+~~~~~~~~
-## Unexecutable directory
+## FIXME: Unexecutable directory
+FIXME: This scenario has been disabled temporarily, as my current CI
+system runs things as `root`, which makes this scenario fail.
+
+~~~~~~~
This scenario verifies that Obnam will skip a file in a directory it
can't read. Obnam should warn about that, but not give an error.
@@ -1718,6 +1847,8 @@ when I invoke obnam restore <GEN> rest
then file live/dir is restored to rest
then file live/dir/data.dat is not restored to rest
~~~
+~~~~~~~
+
## Restore latest generation