author    Lars Wirzenius <liw@liw.fi>  2021-02-05 08:23:23 +0000
committer Lars Wirzenius <liw@liw.fi>  2021-02-05 08:23:23 +0000
commit    3eaa4c7dc1e3b7b6126f1e9ac818f01f400fee14 (patch)
tree      af8c8af2622b274df0eede8fa935dbcbea59862a
parent    773d8e0458ce32f156606c4cc3d81097539459ff (diff)
parent    3bfc9458c6e03751cf5067491347bb43d86bd372 (diff)
download  obnam2-3eaa4c7dc1e3b7b6126f1e9ac818f01f400fee14.tar.gz
Merge branch 'docs' into 'main'
Improve implementation/architecture docs a bit

Closes #66

See merge request larswirzenius/obnam!84
-rw-r--r--  obnam.md  |  260
1 file changed, 198 insertions, 62 deletions
diff --git a/obnam.md b/obnam.md
index 833fbf4..eaab2e8 100644
--- a/obnam.md
+++ b/obnam.md
@@ -154,66 +154,6 @@ requirements and notes how they affect the architecture.
access that.
-## On SFTP versus HTTPS
-
-Obnam1 supported using a standard SFTP server as a backup repository,
-and this was a popular feature. This section argues against supporting
-SFTP in Obnam2.
-
-The performance requirement for network use means favoring protocols
-such as HTTPS, or even QUIC, rather than SFTP.
-
-SFTP works on top of SSH. SSH provides a TCP-like abstraction for
-SFTP, and thus multiple SFTP connections can run over the same SSH
-connection. However, SSH itself uses a single TCP connection. If that
-TCP connection has a dropped packet, all traffic over the SSH
-connections, including all SFTP connections, waits until TCP
-re-transmits the lost packet and re-synchronizes itself.
-
-With multiple HTTP connections, each on its own TCP connection, a
-single dropped packet will not affect other HTTP transactions. Even
-better, the new QUIC protocol doesn't use TCP.
-
-The modern Internet is to a large degree designed for massive use of
-the world wide web, which is all HTTP, and adopting QUIC. It seems
-wise for Obnam to make use of technologies that have been designed
-for, and proven to work well with concurrency and network problems.
-
-Further, having used SFTP with Obnam1, it is not always an easy
-protocol to use. Further, if there is a desire to have controlled
-sharing of parts of one client's data with another, this would require
-writing a custom SFTP service, which seems much harder to do than
-writing a custom HTTP service. From experience, a custom HTTP service
-is easy to do. A custom SFTP service would need to shoehorn the
-abstractions it needs into something that looks more or less like a
-Unix file system.
-
-The benefit of using SFTP would be that a standard SFTP service could
-be used, if partial data sharing between clients is not needed. This
-would simplify deployment and operations for many. However, it doesn't
-seem important enough to warrant the implementation effort.
-
-Supporting both HTTP and SFTP would be possible, but also much more
-work and against the desire to keep things simple.
-
-## On "btrfs send" and similar constructs
-
-The btrfs and ZFS file systems, and possibly others, have a way to
-mark specific states of the file system and efficiently generate a
-"delta file" of all the changes between the states. The delta can be
-transferred elsewhere, and applied to a copy of the file system. This
-can be quite efficient, but Obnam won't be built on top of such a
-system.
-
-On the one hand, it would force the use of specific file systems:
-Obnam would no be able to back up data on, say, an ext4 file system,
-which seems to be the most popular one by far.
-
-Worse, it also for the data to be restored to the same type of file
-system as where the live data was originally. This onerous for people
-to do.
-
-
## Overall shape
It seems fairly clear that a simple shape of the software architecture
@@ -266,8 +206,8 @@ The responsibilities of the server are roughly:
The responsibilities of the client are roughly:
* split live data into chunks, upload them to server
-* store metadata of live data files in a file, which represents a
- backup generation, store that too as chunks on the server
+* store metadata of live data files in a generation file (an SQLite
+ database), store that too as chunks on the server
* retrieve chunks from server when restoring
* let user manage sharing of backups with other clients
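The first client responsibility, splitting live data into chunks, can be sketched minimally. The fixed chunk size and the function name below are illustrative assumptions, not Obnam's actual implementation.

~~~python
# Sketch of the client's chunk-splitting step. The 1 MiB chunk size and
# the function itself are assumptions for illustration, not Obnam's
# actual chunking algorithm.

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an assumed value


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split live data into fixed-size chunks for upload to the server."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
~~~

Each chunk would then be uploaded individually, and only the list of chunk identifiers is recorded per file.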
@@ -282,6 +222,202 @@ RSA-signed JSON Web Tokens. The server is configured to trust specific
public keys. The clients have the private keys and generate the tokens
themselves.
+## Logical structure of backups
+
+For each backup (generation) the client stores, on the server, exactly
+one _generation chunk_. This is a chunk that is specially marked as a
+generation, but is otherwise an ordinary chunk. Its content is a list
+of identifiers for the chunks that form an SQLite database.
+
+The SQLite database lists all the files in the backup, as well as
+their metadata. For each file, it lists the chunk identifiers for the
+file's content. Chunks may be shared between files in the same backup
+or in different backups.
+
+File content data chunks are just blobs of data with no structure.
+They have no reference to other data chunks, or to files or backups.
+This makes it easier to share them between files.
+
+Let's look at an example. In the figure below there are three backups,
+each using three chunks for file content data. One chunk, "data chunk
+3", is shared between all three backups.
+
+~~~pikchr
+GEN1: ellipse "Backup 1" big big
+move 200%
+GEN2: ellipse "Backup 2" big big
+move 200%
+GEN3: ellipse "Backup 3" big big
+
+arrow from GEN1.e right to GEN2.w
+arrow from GEN2.e right to GEN3.w
+
+arrow from GEN1.s down 100%
+DB1: box "SQLite" big big
+
+arrow from DB1.s left 20% then down 100%
+C1: file "data" big big "chunk 1" big big
+
+arrow from DB1.s right 0% then down 70% then right 100% then down 30%
+C2: file "data" big big "chunk 2" big big
+
+arrow from DB1.s right 20% then down 30% then right 200% then down 70%
+C3: file "data" big big "chunk 3" big big
+
+
+
+arrow from GEN2.s down 100%
+DB2: box "SQLite" big big
+
+arrow from DB2.s left 20% then down 100% then down 0.5*C3.height then to C3.e
+
+arrow from DB2.s right 0% then down 70% then right 100% then down 30%
+C4: file "data" big big "chunk 4" big big
+
+arrow from DB2.s right 20% then down 30% then right 200% then down 70%
+C5: file "data" big big "chunk 5" big big
+
+
+
+
+arrow from GEN3.s down 100%
+DB3: box "SQLite" big big
+
+arrow from DB3.s left 50% then down 100% then down 1.5*C3.height \
+ then left until even with C3.s then up to C3.s
+
+arrow from DB3.s right 20% then down 100%
+C6: file "data" big big "chunk 6" big big
+
+arrow from DB3.s right 60% then down 70% then right 100% then down 30%
+C7: file "data" big big "chunk 7" big big
+~~~
+
+
+## On SFTP versus HTTPS
+
+Obnam1 supported using a standard SFTP server as a backup repository,
+and this was a popular feature. This section argues against supporting
+SFTP in Obnam2.
+
+The performance requirement for network use means favoring protocols
+such as HTTPS, or even QUIC, rather than SFTP.
+
+SFTP works on top of SSH. SSH provides a TCP-like abstraction for
+SFTP, and thus multiple SFTP connections can run over the same SSH
+connection. However, SSH itself uses a single TCP connection. If that
+TCP connection has a dropped packet, all traffic over the SSH
+connections, including all SFTP connections, waits until TCP
+re-transmits the lost packet and re-synchronizes itself.
+
+With multiple HTTP connections, each on its own TCP connection, a
+single dropped packet will not affect other HTTP transactions. Even
+better, the new QUIC protocol doesn't use TCP.
+
+The modern Internet is to a large degree designed for massive use of
+the world wide web, which is all HTTP and is now adopting QUIC. It
+seems wise for Obnam to use technologies that have been designed for,
+and proven to work well with, concurrency and network problems.
+
+Further, experience with Obnam1 shows that SFTP is not always an easy
+protocol to use. Moreover, if there is a desire to have controlled
+sharing of parts of one client's data with another, this would require
+writing a custom SFTP service, which seems much harder than writing a
+custom HTTP service. From experience, a custom HTTP service is easy to
+write. A custom SFTP service would need to shoehorn the abstractions
+it needs into something that looks more or less like a Unix file
+system.
+
+The benefit of using SFTP would be that a standard SFTP service could
+be used, if partial data sharing between clients is not needed. This
+would simplify deployment and operations for many. However, it doesn't
+seem important enough to warrant the implementation effort.
+
+Supporting both HTTP and SFTP would be possible, but also much more
+work and against the desire to keep things simple.
+
+## On "btrfs send" and similar constructs
+
+The btrfs and ZFS file systems, and possibly others, have a way to
+mark specific states of the file system and efficiently generate a
+"delta file" of all the changes between the states. The delta can be
+transferred elsewhere, and applied to a copy of the file system. This
+can be quite efficient, but Obnam won't be built on top of such a
+system.
+
+On the one hand, it would force the use of specific file systems:
+Obnam would not be able to back up data on, say, an ext4 file system,
+which seems to be the most popular one by far.
+
+Worse, it would also require the data to be restored to the same type
+of file system as the one holding the original live data. This is
+onerous for people to do.
+
+
+## On content addressable storage
+
+[content-addressable storage]: https://en.wikipedia.org/wiki/Content-addressable_storage
+[git version control system]: https://git-scm.com/
+
+It would be possible to use the cryptographic checksum ("hash") of the
+contents of a chunk as its identifier on the server side, also known
+as [content-addressable storage][]. This would simplify de-duplication
+of chunks. However, it also has some drawbacks:
+
+* it becomes harder to handle checksum collisions
+* changing the checksum algorithm becomes harder
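
A minimal sketch of the content-addressed model under discussion, using Python's standard `hashlib` (Obnam itself deliberately does not work this way):

~~~python
# Minimal sketch of a content-addressed chunk store: the chunk's own
# SHA-256 digest is its identifier, so identical chunks de-duplicate
# automatically. This illustrates the model Obnam chose NOT to use.
import hashlib

store: dict[str, bytes] = {}

def put(chunk: bytes) -> str:
    """Store a chunk under its content checksum and return the id."""
    chunk_id = hashlib.sha256(chunk).hexdigest()
    store[chunk_id] = chunk   # re-storing identical content is a no-op
    return chunk_id

a = put(b"some live data")
b = put(b"some live data")   # duplicate: same id, nothing new stored
~~~

De-duplication falls out for free, which is the model's main appeal; the drawbacks listed above are the price.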
+
+In 2005, the author of the [git version control system][] chose the
+content-addressable storage model, using the SHA1 checksum algorithm.
+At the time, the git author considered SHA1 to be reasonably strong
+from a cryptographic and security point of view, for git. In other
+words, given the output of SHA1, it was difficult to deduce what the
+input was, or to find another input that would give the same output,
+known as a checksum collision. It is still difficult to deduce the
+input, but manufacturing collisions is now feasible, with some
+constraints. The git project has spent years changing the checksum
+algorithm.
+
+Collisions are problematic for security applications of checksum
+algorithms in general. Checksums are used, for example, in storing and
+verifying passwords: the cleartext password is never stored, and
+instead a checksum of it is computed and stored. To verify a later
+login attempt, a new checksum is computed from the newly entered
+password. If the two checksums match, the password is
+accepted.[^passwords] This means that if an attacker can find _any_ input that
+gives the same output for the checksum algorithm used for password
+storage, they can log in as if they were a valid user, whether or not
+their input is the real password.
+
+[^passwords]: In reality, storing passwords securely is much more
+ complicated than described here.
+
+For backups, and version control systems, collisions cause a different
+problem: they can prevent the correct content from being stored. If
+two files (or chunks) have the same checksum, only one will be stored.
+If the files have different content, this is a problem. A backup
+system should guard against this possibility.
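The data-loss scenario can be made concrete with a sketch. Since no SHA-256 collision is publicly known, the example below simulates one with a deliberately weakened, truncated checksum:

~~~python
# Why collisions are dangerous for a de-duplicating store. No real
# SHA-256 collision is known, so this sketch uses a deliberately weak
# 8-bit truncated digest to make a collision easy to manufacture.
import hashlib

def weak_id(chunk: bytes) -> str:
    # Truncated to one byte of hex: collisions are trivial to find.
    return hashlib.sha256(chunk).hexdigest()[:2]

store: dict[str, bytes] = {}

def put(chunk: bytes) -> str:
    chunk_id = weak_id(chunk)
    if chunk_id not in store:   # "already have it" de-duplication logic
        store[chunk_id] = chunk
    return chunk_id

# Brute-force two different chunks with the same weak checksum.
first = b"chunk-0"
collision = next(b"chunk-%d" % i for i in range(1, 10000)
                 if weak_id(b"chunk-%d" % i) == weak_id(first))

put(first)
put(collision)   # silently NOT stored: the second file's content is lost
~~~

With only 256 possible identifiers the collision is easy to find here; with a strong checksum it is merely very hard, not impossible, which is why a backup system should not rely on it.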
+
+As an extreme and rare, but real, case consider a researcher of
+checksum algorithms. They've spent enormous effort to produce two
+distinct files that have the same checksum. They should be able to
+make a backup of the files, restore them, and not lose either one.
+They should
+not have to know that their backup system uses the same checksum
+algorithm they are researching, and have to guard against the backup
+system getting the files confused. (Backup systems should be boring
+and just always work.)
+
+Attacks on security-sensitive cryptographic algorithms only get
+stronger over time. It is therefore necessary for Obnam to be able to
+easily change the checksum algorithm it uses, without disruption for
+users. To achieve this, Obnam does not use content-addressable storage.
+
+Obnam will (eventually, as this hasn't been implemented yet) allow
+storing multiple checksums for each chunk. It will use the strongest
+checksum available for a chunk. Over time, the checksums for chunks
+can be replaced with stronger ones. This will allow Obnam to migrate
+to a stronger algorithm when attacks against the current one become
+too scary.
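
A rough sketch of what such a multi-checksum scheme could look like; the preference order, field names, and function are assumptions for illustration, since this is not yet implemented in Obnam:

~~~python
# Hypothetical sketch of per-chunk checksum metadata that permits
# algorithm migration: each chunk carries one or more labeled checksums,
# and the client prefers the strongest algorithm available. The
# preference order and structure are assumptions, not Obnam's design.
import hashlib

# Strongest first; an assumed ordering for illustration.
PREFERENCE = ["sha3-256", "sha256", "sha1"]

def strongest_checksum(checksums: dict[str, str]) -> tuple[str, str]:
    """Pick the strongest checksum recorded for a chunk."""
    for algo in PREFERENCE:
        if algo in checksums:
            return algo, checksums[algo]
    raise ValueError("no known checksum algorithm for chunk")

# An old chunk with only a SHA1 checksum can later gain a stronger one
# without re-uploading its content.
chunk = b"chunk content"
checksums = {"sha1": hashlib.sha1(chunk).hexdigest()}
algo_before, _ = strongest_checksum(checksums)
checksums["sha256"] = hashlib.sha256(chunk).hexdigest()
algo_after, _ = strongest_checksum(checksums)
~~~

Migration then becomes a background task of adding stronger checksums chunk by chunk, rather than a disruptive re-upload of everything.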
# File metadata