From ee530a0f68743e6ef24e863254480673ff6d3319 Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 5 Feb 2021 07:32:36 +0200
Subject: refactor: reword for clarity

---
 obnam.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/obnam.md b/obnam.md
index 833fbf4..c5a90c4 100644
--- a/obnam.md
+++ b/obnam.md
@@ -266,8 +266,8 @@ The responsibilities of the server are roughly:
 The responsibilities of the client are roughly:
 * split live data into chunks, upload them to server
-* store metadata of live data files in a file, which represents a
-  backup generation, store that too as chunks on the server
+* store metadata of live data files in a generation file, store that
+  too as chunks on the server
 * retrieve chunks from server when restoring
 * let user manage sharing of backups with other clients
--
cgit v1.2.1


From 33e58a5e30c7838bb367712db94942b6289ccdae Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 5 Feb 2021 07:34:37 +0200
Subject: refactor: mention that generation is an sqlite db

---
 obnam.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/obnam.md b/obnam.md
index c5a90c4..e3ee47d 100644
--- a/obnam.md
+++ b/obnam.md
@@ -266,8 +266,8 @@ The responsibilities of the server are roughly:
 The responsibilities of the client are roughly:
 * split live data into chunks, upload them to server
-* store metadata of live data files in a generation file, store that
-  too as chunks on the server
+* store metadata of live data files in a generation file (an SQLite
+  database), store that too as chunks on the server
 * retrieve chunks from server when restoring
 * let user manage sharing of backups with other clients
--
cgit v1.2.1


From 172b48475cf8a3188398a41a863703188601ec9e Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 5 Feb 2021 08:41:59 +0200
Subject: doc: add section explaining the logical structure of backups

---
 obnam.md | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/obnam.md b/obnam.md
index e3ee47d..ae51d25 100644
--- a/obnam.md
+++ b/obnam.md
@@ -282,6 +282,77 @@ RSA-signed JSON Web Tokens. The server is configured to trust specific
 public keys. The clients have the private keys and generate the tokens
 themselves.
+
+## Logical structure of backups
+
+For each backup (generation) the client stores, on the server, exactly
+one _generation chunk_. This is a chunk that is specially marked as a
+generation, but is otherwise not special. The generation chunk content
+is a list of identifiers for chunks that form an SQLite database.
+
+The SQLite database lists all the files in the backup, as well as
+their metadata. For each file, a list of chunk identifiers is stored,
+for the content of the file. The chunks may be shared between files in
+the same backup or different backups.
+
+File content data chunks are just blobs of data with no structure.
+They have no reference to other data chunks, or to files or backups.
+This makes it easier to share them between files.
+
+Let's look at an example. In the figure below there are three backups,
+each using three chunks for file content data. One chunk, "data chunk
+3", is shared between all three backups.
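The per-generation database described in the patch above can be sketched with a toy schema. This is only an illustration, not Obnam's actual schema: the table and column names below are invented. The sketch shows the idea of an SQLite database mapping each file to the ordered chunk identifiers that hold its content.

```python
import sqlite3

# Illustrative sketch only: the real Obnam schema is not specified here.
# One table holds per-file metadata; another maps each file to the
# ordered list of chunk identifiers that hold its content.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE files (fileno INTEGER PRIMARY KEY, path TEXT, metadata TEXT);
    CREATE TABLE chunks (fileno INTEGER, seq INTEGER, chunk_id TEXT,
                         PRIMARY KEY (fileno, seq));
""")
db.execute("INSERT INTO files VALUES (1, '/home/user/notes.txt', '{}')")
db.execute("INSERT INTO chunks VALUES (1, 0, 'chunk-3')")
db.execute("INSERT INTO chunks VALUES (1, 1, 'chunk-4')")

# Restoring a file means looking up its chunk identifiers in order and
# fetching those chunks from the server.
ids = [row[0] for row in db.execute(
    "SELECT chunk_id FROM chunks WHERE fileno = 1 ORDER BY seq")]
print(ids)  # -> ['chunk-3', 'chunk-4']
```

Because data chunks carry no back-references, a chunk id such as the hypothetical `chunk-3` here could equally appear in another file's rows, or in another generation's database, which is how the sharing shown in the figure works.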
+
+~~~pikchr
+GEN1: ellipse "Backup 1" big big
+move 200%
+GEN2: ellipse "Backup 2" big big
+move 200%
+GEN3: ellipse "Backup 3" big big
+
+arrow from GEN1.e right to GEN2.w
+arrow from GEN2.e right to GEN3.w
+
+arrow from GEN1.s down 100%
+DB1: box "SQLite" big big
+
+arrow from DB1.s left 20% then down 100%
+C1: file "data" big big "chunk 1" big big
+
+arrow from DB1.s right 0% then down 70% then right 100% then down 30%
+C2: file "data" big big "chunk 2" big big
+
+arrow from DB1.s right 20% then down 30% then right 200% then down 70%
+C3: file "data" big big "chunk 3" big big
+
+
+
+arrow from GEN2.s down 100%
+DB2: box "SQLite" big big
+
+arrow from DB2.s left 20% then down 100% then down 0.5*C3.height then to C3.e
+
+arrow from DB2.s right 0% then down 70% then right 100% then down 30%
+C4: file "data" big big "chunk 4" big big
+
+arrow from DB2.s right 20% then down 30% then right 200% then down 70%
+C5: file "data" big big "chunk 5" big big
+
+
+
+
+arrow from GEN3.s down 100%
+DB3: box "SQLite" big big
+
+arrow from DB3.s left 50% then down 100% then down 1.5*C3.height \
+  then left until even with C3.s then up to C3.s
+
+arrow from DB3.s right 20% then down 100%
+C6: file "data" big big "chunk 6" big big
+
+arrow from DB3.s right 60% then down 70% then right 100% then down 30%
+C7: file "data" big big "chunk 7" big big
+~~~
+
 # File metadata
--
cgit v1.2.1


From 8ef1ae2040f669b3835d5a7f1f1dfb01ac23566c Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 5 Feb 2021 08:44:15 +0200
Subject: refactor: move things around to concentrate on important bits first

---
 obnam.md | 120 +++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 60 insertions(+), 60 deletions(-)

diff --git a/obnam.md b/obnam.md
index ae51d25..09a226c 100644
--- a/obnam.md
+++ b/obnam.md
@@ -154,66 +154,6 @@ requirements and notes how they affect the architecture.
   access that.
-## On SFTP versus HTTPS
-
-Obnam1 supported using a standard SFTP server as a backup repository,
-and this was a popular feature. This section argues against supporting
-SFTP in Obnam2.
-
-The performance requirement for network use means favoring protocols
-such as HTTPS, or even QUIC, rather than SFTP.
-
-SFTP works on top of SSH. SSH provides a TCP-like abstraction for
-SFTP, and thus multiple SFTP connections can run over the same SSH
-connection. However, SSH itself uses a single TCP connection. If that
-TCP connection has a dropped packet, all traffic over the SSH
-connections, including all SFTP connections, waits until TCP
-re-transmits the lost packet and re-synchronizes itself.
-
-With multiple HTTP connections, each on its own TCP connection, a
-single dropped packet will not affect other HTTP transactions. Even
-better, the new QUIC protocol doesn't use TCP.
-
-The modern Internet is to a large degree designed for massive use of
-the world wide web, which is all HTTP, and adopting QUIC. It seems
-wise for Obnam to make use of technologies that have been designed
-for, and proven to work well with concurrency and network problems.
-
-Further, having used SFTP with Obnam1, it is not always an easy
-protocol to use. Further, if there is a desire to have controlled
-sharing of parts of one client's data with another, this would require
-writing a custom SFTP service, which seems much harder to do than
-writing a custom HTTP service. From experience, a custom HTTP service
-is easy to do. A custom SFTP service would need to shoehorn the
-abstractions it needs into something that looks more or less like a
-Unix file system.
-
-The benefit of using SFTP would be that a standard SFTP service could
-be used, if partial data sharing between clients is not needed. This
-would simplify deployment and operations for many. However, it doesn't
-seem important enough to warrant the implementation effort.
-
-Supporting both HTTP and SFTP would be possible, but also much more
-work and against the desire to keep things simple.
-
-## On "btrfs send" and similar constructs
-
-The btrfs and ZFS file systems, and possibly others, have a way to
-mark specific states of the file system and efficiently generate a
-"delta file" of all the changes between the states. The delta can be
-transferred elsewhere, and applied to a copy of the file system. This
-can be quite efficient, but Obnam won't be built on top of such a
-system.
-
-On the one hand, it would force the use of specific file systems:
-Obnam would not be able to back up data on, say, an ext4 file system,
-which seems to be the most popular one by far.
-
-Worse, it also forces the data to be restored to the same type of file
-system as where the live data was originally. This is onerous for people
-to do.
-
-
 ## Overall shape
 
 It seems fairly clear that a simple shape of the software architecture
@@ -354,6 +294,66 @@ C7: file "data" big big "chunk 7" big big
 ~~~
+
+## On SFTP versus HTTPS
+
+Obnam1 supported using a standard SFTP server as a backup repository,
+and this was a popular feature. This section argues against supporting
+SFTP in Obnam2.
+
+The performance requirement for network use means favoring protocols
+such as HTTPS, or even QUIC, rather than SFTP.
+
+SFTP works on top of SSH. SSH provides a TCP-like abstraction for
+SFTP, and thus multiple SFTP connections can run over the same SSH
+connection. However, SSH itself uses a single TCP connection. If that
+TCP connection has a dropped packet, all traffic over the SSH
+connections, including all SFTP connections, waits until TCP
+re-transmits the lost packet and re-synchronizes itself.
+
+With multiple HTTP connections, each on its own TCP connection, a
+single dropped packet will not affect other HTTP transactions. Even
+better, the new QUIC protocol doesn't use TCP.
+
+The modern Internet is to a large degree designed for massive use of
+the world wide web, which is all HTTP, and adopting QUIC. It seems
+wise for Obnam to make use of technologies that have been designed
+for, and proven to work well with concurrency and network problems.
+
+Further, having used SFTP with Obnam1, it is not always an easy
+protocol to use. Further, if there is a desire to have controlled
+sharing of parts of one client's data with another, this would require
+writing a custom SFTP service, which seems much harder to do than
+writing a custom HTTP service. From experience, a custom HTTP service
+is easy to do. A custom SFTP service would need to shoehorn the
+abstractions it needs into something that looks more or less like a
+Unix file system.
+
+The benefit of using SFTP would be that a standard SFTP service could
+be used, if partial data sharing between clients is not needed. This
+would simplify deployment and operations for many. However, it doesn't
+seem important enough to warrant the implementation effort.
+
+Supporting both HTTP and SFTP would be possible, but also much more
+work and against the desire to keep things simple.
+
+## On "btrfs send" and similar constructs
+
+The btrfs and ZFS file systems, and possibly others, have a way to
+mark specific states of the file system and efficiently generate a
+"delta file" of all the changes between the states. The delta can be
+transferred elsewhere, and applied to a copy of the file system. This
+can be quite efficient, but Obnam won't be built on top of such a
+system.
+
+On the one hand, it would force the use of specific file systems:
+Obnam would not be able to back up data on, say, an ext4 file system,
+which seems to be the most popular one by far.
+
+Worse, it also forces the data to be restored to the same type of file
+system as where the live data was originally. This is onerous for people
+to do.
+
+
 # File metadata
 
 Files in a file system contain data and have metadata: data about the
--
cgit v1.2.1


From 3bfc9458c6e03751cf5067491347bb43d86bd372 Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Fri, 5 Feb 2021 10:18:31 +0200
Subject: doc: address the concept of content-addressable storage

---
 obnam.md | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/obnam.md b/obnam.md
index 09a226c..eaab2e8 100644
--- a/obnam.md
+++ b/obnam.md
@@ -354,6 +354,71 @@ system as where the live data was originally. This is onerous for people
 to do.
+
+## On content addressable storage
+
+[content-addressable storage]: https://en.wikipedia.org/wiki/Content-addressable_storage
+[git version control system]: https://git-scm.com/
+
+It would be possible to use the cryptographic checksum ("hash") of the
+contents of a chunk as its identifier on the server side, also known
+as [content-addressable storage][]. This would simplify de-duplication
+of chunks. However, it also has some drawbacks:
+
+* it becomes harder to handle checksum collisions
+* changing the checksum algorithm becomes harder
+
+In 2005, the author of the [git version control system][] chose the
+content addressable storage model, using the SHA1 checksum algorithm.
+At the time, the git author considered SHA1 to be reasonably strong
+from a cryptographic and security point of view, for git. In other
+words, given the output of SHA1, it was difficult to deduce what the
+input was, or to find another input that would give the same output,
+known as a checksum collision. It is still difficult to deduce the
+input, but manufacturing collisions is now feasible, with some
+constraints. The git project has spent years changing the checksum
+algorithm.
+
+Collisions are problematic for security applications of checksum
+algorithms in general.
+Checksums are used, for example, in storing and
+verifying passwords: the cleartext password is never stored, and
+instead a checksum of it is computed and stored. To verify a later
+login attempt, a new checksum is computed from the newly entered
+password. If the checksums match, the password is
+accepted.[^passwords] This means that if an attacker can find _any_
+input that gives the same output for the checksum algorithm used for
+password storage, they can log in as if they were a valid user,
+whether or not the password they have is the same as the real one.
+
+[^passwords]: In reality, storing passwords securely is much more
+    complicated than described here.
+
+For backups, and version control systems, collisions cause a different
+problem: they can prevent the correct content from being stored. If
+two files (or chunks) have the same checksum, only one will be stored.
+If the files have different content, this is a problem. A backup
+system should guard against this possibility.
+
+As an extreme and rare, but real, case consider a researcher of
+checksum algorithms. They've spent enormous effort to produce two
+distinct files that have the same checksum. They should be able to
+make a backup of the files, and restore them, and not lose one. They
+should not have to know that their backup system uses the same
+checksum algorithm they are researching, and have to guard against the
+backup system getting the files confused. (Backup systems should be
+boring and just always work.)
+
+Attacks on security-sensitive cryptographic algorithms only get
+stronger by time. It is therefore necessary for Obnam to be able to
+easily change the checksum algorithm it uses, without disruption for
+users. To achieve this, Obnam does not use content-addressable storage.
+
+Obnam will (eventually, as this hasn't been implemented yet) allow
+storing multiple checksums for each chunk. It will use the strongest
+checksum available for a chunk.
+Over time, the checksums for chunks
+can be replaced with stronger ones. This will allow Obnam to migrate
+to a stronger algorithm when attacks against the current one become
+too scary.
+
 # File metadata
 
 Files in a file system contain data and have metadata: data about the
--
cgit v1.2.1
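The multi-checksum idea in the last patch can be sketched as follows. This is a guess at the eventual mechanism, not implemented Obnam behavior: the function names, the record layout, and the algorithm preference list are all invented for illustration. The key point is that the chunk identifier is random and independent of content, so checksums can be added or replaced without re-addressing any chunks.

```python
import hashlib
import uuid

# Strongest algorithm first; migrating to a new algorithm would mean
# prepending it here and computing the new checksum for existing chunks
# over time. (Illustrative list, not Obnam's actual choice.)
PREFERENCE = ["sha3_256", "sha256", "sha1"]

def checksums(data: bytes) -> dict:
    # Compute every known checksum for a chunk.
    return {name: hashlib.new(name, data).hexdigest() for name in PREFERENCE}

def strongest(sums: dict) -> tuple:
    # Pick the strongest checksum recorded for a chunk.
    for name in PREFERENCE:
        if name in sums:
            return name, sums[name]
    raise ValueError("chunk has no recognized checksum")

chunk = b"some live data"
record = {
    "chunk_id": str(uuid.uuid4()),  # random id, not derived from content
    "sums": checksums(chunk),
}
name, value = strongest(record["sums"])
print(name)  # -> sha3_256
```

Under this sketch, a collision in one algorithm never prevents two distinct chunks from being stored, because storage is keyed by the random `chunk_id`; the checksums are only metadata used for verification and de-duplication.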