[[!meta title="Encryption support in Obnam"]]

Obnam needs to support encryption of the backup store. This document
describes the requirements for, and the design of, encryption in Obnam.
It is currently a **DRAFT** and feedback is very much welcome.

[[!toc ]]


Goal
----

The goal of encryption in Obnam is to prevent unauthorized parties from
gaining access to backed up data. If the backup store is on a server
on the Internet, and someone breaks into the server, the data should
be safe from prying eyes. Likewise, if the backup store is on a USB
hard disk, and the disk gets stolen, the data should not leak.

This document does not worry about the data as it is being transferred.
Obnam reads the live data, encrypts it on the machine obnam is running
on, and stores it in a backup store. Up to three machines may be
involved: obnam may be running on one machine, the live data may be on
another one, and the backup store on a third one. The links between
them are assumed to be over sftp or another sensible protocol.

For the purpose of this document, the server hosting the backup store
is considered to be unauthorized to access the data, or do anything
to the data except store it. This means that if the server needs to
do things like removing unwanted backup generations, it needs to be
explicitly authorized to do so, by giving it encryption keys. By
default, it should not have that access, and will see only encrypted
data.


Requirements
------------

* All data in the backup store MUST be encrypted.
  Filenames in the store are not encrypted, but Obnam already uses
  random filenames whenever it can.
* The backup store does not need to store any backup keys. It MUST be 
  enough for the keys to be stored in the clients only.
* For disaster recovery, it may be necessary to be able to access the
  backup store only using a passphrase. This should be supported by the
  design, but use of this feature MUST be optional.
* Clients can replace public keys used with the backup store at any time,
  without having to re-backup anything. After keys have been replaced, the
  old keys MUST NOT be able to access data in the backup store.
* The design MUST support use of multiple keys. Each client MUST be able to
  use their own key, and disallow anyone else from accessing their data,
  even if the backup store is shared. It should be possible for the server
  to have its own key that it uses for forgetting old generations, running
  fsck on the entire store, and so on.
  
The encryption keys for accessing the backup store are not used for
authenticating access to the backup store, to allow flexibility in
selecting store providers.


Encryption methods
------------------

There are two relevant encryption methods for Obnam:

* public key cryptography
* symmetric cryptography

With public keys, there is an optional passphrase for the secret key.
With symmetric encryption, the passphrase is the key. 
The user needs to manage the secret key and its passphrase in a suitable
manner.
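
As an illustration of how a passphrase can serve as a symmetric key,
the sketch below stretches a passphrase into a 256-bit key with PBKDF2
from Python's standard library. This is an assumption made for
illustration only: Obnam is meant to leave such details to GnuPG, which
has its own passphrase-to-key scheme, and the function name, salt size,
and iteration count are hypothetical choices.

```python
import hashlib
import os

def key_from_passphrase(passphrase, salt, iterations=200_000):
    # Stretch the passphrase into 32 bytes (256 bits) of key material.
    # The salt has to be kept with the encrypted data so that the same
    # key can be derived again later.
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32)

salt = os.urandom(16)  # hypothetical choice: 16 random bytes of salt
key = key_from_passphrase("correct horse battery staple", salt)
```

The same passphrase and salt always give the same key, which is what
makes passphrase-only disaster recovery possible.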


Design
------

All data in the backup store is stored either in B-tree nodes,
or in a special directory tree for chunks of file data. Each B-tree
is stored in its own directory tree, and each node is stored in its
own file. We'll call each of these directory
trees a _toplevel_. Thus the root of a backup store consists of some
number of toplevel directories.

The backup store is used by some set of clients, plus perhaps the
backup server. We'll call each of these a user. Each user may have
its own public/private key pair, or they may be shared between
clients. This is up to the backup admins. The design assumes they
all have unique key pairs.

Some toplevels are private for a particular client. Others are
shared.

Every file in a toplevel is encrypted using symmetric encryption.
The symmetric encryption key is encrypted with the public
key of every user who needs access to the toplevel. In other words,
to have access to contents of the toplevel, a user needs to be able
to decrypt the symmetric encryption key using their own key.

To stop a user from having access to a toplevel, the symmetric encryption key
file gets re-encrypted with every other key except the unwanted one.
This does not stop a user who has previously stored a local copy of
the decrypted symmetric encryption key from reading the files in the
toplevel. To stop that, every file in the toplevel needs to be
re-encrypted with a new symmetric encryption key.
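
The difference between the two kinds of revocation can be shown with a
small mock. No real cryptography is involved: the hypothetical `seal`
and `unseal` functions only record which keys may open a piece of data,
standing in for the GnuPG operations. A toplevel is modelled as a plain
dictionary, and a user name stands in for that user's key pair. All
names here are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sealed:
    # Mock ciphertext: remembers which keys can open it. NOT real encryption.
    openers: frozenset
    payload: object

def seal(payload, keys):
    return Sealed(frozenset(keys), payload)

def unseal(sealed, key):
    if key not in sealed.openers:
        raise PermissionError("this key cannot decrypt the data")
    return sealed.payload

def new_toplevel(users, symmetric_key):
    # 'key' holds the symmetric key, sealed for every user's key.
    return {"key": seal(symmetric_key, users)}

def write_file(toplevel, name, data, user):
    symmetric_key = unseal(toplevel["key"], user)
    toplevel[name] = seal(data, [symmetric_key])

def revoke(toplevel, admin, remaining_users):
    # Cheap revocation: re-encrypt only the symmetric key file.
    symmetric_key = unseal(toplevel["key"], admin)
    toplevel["key"] = seal(symmetric_key, remaining_users)

def rekey(toplevel, admin, users, new_symmetric_key):
    # Full revocation: re-encrypt every file with a new symmetric key.
    old_key = unseal(toplevel["key"], admin)
    for name in list(toplevel):
        if name != "key":
            data = unseal(toplevel[name], old_key)
            toplevel[name] = seal(data, [new_symmetric_key])
    toplevel["key"] = seal(new_symmetric_key, users)

chunks = new_toplevel(["alice", "bob"], "sym-1")
write_file(chunks, "chunk-0001", b"file data", "alice")
leaked = unseal(chunks["key"], "bob")    # bob keeps a copy of the key
revoke(chunks, "alice", ["alice"])       # bob can no longer open 'key'...
still_readable = unseal(chunks["chunk-0001"], leaked)  # ...but his copy works
rekey(chunks, "alice", ["alice"], "sym-2")  # now the old copy is useless
```

After `revoke`, the saved copy of the symmetric key still opens existing
files; only after `rekey` does it stop working.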

The encrypted symmetric encryption key is stored inside
the toplevel, using a well-known name. This is the only file not
encrypted with the symmetric encryption key.

The public keys for users who should have access to the toplevel
are also stored inside the toplevel, in a file encrypted with the
symmetric encryption key.

The symmetric encryption keys are generated randomly,
and are of sufficient length that brute-forcing them will not be
realistic. Perhaps something like 256 bits from /dev/random.
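
A minimal sketch of generating such a key in Python, assuming that the
operating system's random source as exposed by `os.urandom` is an
acceptable substitute for reading `/dev/random` directly. The default of
256 bits is the length suggested above; the `bits` parameter is a
hypothetical convenience.

```python
import os

KEY_BITS = 256

def generate_symmetric_key(bits=KEY_BITS):
    # os.urandom returns bytes from the operating system's
    # cryptographically secure random number generator.
    return os.urandom(bits // 8)

key = generate_symmetric_key()
```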

To give a user access to the repository, the repository admin needs
to add the user's public key to the shared toplevels: client list,
chunks, and chunk checksum data. A client may remove itself from
the repository, or an admin may do that. "Admin" in this context
is anyone with access to all the shared toplevels.


Example
-------

    chunks/                     -- toplevel for file data chunks
        key                     -- symmetric encryption key
        userkeys                -- public keys of all users of this toplevel
        *                       -- all other files

To set up a toplevel for encryption:

* generate a suitable amount of random data to use as symmetric encryption key
* create list of public keys to have access to toplevel
* encrypt public key list using symmetric encryption key
* upload encrypted public key list
* encrypt symmetric encryption key with every public key in list
* upload encrypted symmetric encryption key

To add a file to the toplevel:

* download `key` file
* decrypt `key` file using user's private key, get key text
* encrypt file using symmetric encryption key
* upload encrypted file

To use a file in the toplevel:

* download `key` file
* decrypt `key` using user's private key
* download desired file
* decrypt file using symmetric encryption key

        
Discussion
----------

The simplest approach would be to only use public key encryption,
but this makes it difficult to change the keys. Changing the keys
is necessary to handle scenarios like giving access to the shared
toplevels to a new user with a new key pair. Otherwise the
symmetric encryption key 
needs to be distributed to every user, and re-distributed
if it ever changes, and this is cumbersome. It would also be possible
to re-encrypt everything in the toplevel for every new user, but
that is laughably inefficient. However, it would be acceptably 
simple to support the scenario of distributing the symmetric encryption key to
every user, if the backup admin thinks storing it on the backup
server even in encrypted form is too risky.

I have removed [[data signing|signing]] from this spec, on the suggestion of
Daniel Kahn Gillmor. Data signing will be dealt with separately.

I am going to assume that any public keys being used are generated
by the user, not by obnam.

I am not an encryption expert. I will not be implementing my own
encryption code, and do not even want to choose the specific
algorithms or key formats. I will be using GnuPG for all encryption
operations, because it is well-known and well-respected, and lets
me outsource all thinking.


Implementation outline
----------------------

General repository I/O operations (these correspond to the
`mkdir`, `write_file`, and `cat` operations in the Obnam VFS layer):

    def repo_mkdir(pathname):
        # create a new directory
        
    def repo_write_file(pathname, contents):
        # write a file to the repository (pathname is relative to repo)
        
    def repo_read_file(pathname):
        # read contents of a file in the repository (name is relative to repo)

General encryption routines:

    def generate_symmetric_key():
        # return N random bits to be used as a symmetric encryption key
        
    def encrypt_with_symmetric_key(data, symmetric_key):
        # return data encrypted using symmetric encryption
        
    def decrypt_with_symmetric_key(encrypted, symmetric_key):
        # return data after it has been decrypted using symmetric encryption
        
    def encrypt_with_pubkeys(data, pubkeys):
        # return data after it has been encrypted for all of the given 
        # public keys
        
    def decrypt_with_secret_key(encrypted, secret_key):
        # decrypt encrypted data using a secret key; this will fail unless
        # the data was encrypted using the public key corresponding to the
        # secret key

Keyring handling in memory:

    def create_empty_keyring():
        ...
        
    def add_to_keyring(keyring, key):
        ...
        
    def keyring_contains(keyring, key):
        ...
        
    def remove_from_keyring(keyring, key):
        ...
        
    def encode_keyring(keyring):
        # Return form of keyring that can be stored on disk.
        
    def decode_keyring(encoded):
        # Inverse of encode_keyring.
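
One possible in-memory representation, sketched under the assumption
that a key can be handled as an opaque text string (for example, an
ASCII-armoured public key) and that JSON is an acceptable stored
encoding. Neither choice comes from this design; an implementation
built on GnuPG might well keep a real GnuPG keyring instead.

```python
import json

def create_empty_keyring():
    return set()

def add_to_keyring(keyring, key):
    keyring.add(key)

def keyring_contains(keyring, key):
    return key in keyring

def remove_from_keyring(keyring, key):
    # discard() does nothing if the key is not in the keyring.
    keyring.discard(key)

def encode_keyring(keyring):
    # Sort so the encoded form does not depend on insertion order.
    return json.dumps(sorted(keyring)).encode("utf-8")

def decode_keyring(encoded):
    return set(json.loads(encoded.decode("utf-8")))

keyring = create_empty_keyring()
add_to_keyring(keyring, "alice-public-key")
add_to_keyring(keyring, "bob-public-key")
remove_from_keyring(keyring, "bob-public-key")
restored = decode_keyring(encode_keyring(keyring))
```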

Create a new toplevel:

    def create_toplevel(name, pubkeys):
        repo_mkdir(name)

        symmetric_key = generate_symmetric_key()
        encrypted = encrypt_with_pubkeys(symmetric_key, pubkeys)
        repo_write_file(name + '/key', encrypted)

        keyring = create_empty_keyring()
        for pubkey in pubkeys:
            add_to_keyring(keyring, pubkey)
        encoded = encode_keyring(keyring) 
        encrypted = encrypt_with_symmetric_key(encoded, symmetric_key)
        repo_write_file(name + '/userkeys', encrypted)

Reading and writing files in a toplevel:

    def get_symmetric_key(toplevel, secret_key):
        encoded = repo_read_file(toplevel + '/key')
        return decrypt_with_secret_key(encoded, secret_key)

    def toplevel_read_file(toplevel, filename, secret_key):
        symmetric_key = get_symmetric_key(toplevel, secret_key)
        encoded = repo_read_file(toplevel + '/' + filename)
        return decrypt_with_symmetric_key(encoded, symmetric_key)
        
    def toplevel_write_file(toplevel, filename, cleartext, secret_key):
        symmetric_key = get_symmetric_key(toplevel, secret_key)
        encoded = encrypt_with_symmetric_key(cleartext, symmetric_key)
        repo_write_file(toplevel + '/' + filename, encoded)

Manage keys for a toplevel:

    def read_keyring(toplevel, name, secret_key):
        encoded = toplevel_read_file(toplevel, name, secret_key)
        return decode_keyring(encoded)

    def write_keyring(toplevel, name, keyring, secret_key):
        encoded = encode_keyring(keyring)
        toplevel_write_file(toplevel, name, encoded, secret_key)

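    def add_to_userkeys(toplevel, public_key, secret_key):
        # NOTE: this only updates the list of public keys. For the new
        # user to gain access, the `key` file must also be re-encrypted
        # for the updated set of public keys (see Design); that step is
        # not shown in this outline.
        userkeys = read_keyring(toplevel, 'userkeys', secret_key)
        if not keyring_contains(userkeys, public_key):
            add_to_keyring(userkeys, public_key)
            write_keyring(toplevel, 'userkeys', userkeys, secret_key)

    def remove_from_userkeys(toplevel, public_key, secret_key):
        # NOTE: as above, the `key` file must also be re-encrypted for
        # the remaining public keys for the removal to take effect.
        userkeys = read_keyring(toplevel, 'userkeys', secret_key)
        if keyring_contains(userkeys, public_key):
            remove_from_keyring(userkeys, public_key)
            write_keyring(toplevel, 'userkeys', userkeys, secret_key)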

Repository client management:

    def add_client(client_public_key, admin_secret_key):
        add_to_userkeys('metadata', client_public_key, admin_secret_key)
        add_to_userkeys('clientlist', client_public_key, admin_secret_key)
        add_to_userkeys('chunks', client_public_key, admin_secret_key)
        add_to_userkeys('chunksums', client_public_key, admin_secret_key)
        # client will add itself to the clientlist and create its own toplevel
        
    def remove_client(client_public_key, admin_secret_key):
        # client may remove itself, since it has access to the symmetric keys
        # we assume the client-specific toplevel has already been removed
        remove_from_userkeys('chunksums', client_public_key, admin_secret_key)
        remove_from_userkeys('chunks', client_public_key, admin_secret_key)
        remove_from_userkeys('clientlist', client_public_key, admin_secret_key)
        remove_from_userkeys('metadata', client_public_key, admin_secret_key)
        

Hooks in Obnam
--------------

Obnam's Repository class needs to have a pair of hooks for modifying 
data before it gets written to the repository, and after it has been 
read. These modifications should be each other's inverse functions. 
Apart from encryption, these hooks could be used for error correction 
codes for data in the store, and perhaps other things. The repository
should just provide and call the hooks, and not otherwise concern
itself with encryption.

These hooks are not needed at the VFS layer, since it is not necessary
to decrypt live data, nor encrypt data that is being restored.

The hooks correspond to `create_toplevel`, `toplevel_read_file`,
and `toplevel_write_file` above. However, to allow chains of callbacks
for the hooks, instead of the encryption callback writing the
data out to the repository, it should return it instead. The next
callback will get the encrypted data, and add, say, error correction
codes to it. Finally, when all callbacks are done, the encrypted and
error-corrected blob gets written to the repository.

Thus, Repository should provide the following hooks:

* `repo-create-toplevel(name)`: called whenever the repository has created
  a new toplevel directory
* `repo-write(toplevel, filename, data)`: called by the repository
  to prepare data to be written to the repository
* `repo-read(toplevel, filename, data)`: called by the repository for data
  that has been read from the repository

The hook subsystem needs to have a way to order callbacks, and for
each callback to return a modified form of the data for the next
callback to process (instead of the next callback processing the
original data).

Callback ordering is important so that encryption always happens
before ECC encoding: there's no point in ECC if it happens before
encryption.

Since the universe of likely Obnam plugins is small, and it can be
assumed that plugin authors co-operate, we can achieve ordering 
most simply by having an optional integer _order_ argument to
the hook registration method.

    def add_callback(self, name, callback, order=None):

Any callbacks without an explicit order will be put at the end
of the callback chain, and all others at the beginning, sorted
into increasing order by the order argument.

We can then arrange for the callback registrations for encryption
and ECC to use appropriate ordering:

    hooks.add_callback('repo-write', encrypt, order=1000)
    hooks.add_callback('repo-write', ecc, order=2000)

(Reverse the ordering for `repo-read`, of course.)
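
A minimal sketch of such a registration method, assuming a hypothetical
`HookManager` class (the class name, its internals, and the `callbacks`
helper are illustrative, not Obnam's actual hook code). Ordered
callbacks come first, in increasing order; unordered ones follow in
registration order.

```python
class HookManager:
    def __init__(self):
        self._hooks = {}  # hook name -> list of (order, callback) pairs

    def add_callback(self, name, callback, order=None):
        self._hooks.setdefault(name, []).append((order, callback))

    def callbacks(self, name):
        entries = self._hooks.get(name, [])
        # sorted() is stable, so equal orders keep registration order.
        ordered = sorted(
            (e for e in entries if e[0] is not None), key=lambda e: e[0])
        unordered = [e for e in entries if e[0] is None]
        return [callback for _, callback in ordered + unordered]

# Placeholder callbacks; real ones would transform the data.
def encrypt(data):
    return data

def ecc(data):
    return data

def decrypt(data):
    return data

def ecc_decode(data):
    return data

def log_write(data):
    return data

hooks = HookManager()
hooks.add_callback('repo-write', log_write)             # no order: goes last
hooks.add_callback('repo-write', ecc, order=2000)
hooks.add_callback('repo-write', encrypt, order=1000)
hooks.add_callback('repo-read', decrypt, order=2000)    # reversed for reading
hooks.add_callback('repo-read', ecc_decode, order=1000)
```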

Arranging for a hook to be able to modify the data is a bit
trickier. Ideally, it could just return the new data, but the
general purpose nature of the hook subsystem means that it does
not know what the arguments for a hook are.
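
One possible way around this, offered as an assumption rather than as
part of the design, is a separate kind of hook call that follows a
convention: the data is always the last argument, and each callback
returns the data for the next one. In the sketch below,
`call_filter_hook` is a hypothetical helper, and compression and a
CRC-32 suffix stand in for encryption and ECC so that the example needs
only the standard library.

```python
import zlib

def call_filter_hook(callbacks, *args):
    # Convention (an assumption): the last argument is the data to filter.
    *fixed, data = args
    for callback in callbacks:
        data = callback(*fixed, data)
    return data

def compress(toplevel, filename, data):        # stand-in for encryption
    return zlib.compress(data)

def decompress(toplevel, filename, data):
    return zlib.decompress(data)

def add_checksum(toplevel, filename, data):    # stand-in for ECC
    return data + zlib.crc32(data).to_bytes(4, 'big')

def strip_checksum(toplevel, filename, data):
    body, checksum = data[:-4], data[-4:]
    if zlib.crc32(body).to_bytes(4, 'big') != checksum:
        raise ValueError('checksum mismatch in %s/%s' % (toplevel, filename))
    return body

write_chain = [compress, add_checksum]
read_chain = [strip_checksum, decompress]      # inverses, in reverse order

stored = call_filter_hook(write_chain, 'chunks', '0001', b'hello world')
restored = call_filter_hook(read_chain, 'chunks', '0001', stored)
```

The read chain applies the inverse of each write callback in the
opposite order, so the data read back is the data that was written.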

Thanks
------

Thank you to Richard Braakman, Peter Palfrader, Jaakko Niemi,
and Daniel Kahn Gillmor for feedback. Any problems that remain
are my fault.