[[!meta title="Encryption support in Obnam"]]

Obnam needs to support encryption of the backup store. This document describes the requirements for, and the design of, encryption in Obnam. It is currently a **DRAFT** and feedback is very much welcome.

[[!toc ]]

Goal
----

The goal of encryption in Obnam is to prevent unauthorized parties from gaining access to backed up data. If the backup store is on a server on the Internet, and someone breaks into the server, the data should be safe from prying eyes. Likewise, if the backup store is on a USB hard disk, and the disk gets stolen, the data should not leak.

This document does not worry about the data as it is being transferred. The data is encrypted on the machine running obnam, which reads the live data and stores it in a backup store. Up to three machines may be involved, and the links between them are assumed to use sftp or another sensible protocol. (In other words, obnam may be running on one machine, the live data may be on another one, and the backup store on a third one.)

For the purpose of this document, the server hosting the backup store is considered to be unauthorized to access the data, or to do anything to the data except store it. This means that if the server needs to do things like remove unwanted backup generations, it needs to be explicitly authorized to do so, by giving it encryption keys. By default, it should not have that access, and will see only encrypted data.

Requirements
------------

* All data in the backup store MUST be encrypted. Filenames are not encrypted, but Obnam already uses random filenames whenever it can.
* The backup store does not need to store any backup keys. It MUST be enough for the keys to be stored in the clients only.
* For disaster recovery, it may be necessary to be able to access the backup store only using a passphrase. This should be supported by the design, but use of this feature MUST be optional.
* Clients can replace public keys used with the backup store at any time, without having to re-backup anything. After keys have been replaced, the old keys MUST NOT be able to access data in the backup store.
* The design MUST support use of multiple keys. Each client MUST be able to use their own key, and disallow anyone else from accessing their data, even if the backup store is shared. It should be possible for the server to have its own key that it uses for forgetting old generations, running fsck on the entire store, and so on.

The encryption keys for accessing the backup store are not used for authenticating access to the backup store, to allow flexibility in selecting store providers.

Encryption methods
------------------

There are two relevant encryption methods for Obnam:

* public key cryptography
* symmetric cryptography

With public keys, there is an optional passphrase for the secret key. With symmetric encryption, the passphrase is the key. The user needs to manage the secret key and its passphrase in a suitable manner.

Design
------

All data in the backup store is stored either in B-tree nodes, or in a special directory tree for chunks of file data. Each B-tree is stored in its own directory tree, and each node is stored in its own file. We'll call each of these directory trees a _toplevel_. Thus the root of a backup store consists of some number of toplevel directories.

The backup store is used by some set of clients, plus perhaps the backup server. We'll call each of these a _user_. Each user may have its own public/private key pair, or key pairs may be shared between clients. This is up to the backup admins. The design assumes they all have unique key pairs.

Some toplevels are private for a particular client. Others are shared.

Every file in a toplevel is encrypted using symmetric encryption. The symmetric encryption key is encrypted with the public key of every user who needs access to the toplevel.
In other words, to have access to the contents of the toplevel, a user needs to be able to decrypt the symmetric encryption key using their own key.

To stop a user from having access to a toplevel, the symmetric encryption key file gets re-encrypted with every key except the unwanted one. This does not stop an attacker who has previously stored a local copy of the decrypted symmetric encryption key from decrypting the files. To stop that, every file in the toplevel needs to be re-encrypted with a new symmetric encryption key.

The encrypted symmetric encryption key is stored inside the toplevel, using a well-known name. This is the only file not encrypted with the symmetric encryption key.

The public keys for users who should have access to the toplevel are also stored inside the toplevel, in a file encrypted with the symmetric encryption key.

The symmetric encryption keys are generated randomly, and are long enough that brute-forcing them is not realistic. Perhaps something like 256 bits from /dev/random.

To give a user access to the repository, the repository admin needs to add the user's public key to the shared toplevels: client list, chunks, and chunk checksum data.

A client may remove itself from the repository, or an admin may do that. "Admin" in this context is anyone with access to all the shared toplevels.
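
As a minimal sketch of the key generation step, assuming Python's standard `secrets` module as the source of randomness: the design only says "something like 256 bits from /dev/random", so the function name, the default size, and the hex encoding here are illustrative assumptions rather than anything Obnam defines.

```python
import secrets

SYMMETRIC_KEY_BITS = 256  # assumption: the size suggested in the text


def generate_symmetric_key(bits=SYMMETRIC_KEY_BITS):
    """Return a random key with the given number of bits, as a hex string."""
    if bits <= 0 or bits % 8 != 0:
        raise ValueError("bits must be a positive multiple of 8")
    # secrets.token_bytes draws from the operating system's CSPRNG.
    return secrets.token_bytes(bits // 8).hex()
```

The hex encoding is only a convenience so that the key is printable and can be handed to another tool as a passphrase; a different encoding would work equally well.
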

Example
-------

    chunks/         -- toplevel for file data chunks
        key         -- symmetric encryption key
        userkeys    -- public keys of all users of this toplevel
        *           -- all other files

To set up a toplevel for encryption:

* generate a suitable amount of random data to use as the symmetric encryption key
* create the list of public keys that are to have access to the toplevel
* encrypt the public key list using the symmetric encryption key
* upload the encrypted public key list
* encrypt the symmetric encryption key with every public key in the list
* upload the encrypted symmetric encryption key

To add a file to the toplevel:

* download the `key` file
* decrypt the `key` file using the user's private key to get the symmetric encryption key
* encrypt the file using the symmetric encryption key
* upload the encrypted file

To use a file in the toplevel:

* download the `key` file
* decrypt `key` using the user's private key
* download the desired file
* decrypt the file using the symmetric encryption key

Discussion
----------

The simplest approach would be to only use public key encryption, but this makes it difficult to change the keys. Changing the keys is necessary to handle scenarios like giving access to the shared toplevels to a new user with a new key pair.

If the encrypted symmetric encryption key were not stored in the toplevel, it would need to be distributed to every user, and re-distributed whenever it changes, which is cumbersome. It would also be possible to re-encrypt everything in the toplevel for every new user, but that is laughably inefficient.

However, it would be acceptably simple to support the scenario of distributing the symmetric encryption key to every user, if the backup admin thinks storing it on the backup server even in encrypted form is too risky.

I have removed [[data signing|signing]] from this spec, on the suggestion of Daniel Kahn Gillmor. Data signing will be dealt with separately.

I am going to assume that any public keys being used are generated by the user, not by obnam.

I am not an encryption expert.
I will not be implementing my own encryption code, and do not even want to choose the specific algorithms or key formats. I will be using GnuPG for all encryption operations, because it is well-known and well-respected, and lets me outsource all thinking.

Implementation outline
----------------------

General repository I/O operations (these correspond to the `mkdir`, `write_file`, and `cat` operations in the Obnam VFS layer):

    def repo_mkdir(pathname):
        # create a new directory

    def repo_write_file(pathname, contents):
        # write a file to the repository (pathname is relative to repo)

    def repo_read_file(pathname):
        # read contents of a file in the repository (pathname is
        # relative to repo)

General encryption routines:

    def generate_symmetric_key():
        # return N random bits to be used as a symmetric encryption key

    def encrypt_with_symmetric_key(data, symmetric_key):
        # return data encrypted using symmetric encryption

    def decrypt_with_symmetric_key(encrypted, symmetric_key):
        # return data after it has been decrypted using symmetric encryption

    def encrypt_with_pubkeys(data, pubkeys):
        # return data after it has been encrypted for all of the given
        # public keys

    def decrypt_with_secret_key(encrypted, secret_key):
        # decrypt encrypted data using a secret key; this will fail unless
        # the data was encrypted using the public key corresponding to
        # the secret key

Keyring handling in memory:

    def create_empty_keyring():
        ...

    def add_to_keyring(keyring, key):
        ...

    def keyring_contains(keyring, key):
        ...

    def remove_from_keyring(keyring, key):
        ...

    def encode_keyring(keyring):
        # Return form of keyring that can be stored on disk.

    def decode_keyring(encoded):
        # Inverse of encode_keyring.
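
One way the in-memory keyring operations could be filled in is sketched below. It assumes, purely for illustration, that each key is a text string (for example an ASCII-armored public key) and that the on-disk form is a JSON list; the document does not specify either, and a real implementation might delegate keyring handling to GnuPG instead.

```python
import json


def create_empty_keyring():
    # Assumption: a keyring is simply a set of key strings.
    return set()


def add_to_keyring(keyring, key):
    keyring.add(key)


def keyring_contains(keyring, key):
    return key in keyring


def remove_from_keyring(keyring, key):
    # discard() does nothing if the key is not present.
    keyring.discard(key)


def encode_keyring(keyring):
    # Sort so that the same keyring always encodes to the same bytes.
    return json.dumps(sorted(keyring)).encode("utf-8")


def decode_keyring(encoded):
    # Inverse of encode_keyring.
    return set(json.loads(encoded.decode("utf-8")))
```

With this representation, `decode_keyring(encode_keyring(keyring))` gives back an equal keyring, which is the property the encrypted `userkeys` file relies on.
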

Create a new toplevel:

    def create_toplevel(name, pubkeys):
        repo_mkdir(name)

        # The symmetric key is stored encrypted for every public key.
        symmetric_key = generate_symmetric_key()
        encrypted = encrypt_with_pubkeys(symmetric_key, pubkeys)
        repo_write_file(name + '/key', encrypted)

        # The list of public keys is stored encrypted with the symmetric key.
        keyring = create_empty_keyring()
        for pubkey in pubkeys:
            add_to_keyring(keyring, pubkey)
        encoded = encode_keyring(keyring)
        encrypted = encrypt_with_symmetric_key(encoded, symmetric_key)
        repo_write_file(name + '/userkeys', encrypted)

Reading and writing files in a toplevel:

    def get_symmetric_key(toplevel, secret_key):
        encoded = repo_read_file(toplevel + '/key')
        return decrypt_with_secret_key(encoded, secret_key)

    def toplevel_read_file(toplevel, filename, secret_key):
        symmetric_key = get_symmetric_key(toplevel, secret_key)
        encoded = repo_read_file(toplevel + '/' + filename)
        return decrypt_with_symmetric_key(encoded, symmetric_key)

    def toplevel_write_file(toplevel, filename, cleartext, secret_key):
        symmetric_key = get_symmetric_key(toplevel, secret_key)
        encoded = encrypt_with_symmetric_key(cleartext, symmetric_key)
        repo_write_file(toplevel + '/' + filename, encoded)

Manage keys for a toplevel:

    def read_keyring(toplevel, name, secret_key):
        encoded = toplevel_read_file(toplevel, name, secret_key)
        return decode_keyring(encoded)

    def write_keyring(toplevel, name, keyring, secret_key):
        encoded = encode_keyring(keyring)
        toplevel_write_file(toplevel, name, encoded, secret_key)

    def add_to_userkeys(toplevel, public_key, secret_key):
        userkeys = read_keyring(toplevel, 'userkeys', secret_key)
        if not keyring_contains(userkeys, public_key):
            add_to_keyring(userkeys, public_key)
            write_keyring(toplevel, 'userkeys', userkeys, secret_key)

    def remove_from_userkeys(toplevel, public_key, secret_key):
        userkeys = read_keyring(toplevel, 'userkeys', secret_key)
        if keyring_contains(userkeys, public_key):
            remove_from_keyring(userkeys, public_key)
            write_keyring(toplevel, 'userkeys', userkeys, secret_key)

Repository client management:

    def add_client(client_public_key, admin_secret_key):
        add_to_userkeys('metadata', client_public_key, admin_secret_key)
        add_to_userkeys('clientlist', client_public_key, admin_secret_key)
        add_to_userkeys('chunks', client_public_key, admin_secret_key)
        add_to_userkeys('chunksums', client_public_key, admin_secret_key)
        # client will add itself to the clientlist and create its own toplevel

    def remove_client(client_public_key, admin_secret_key):
        # client may remove itself, since it has access to the symmetric keys
        # we assume the client-specific toplevel has already been removed
        remove_from_userkeys('chunksums', client_public_key, admin_secret_key)
        remove_from_userkeys('chunks', client_public_key, admin_secret_key)
        remove_from_userkeys('clientlist', client_public_key, admin_secret_key)
        remove_from_userkeys('metadata', client_public_key, admin_secret_key)

Hooks in Obnam
--------------

Obnam's Repository class needs to have a pair of hooks for modifying data before it gets written to the repository, and after it has been read. These modifications should be each other's inverse functions. Apart from encryption, these hooks could be used for error correction codes for data in the store, and perhaps other things. The repository should just provide and call the hooks, and not otherwise concern itself with encryption.

These hooks are not needed at the VFS layer, since it is not necessary to decrypt live data, nor encrypt data that is being restored.

The hooks correspond to `create_toplevel`, `toplevel_read_file`, and `toplevel_write_file` above. However, to allow chains of callbacks for the hooks, the encryption callback should return the data instead of writing it out to the repository. The next callback will get the encrypted data, and add, say, error correction codes to it. Finally, when all callbacks are done, the encrypted and error-corrected blob gets written to the repository.
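
The chaining idea can be sketched as follows. This is only an illustration under loud assumptions: `run_write_chain` and `run_read_chain` are hypothetical names, the XOR step is a stand-in for real encryption (it provides no security whatsoever), and the CRC32 prefix is a stand-in for error correction codes (it detects errors but cannot correct them).

```python
import zlib


def run_write_chain(callbacks, data):
    # Each callback receives the previous callback's output.
    for callback in callbacks:
        data = callback(data)
    return data


def run_read_chain(callbacks, data):
    # Undo the write chain: inverse callbacks run in the opposite order.
    for callback in reversed(callbacks):
        data = callback(data)
    return data


def fake_encrypt(data):
    # Stand-in for encryption; XOR with a constant is NOT secure.
    return bytes(b ^ 0x5A for b in data)


fake_decrypt = fake_encrypt  # XOR with a constant is its own inverse


def add_checksum(data):
    # Stand-in for ECC: prefix the blob with its CRC32.
    return zlib.crc32(data).to_bytes(4, "big") + data


def strip_checksum(data):
    stored, payload = data[:4], data[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise ValueError("checksum mismatch")
    return payload


# Both lists are given in write order; the read chain walks its list backwards.
write_callbacks = [fake_encrypt, add_checksum]
read_callbacks = [fake_decrypt, strip_checksum]

blob = run_write_chain(write_callbacks, b"file contents")
restored = run_read_chain(read_callbacks, blob)
```

Only `blob` would be written to the repository; neither callback touches the repository itself, which is what lets further callbacks be added to the chain.
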

Thus, Repository should provide the following hooks:

* `repo-create-toplevel(name)`: called whenever the repository has created a new toplevel directory
* `repo-write(toplevel, filename, data)`: called by the repository to prepare data to be written to the repository
* `repo-read(toplevel, filename, data)`: called by the repository for data that has been read from the repository

The hook subsystem needs to have a way to order callbacks, and for each callback to return a modified form of the data for the next callback to process (instead of the next callback processing the original data).

Callback ordering is important so that encryption always happens before ECC encoding: there's no point in ECC if it happens before encryption. Since the universe of likely Obnam plugins is small, and it can be assumed that plugin authors co-operate, we can achieve ordering most simply by having an optional integer _order_ argument to the hook registration method.

    def add_callback(self, name, callback, order=None):

Any callbacks without an explicit order will be put at the end of the callback chain, and all others at the beginning, sorted by the order argument in increasing order.

We can then arrange for the callback registrations for encryption and ECC to use appropriate ordering:

    hooks.add_callback('repo-write', encrypt, order=1000)
    hooks.add_callback('repo-write', ecc, order=2000)

(Reverse the ordering for `repo-read`, of course.)

Arranging for a hook to be able to modify the data is a bit trickier. Ideally, it could just return the new data, but the general purpose nature of the hook subsystem means that it does not know what the arguments for a hook are.

Thanks
------

Thank you to Richard Braakman, Peter Palfrader, Jaakko Niemi, and Daniel Kahn Gillmor for feedback. Any problems that remain are my fault.