[[!tag obnam-performance]]

Problem: If the chunk size is reasonably large (say, a megabyte), then
most files will be smaller than one chunk, and the repository ends up
with a large number of small chunk files.

Idea: collect chunks into groups, called "salsa tins".

- salsa tin = list of chunks
- salsa tin has an id
- chunk id = salsa tin id + suitable number of extra bits for
  index into list
- chunk id may be 64 bits total, or 64+32, or whatever seems convenient
- no chunk gets stored alone, only in salsa tins
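
A minimal sketch of the id scheme in the list above, assuming a 32-bit
index into the tin's chunk list (the list leaves the exact widths open).
The names `INDEX_BITS`, `make_chunk_id` and `split_chunk_id` are
hypothetical and not part of obnam.

```python
from typing import Tuple

# Assumption: 32 bits of index per salsa tin; the width is not fixed above.
INDEX_BITS = 32


def make_chunk_id(tin_id: int, index: int, index_bits: int = INDEX_BITS) -> int:
    """Combine a salsa tin id and a list index into one chunk id."""
    if tin_id < 0:
        raise ValueError("tin id must be non-negative")
    if not 0 <= index < (1 << index_bits):
        raise ValueError("index does not fit in the index bits")
    # The tin id occupies the high bits, the index the low bits.
    return (tin_id << index_bits) | index


def split_chunk_id(chunk_id: int, index_bits: int = INDEX_BITS) -> Tuple[int, int]:
    """Recover (tin id, index) from a chunk id made by make_chunk_id."""
    return chunk_id >> index_bits, chunk_id & ((1 << index_bits) - 1)
```

With this layout, finding a chunk means splitting its id, opening the
tin named by the high bits, and taking the entry at the given index.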

This lets a client put things into the repository at will, without
synchronisation or locking beyond what the filesystem provides
(exclusive creation of files).
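
Exclusive creation can be sketched like this on a local filesystem,
using `os.open` with `O_CREAT | O_EXCL`. The function name
`store_tin_exclusively` and the idea of passing the tin as one byte
string are assumptions for illustration; a real repository may be
remote and would need the equivalent operation there.

```python
import os


def store_tin_exclusively(path: str, data: bytes) -> bool:
    """Write data to path only if path does not exist yet.

    Returns True if this caller created the file, False if another
    client had already created it.
    """
    try:
        # O_EXCL together with O_CREAT makes the open fail if the file exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return True
```

A client that gets `False` back would pick a different tin id and try
again, so no lock is needed.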


---

Having multiple chunks in a single file complicates the logic for
managing files in the repository, and deleting unused chunks.

Therefore, an alternative idea: instead of shoving multiple chunks
into one file, allow files to use parts of chunks. Currently a
file's metadata lists the chunks that have its contents. Change
this to be a list of (chunk id, offset, length) triplets, where
offset and length specify a part of a chunk. This way, a client can
create one chunk that contains the data of many small files, and
they can all just use the relevant part of the chunk. Managing
removal of those files is easy: it is the current code without
modification. 

--liw
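
A rough sketch of the (chunk id, offset, length) idea, with chunks held
in a plain dictionary for illustration. The names `ChunkPart` and
`read_file_contents` are hypothetical and do not reflect obnam's actual
code or metadata format.

```python
from typing import Dict, List, Tuple

# (chunk id, offset, length): one part of one chunk.
ChunkPart = Tuple[int, int, int]


def read_file_contents(parts: List[ChunkPart], chunks: Dict[int, bytes]) -> bytes:
    """Rebuild a file's contents from the chunk parts in its metadata."""
    pieces = []
    for chunk_id, offset, length in parts:
        chunk = chunks[chunk_id]
        if offset < 0 or length < 0 or offset + length > len(chunk):
            raise ValueError("part lies outside chunk %d" % chunk_id)
        pieces.append(chunk[offset:offset + length])
    return b"".join(pieces)


# One chunk holding the data of two small files.
chunks = {1: b"hello world!"}
small_file_a = [(1, 0, 5)]
small_file_b = [(1, 6, 6)]
```

Here `read_file_contents(small_file_a, chunks)` gives `b"hello"` and
`read_file_contents(small_file_b, chunks)` gives `b"world!"`, both
taken from the same chunk.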


This is implemented in git for FORMAT GREEN ALBATROSS. [[done]] --liw