path: root/bugs/more-details-dedup.mdwn
The following suggestion was sent to me by private e-mail (not sure if
I can show the sender's name):

    Date: Mon, 14 Oct 2013 11:48:41 +1100
    Subject: Re: Obnam repo size with .sql dump files seem too big

    Lars,

    We implemented a strategy for identifying repeated chunks, even in
    gzip-compressed files that have changes between versions. It might
    work for obnam. The downside is it creates variable-sized chunks,
    though you can set upper and lower limits.

    Firstly, we identify chunk boundaries by using a rolling checksum
    like Adler32, rolling over, say, a 128-byte contiguous region. Any
    other efficient FIR (box-car) checksum algorithm would work. When
    the bottom N bits of the checksum are zero, declare a block
    boundary. This gives blocks of average size 2**N. If you want more
    certainty, you can enforce a lower limit of 2**(N-3), and an upper
    limit of 2**N or 2**(N+1), for example (thereby either creating,
    or ignoring, the boundaries that are defined by the checksum).

    These variable-sized chunks will re-synchronise after differences
    in two streams. To make it work with deflate, we flush the
    compression context on each block boundary (we maintain the
    dictionary history, losing 1-2% of compression)… but I'm not sure
    that compression is necessary for obnam.

    By choosing N appropriately, you get the block size you want. We
    used N=12, for internet delivery (using HTTP subrange requests) of
    updated files. For each file, we publish a catalog of Adler32 and
    SHA-1 block checksums. Clients download that using HTTP, then
    analyse their files before making requests for blocks they lack.

    The invention of doing this in a deflate stream is due to Tim
    Adam. When we discover we already have a given block, we can prime
    the decompressor with that history before decompressing a received
    update block. Really it should be implemented using 7zip instead
    of deflate; a 32KB quotation history is too small.
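
A rough sketch of the boundary rule described above (this is not the sender's code or Obnam code; it assumes N=12 as mentioned, and uses a plain additive box-car sum over a 128-byte window in place of Adler32):

```python
# Sketch of content-defined chunking: declare a boundary when the
# bottom N bits of a rolling box-car sum are zero, with lower and
# upper size limits as suggested above. Constants are assumptions.
WINDOW = 128
N = 12
MIN_CHUNK = 2 ** (N - 3)   # ignore boundaries before this size
MAX_CHUNK = 2 ** (N + 1)   # force a boundary at this size
MASK = (1 << N) - 1

def split_chunks(data):
    """Split data into variable-sized chunks that re-synchronise
    after insertions or deletions in similar streams."""
    chunks = []
    start = 0
    rolling = 0
    for i in range(len(data)):
        rolling += data[i]
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]   # slide the box-car window
        size = i - start + 1
        if ((rolling & MASK) == 0 and size >= MIN_CHUNK) \
                or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0   # restart the window at each boundary
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Every chunk except possibly the last ends up between the lower and upper limits, and identical regions of two similar files tend to produce identical chunks once the boundaries re-synchronise.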
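
The "flush the compression context on each block boundary" step might look like the following with Python's zlib; using Z_SYNC_FLUSH (which byte-aligns the output while keeping the dictionary) is my assumption about which flush mode the sender meant:

```python
import zlib

# Sketch: compress a sequence of chunks into one deflate stream,
# byte-aligning the output at every chunk boundary. The small overhead
# of flushing is one reading of the "losing 1-2% of compression"
# remark above.
def compress_chunks(chunks):
    comp = zlib.compressobj()
    pieces = [comp.compress(c) + comp.flush(zlib.Z_SYNC_FLUSH)
              for c in chunks]
    pieces.append(comp.flush())   # terminate the stream
    return pieces
```

Joining the pieces yields a single valid zlib stream, so a receiver that already holds some pieces only needs the rest.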
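
The per-file catalog of weak and strong block checksums, and the client-side comparison against local data, can be sketched as follows (the function names and the tuple layout are hypothetical, not the published catalog format):

```python
import hashlib
import zlib

# Hypothetical catalog: one (adler32, sha1, length) entry per block,
# in the spirit of the scheme described above.
def make_catalog(blocks):
    return [(zlib.adler32(b), hashlib.sha1(b).hexdigest(), len(b))
            for b in blocks]

# Client side: compare strong checksums of locally held blocks against
# the catalog and return the indices of blocks that must be fetched.
def blocks_to_fetch(catalog, local_blocks):
    have = {hashlib.sha1(b).hexdigest() for b in local_blocks}
    return [i for i, (weak, strong, size) in enumerate(catalog)
            if strong not in have]
```

In the real scheme the weak Adler32 would be used first to find candidate matches cheaply, with SHA-1 confirming them, as rsync does.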
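
Priming the decompressor with an already-held block can be imitated with zlib's preset-dictionary support; this is a stand-in sketch for the technique, not the sender's implementation against the raw 32 KB deflate window:

```python
import zlib

# Stand-in sketch: a preset dictionary plays the role of priming the
# deflate history with a block we already hold. Data is illustrative.
known = bytes((i * 37) % 256 for i in range(4096))   # block we already have
update = known[:2000] + b"patched" + known[2000:]    # new version of it

comp = zlib.compressobj(zdict=known)                 # compress against history
wire = comp.compress(update) + comp.flush()

# The receiver primes its decompressor with the same known block.
restored = zlib.decompressobj(zdict=known).decompress(wire)
```

Because the update is expressed mostly as matches into the shared history, the bytes on the wire are far smaller than the block itself.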

---

I haven't tried this yet. It will require a new repository format
version, so I've been working on adding that support instead.
--liw