Diffstat (limited to 'bugs/more-details-dedup.mdwn')
-rw-r--r-- bugs/more-details-dedup.mdwn | 45
1 files changed, 0 insertions, 45 deletions
diff --git a/bugs/more-details-dedup.mdwn b/bugs/more-details-dedup.mdwn
deleted file mode 100644
index c00e3b5..0000000
--- a/bugs/more-details-dedup.mdwn
+++ /dev/null
@@ -1,45 +0,0 @@
-The following suggestion was sent to me by private e-mail (not sure if
-I can show the sender's name):
-
- Date: Mon, 14 Oct 2013 11:48:41 +1100
- Subject: Re: Obnam repo size with .sql dump files seem too big
-
- Lars,
-
- We implemented a strategy for identifying repeated chunks, even in
- gzip-compressed files that have changes between versions. It might
- work for obnam. The downside is it creates variable-sized chunks,
- though you can set upper and lower limits.
-
- Firstly, we identify chunk boundaries by using a rolling checksum
- like Adler32, rolling over say a 128 byte contiguous region. Any
- other efficient FIR (box-car) checksum algorithm would work. When
- the bottom N bits of the checksum are zero, declare a block
- boundary. This gives blocks of average size 2**N. If you want more
- certainty, you can enforce a lower limit of 2**(N-3), and upper
- limit of 2**N or 2**(N+1), for example (thereby either creating,
- or ignoring, the boundaries that are defined by the checksum).
-
- These variable-sized chunks will re-synchronise after differences
- in two streams. To make it work with deflate, we flush the
- compression context on each block boundary (the dictionary is
- flushed but we maintain the history, losing 1-2% of compression)…
- but I'm not sure that compression is necessary for obnam.
-
- By choosing N appropriately, you get the block size you want. We
- used N=12, for internet delivery (using HTTP subrange requests) of
- updated files. For each file, we publish a catalog of Adler32 and
- SHA-1 block checksums. Clients download that using HTTP, then
- analyse their files before making requests for blocks they lack.
-
- The invention of doing this in a deflate stream is due to Tim
- Adam. When we discover we already have a given block, we can prime
- the decompressor with that history before decompressing a received
- update block. Really it should be implemented using 7zip instead
- of deflate; a 32KB quotation history is too small.
-
----
-
-I haven't tried this yet. It will require a new repository format
-version, so I've been working on adding that support instead.
---liw
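
The boundary rule described in the quoted e-mail could be sketched roughly as follows. This is a minimal illustration, not Obnam code and not the sender's implementation. It assumes an rsync-style rolling sum in place of Adler32 proper (the e-mail says any box-car style checksum would do), tests the low N bits of the weighted half of that sum, and uses the 128-byte window, N=12, and the 2**(N-3) and 2**(N+1) limits mentioned above. The function names `chunk_boundaries`, `split_chunks`, and `shared_chunk_count` are made up for this example.

```python
import hashlib

WINDOW = 128  # bytes covered by the rolling checksum (figure from the e-mail)


def chunk_boundaries(data, n_bits=12, window=WINDOW):
    """Yield the end offset of each content-defined chunk in `data`.

    Assumption: an rsync-style rolling sum stands in for Adler32.
    `a` is the plain sum of the window, `b` the position-weighted sum;
    a boundary is declared when the low `n_bits` bits of `b` are zero.
    """
    mask = (1 << n_bits) - 1
    min_size = 1 << (n_bits - 3)  # boundaries closer than this are ignored
    max_size = 1 << (n_bits + 1)  # a boundary is forced at this size
    a = b = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the oldest byte, add the newest.
            out = data[i - window]
            a = (a - out + byte) & 0xFFFF
            b = (b - window * out + a) & 0xFFFF
        else:
            # Window is still filling up.
            a = (a + byte) & 0xFFFF
            b = (b + a) & 0xFFFF
        size = i + 1 - start
        if size < min_size:
            continue
        if (b & mask) == 0 or size >= max_size:
            yield i + 1
            start = i + 1
    if start < len(data):
        yield len(data)  # trailing partial chunk


def split_chunks(data, **kwargs):
    """Return the list of chunks that `chunk_boundaries` defines."""
    chunks, start = [], 0
    for end in chunk_boundaries(data, **kwargs):
        chunks.append(data[start:end])
        start = end
    return chunks


def shared_chunk_count(old, new, **kwargs):
    """Count chunks of `new` whose SHA-1 already occurs among `old`'s chunks."""
    known = {hashlib.sha1(c).digest() for c in split_chunks(old, **kwargs)}
    return sum(hashlib.sha1(c).digest() in known
               for c in split_chunks(new, **kwargs))
```

Because the checksum is never reset at a chunk boundary, it depends only on the last 128 bytes seen. After an insertion or deletion, two versions of a file should therefore start agreeing on boundaries again once both reach a checksum-defined boundary that clears the minimum size, and the chunks after that point hash identically. The deflate flushing and decompressor priming that the e-mail describes are not covered by this sketch.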