Diffstat (limited to 'bugs/more-details-dedup.mdwn')
-rw-r--r-- | bugs/more-details-dedup.mdwn | 45 |
1 files changed, 0 insertions, 45 deletions
diff --git a/bugs/more-details-dedup.mdwn b/bugs/more-details-dedup.mdwn
deleted file mode 100644
index c00e3b5..0000000
--- a/bugs/more-details-dedup.mdwn
+++ /dev/null
@@ -1,45 +0,0 @@
-The following suggestion was sent to me by private e-mail (not sure if
-I can show the sender's name):
-
-    Date: Mon, 14 Oct 2013 11:48:41 +1100
-    Subject: Re: Obnam repo size with .sql dump files seem too big
-
-    Lars,
-
-    We implemented a strategy for identifying repeated chunks, even in
-    gzip-compressed files that have changes between versions. It might
-    work for obnam. The downside is it creates variable-sized chunks,
-    though you can set upper and lower limits.
-
-    Firstly, we identify chunk boundaries by using a rolling checksum
-    like Adler32, rolling over say a 128 byte contiguous region. Any
-    other efficient FIR (box-car) checksum algorithm would work. When
-    the bottom N bits of the checksum are zero, declare a block
-    boundary. This gives blocks of average size 2**N. If you want more
-    certainty, you can enforce a lower limit of 2**(N-3), and upper
-    limit of 2**N or 2**(N+1), for example (thereby either creating,
-    or ignoring, the boundaries that are defined by the checksum.
-
-    These variable-sized chunks will re-synchronise after differences
-    in two streams. To make it work with deflate, we flush the
-    compression context (the dictionary, we maintain the history,
-    losing 1-2% of compression) on each block boundary… but I'm not
-    sure that compression is necessary for obnam.
-
-    By choosing N appropriately, you get the block size you want. We
-    used N=12, for internet delivery (using HTTP subrange requests) of
-    updated files. For each file, we publish a catalog of Adler32 and
-    SHA-1 block checksums. Clients download that using HTTP, then
-    analyse their files before making requests for blocks they lack.
-
-    The invention of doing this in a deflate stream is due to Tim
-    Adam. When we discover we already have a given block, we can prime
-    the decompressor with that history before decompressing a received
-    update block. Really it should be implemented using 7zip instead
-    of deflate; a 32KB quotation history is too small.
-
----
-
-I haven't tried this yet. It will require a new repository format
-version, so I've been working on adding that support instead.
---liw
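
The chunking scheme described in the quoted e-mail can be sketched roughly as below. This is an illustration only, not Obnam code and not the sender's implementation. The function name `split_chunks` and its parameters are hypothetical. The checksum is an assumption too: it uses rsync-style 16-bit running sums over a sliding window, which is Adler-32-like but is not true Adler-32. The limits follow the e-mail's example of 2**(N-3) and 2**(N+1).

```python
import random


def split_chunks(data, n_bits=12, window=128):
    """Split bytes into variable-sized chunks at content-defined boundaries.

    Hypothetical helper: a boundary is declared when the low n_bits of a
    rolling checksum over the last `window` bytes are all zero.
    """
    mask = (1 << n_bits) - 1        # boundary when these low bits are zero
    min_size = 1 << (n_bits - 3)    # lower limit: ignore earlier boundaries
    max_size = 1 << (n_bits + 1)    # upper limit: force a boundary here
    a = b = 0                       # Adler-32-style sums, kept modulo 2**16
    chunks = []
    start = 0
    for i, byte in enumerate(data):
        if i >= window:
            old = data[i - window]  # byte sliding out of the window
            a -= old
            b -= window * old
        a = (a + byte) & 0xFFFF     # plain sum of the bytes in the window
        b = (b + a) & 0xFFFF        # position-weighted sum of the window
        checksum = (b << 16) | a
        size = i + 1 - start
        at_boundary = (checksum & mask) == 0 and size >= min_size
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk, may be below min_size
    return chunks


if __name__ == "__main__":
    # Insert a few bytes at the front and count chunks common to both runs.
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(1 << 16))
    modified = b"INSERTED" + original
    before = set(split_chunks(original, n_bits=10))
    after = set(split_chunks(modified, n_bits=10))
    print(len(before), len(after), len(before & after))
```

The rolling state is deliberately not reset at chunk boundaries, so whether an offset is a candidate boundary depends only on the preceding `window` bytes. That is what lets two streams re-synchronise after a difference. A deduplicating store would then key each chunk by a strong hash such as SHA-1, as the e-mail's catalog does.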