[[!meta title="De-duplication does not work very well for me"]]

On some kinds of large files, Obnam's de-duplication does not work
very well, even though it should. For example, MySQL dump files
from successive days contain mostly the same data, but Obnam
de-duplicates them poorly. Below is an explanation of how Obnam's
de-duplication works, and why it works badly in some cases.

Obnam does de-duplication by splitting up file data into chunks,
and storing those individually. If two files have the same data,
Obnam re-uses the already backed up chunk. So far, so good. However,
for performance reasons, Obnam currently only notices a chunk when
it starts at an offset that is an integer multiple of the chunk size.

For example, assume a chunk size of 4 bytes, and the following two
files:

    file 1: AAAABBBBCCCC
    file 2: BBBBCCCCAAAA

In this case, Obnam will easily notice that there are three chunks
("AAAA", "BBBB", and "CCCC"), and will store them only once in the
backup repository. However, consider the following file:

    file 3: xAAAABBBBCCCC

File 3 is identical to file 1, except that a new byte has been
inserted at the beginning of the file. This shifts everything after
it by one byte, so Obnam sees file 3 as four chunks: "xAAA", "ABBB",
"BCCC", and "C". None of these chunks match the chunks already in
the backup repository. Thus, Obnam thinks they're all new.
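
The example above can be reproduced with a few lines of Python. This
is only an illustrative sketch, not Obnam's actual code: the function
names and the use of a plain `set` as the "repository" are assumptions
made for the demonstration.

```python
def split_chunks(data: bytes, chunk_size: int) -> list:
    """Split data into chunks that start at multiples of chunk_size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def back_up(data: bytes, repository: set, chunk_size: int) -> list:
    """Return the chunks that had to be stored, and add them to the repository."""
    missing = [c for c in split_chunks(data, chunk_size) if c not in repository]
    repository.update(missing)
    return missing


repository = set()  # hypothetical stand-in for the backup repository
print(back_up(b"AAAABBBBCCCC", repository, 4))   # file 1: three new chunks
print(back_up(b"BBBBCCCCAAAA", repository, 4))   # file 2: nothing new
print(back_up(b"xAAAABBBBCCCC", repository, 4))  # file 3: four new chunks
```

File 2 costs nothing extra because its chunks line up with the chunk
boundaries, while the single inserted byte in file 3 shifts every
boundary and defeats the match.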

There is no technical reason why Obnam could not notice that file 3
only has one inserted byte. However, doing so would require a very
large number of lookups in the repository, and thus would be quite
slow. There may be better ways of noticing the minute difference,
and perhaps someday one of them will be implemented in Obnam.
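
To give a feel for the cost, here is a rough sketch of what looking
for known chunks at every byte offset might look like. It is not how
Obnam works or is planned to work; the function name is made up, and
a `set` membership test stands in for a repository lookup.

```python
def scan_every_offset(data: bytes, repository: set, chunk_size: int):
    """Look for known chunks at every byte offset.

    Returns (number of repository lookups, list of (offset, chunk) matches).
    """
    lookups = 0
    matches = []
    offset = 0
    while offset + chunk_size <= len(data):
        window = data[offset:offset + chunk_size]
        lookups += 1                  # one repository lookup per candidate window
        if window in repository:
            matches.append((offset, window))
            offset += chunk_size      # skip past the chunk we just matched
        else:
            offset += 1               # no match: slide forward by one byte
    return lookups, matches


repository = {b"AAAA", b"BBBB", b"CCCC"}  # chunks already backed up
print(scan_every_offset(b"xAAAABBBBCCCC", repository, 4))
print(scan_every_offset(b"x" * 1000, repository, 4)[0])
```

This finds all three existing chunks in file 3. The catch is the
second call: for data that matches nothing, the number of lookups
approaches one per byte instead of one per chunk.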

Note that Obnam does not do a "diff" (or "xdelta" or other such
approach) to notice differences between successive versions of
files. Doing so would make backup generations depend on each
other, and re-introduce "full" versus "incremental" backups in a
way that is not acceptable.

With SQL dumps of databases, there are often small changes at
the beginning of the file, or in the middle of the file, which
make Obnam's de-duplication work very badly, even if the data as
such has only changed a tiny bit.

Unfortunately, I don't know of a trick that would make the SQL
dumps work better with Obnam. In any case, you should not have
to munge your live data to suit Obnam: Obnam needs to be able
to deal with whatever data you have. Until Obnam's de-duplication
becomes better, though, perhaps someone would have a workaround?

The best idea I have, and it is untested, is to keep the first SQL
dump in the live data. Before each backup, make a new dump, diff it
against the first dump, keep the diff, delete the new dump, and then
run the backup. This way, each successive Obnam backup generation
will have two files (the original SQL dump, and the diff), and you'll
need to apply the diff to get the real dump you need to restore your
database. Does that make sense to anyone?
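
As a rough illustration of the idea, here is a minimal Python sketch
of a line-based delta between a base dump and a new dump, plus the
step that rebuilds the new dump at restore time. It is just as
untested against real dumps as the idea itself: the function names
and the delta format are invented for this example, and a real setup
would more likely use an existing tool such as diff and patch, or
rdiff as suggested in the mailing list thread.

```python
import difflib


def make_delta(base_lines: list, new_lines: list) -> list:
    """Record only what is needed to rebuild new_lines from base_lines."""
    matcher = difflib.SequenceMatcher(a=base_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))            # reuse lines from the base dump
        else:
            delta.append(("data", new_lines[j1:j2]))  # new lines (empty for a deletion)
    return delta


def apply_delta(base_lines: list, delta: list) -> list:
    """Rebuild the new dump from the base dump and the delta."""
    result = []
    for op in delta:
        if op[0] == "copy":
            result.extend(base_lines[op[1]:op[2]])
        else:
            result.extend(op[1])
    return result


base = ["INSERT 1\n", "INSERT 2\n", "INSERT 3\n"]
new = ["-- dumped a day later\n", "INSERT 1\n", "INSERT 2\n", "INSERT 3\n", "INSERT 4\n"]
delta = make_delta(base, new)
assert apply_delta(base, delta) == new
```

Only the base dump and the delta would need to be in the live data
when the backup runs. Note that `difflib` can be slow on large inputs,
so this is meant to show the shape of the workaround, not to be used
on multi-gigabyte dumps.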

See also the mailing list thread:

* [Start](http://vlists.pepperfish.net/pipermail/obnam-flarn.net/2013-January/000510.html)
* [Lars's explanation](http://vlists.pepperfish.net/pipermail/obnam-flarn.net/2013-January/000511.html)
* [Workaround with rdiff](http://vlists.pepperfish.net/pipermail/obnam-flarn.net/2013-January/000516.html)