Diffstat (limited to 'faq/dedup.mdwn')
-rw-r--r-- | faq/dedup.mdwn | 68 |
1 files changed, 0 insertions, 68 deletions
diff --git a/faq/dedup.mdwn b/faq/dedup.mdwn
deleted file mode 100644
index 70efc49..0000000
--- a/faq/dedup.mdwn
+++ /dev/null
@@ -1,68 +0,0 @@
-[[!meta title="De-duplication does not work very well for me"]]
-
-On some kinds of large files, Obnam's de-duplication does not work
-very well, even though it should. For example, MySQL dump files
-from successive days are mostly the same data, but Obnam does badly
-with them. Below is an explanation of how the Obnam de-duplication
-works, and why it works badly in some cases.
-
-Obnam does de-duplication by splitting up file data into chunks,
-and storing those individually. If two files have the same data,
-Obnam re-uses the already backed up chunk. So far, so good. However,
-due to performance issues, Obnam currently only notices chunks when
-they start at integer multiples of the chunk size.
-
-For example, assume a chunk size of 4 bytes, and the following two
-files:
-
-    file 1: AAAABBBBCCCC
-    file 2: BBBBCCCCAAAA
-
-In this case, Obnam will easily notice that there are three chunks
-("AAAA", "BBBB", and "CCCC"), and will store them only once in the
-backup repository. However, consider the following file:
-
-    file 3: xAAAABBBBCCCC
-
-File 3 is identical to file 1, except that a new byte has been
-inserted into the file. This makes Obnam look at file 3 as four
-chunks: "xAAA", "ABBB", "BCCC", and "C". None of these chunks
-match the chunks already in the backup repository. Thus, Obnam
-thinks they're all new.
-
-There is no technical reason why Obnam could not notice that file 3
-only has one inserted byte. However, doing so would require a very
-large number of lookups in the repository, and thus would be quite
-slow. There may be better ways of noticing the minute difference,
-and perhaps someday one of them will be implemented in Obnam.
-
-Note that Obnam does not do a "diff" (or "xdelta" or other such
-approach) to notice differences between successive versions of
-files. Doing so would make backup generations be dependent on each
-other, and re-introduce "full" versus "incremental" backups in a
-way that is not acceptable.
-
-With SQL dumps of databases, there are often small changes at
-the beginning of the file, or in the middle of the file, which
-makes Obnam's de-duplication work very badly, even if the data as
-such has only changed a tiny bit.
-
-Unfortunately, I don't know of a trick that would make the SQL
-dumps work better with Obnam. In any case, you should not have
-to munge your live data to suit Obnam: Obnam needs to be able
-to deal with whatever data you have. Until Obnam's de-duplication
-becomes better, though, perhaps someone would have a workaround?
-
-The best idea, untested, I have is to keep the first SQL dump,
-in the live data, and then do a new dump before each backup, diff
-the two dumps, delete the new dump, and then run the backup. This
-way, each successive Obnam backup generation will have two files
-(the original SQL dump, and the diff), and you'll need to apply
-the diff to get the real dump you need to restore your database.
-Does that make sense to anyone?
-
-See also the mailing list thread:
-
-* [Start](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002008.html)
-* [Lars's explanation](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002009.html)
-* [Workaround with rdiff](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002014.html)
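
The chunk-boundary behaviour described in the removed FAQ text can be reproduced with a small sketch. This is not Obnam's code: the names `split_fixed`, `backup`, and `repo` are hypothetical, and identifying chunks by SHA-256 checksum is an assumption made only for illustration, since the FAQ does not say how Obnam looks chunks up. It uses the FAQ's own 4-byte chunk size and its three example files.

```python
import hashlib


def split_fixed(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut data into chunks that start at integer multiples of chunk_size."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def backup(repo: dict[str, bytes], data: bytes, chunk_size: int = 4) -> int:
    """Store the chunks of data in repo, keyed by checksum (an assumption).

    Returns how many chunks were not already present in repo.
    """
    new_chunks = 0
    for chunk in split_fixed(data, chunk_size):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in repo:
            repo[key] = chunk
            new_chunks += 1
    return new_chunks


repo: dict[str, bytes] = {}
new_1 = backup(repo, b"AAAABBBBCCCC")   # file 1: AAAA, BBBB, CCCC
new_2 = backup(repo, b"BBBBCCCCAAAA")   # file 2: same chunks, reordered
new_3 = backup(repo, b"xAAAABBBBCCCC")  # file 3: one byte inserted at the start
print(new_1, new_2, new_3)  # 3 0 4
```

File 2 adds nothing to the repository because its chunks line up with those of file 1, while the single inserted byte in file 3 shifts every boundary, so all four of its chunks are stored as new.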