Diffstat (limited to 'faq/dedup.mdwn')
-rw-r--r-- faq/dedup.mdwn 68
1 files changed, 0 insertions, 68 deletions
diff --git a/faq/dedup.mdwn b/faq/dedup.mdwn
deleted file mode 100644
index 70efc49..0000000
--- a/faq/dedup.mdwn
+++ /dev/null
@@ -1,68 +0,0 @@
-[[!meta title="De-duplication does not work very well for me"]]
-
-On some kinds of large files, Obnam's de-duplication does not work
-very well, even though it should. For example, MySQL dump files
-from successive days are mostly the same data, but Obnam does badly
-with them. Below is an explanation of how the Obnam de-duplication
-works, and why it works badly in some cases.
-
-Obnam does de-duplication by splitting up file data into chunks,
-and storing those individually. If two files have the same data,
-Obnam re-uses the already backed up chunk. So far, so good. However,
-for performance reasons, Obnam currently only notices duplicate
-chunks when they start at integer multiples of the chunk size.
-
-For example, assume a chunk size of 4 bytes, and the following two
-files:
-
- file 1: AAAABBBBCCCC
- file 2: BBBBCCCCAAAA
-
-In this case, Obnam will easily notice that there are three chunks
-("AAAA", "BBBB", and "CCCC"), and will store them only once in the
-backup repository. However, consider the following file:
-
- file 3: xAAAABBBBCCCC
-
-File 3 is identical to file 1, except that a new byte has been
-inserted at the start of the file. This shifts every chunk
-boundary, so Obnam sees file 3 as four chunks: "xAAA", "ABBB",
-"BCCC", and "C". None of these chunks match the chunks already in
-the backup repository. Thus, Obnam thinks they're all new.
-
-There is no technical reason why Obnam could not notice that file 3
-only has one inserted byte. However, doing so would require a very
-large number of lookups in the repository, and thus would be quite
-slow. There may be better ways of noticing the minute difference,
-and perhaps someday one of them will be implemented in Obnam.
-
-Note that Obnam does not do a "diff" (or "xdelta" or other such
-approach) to notice differences between successive versions of
-files. Doing so would make backup generations depend on each
-other, and re-introduce "full" versus "incremental" backups in a
-way that is not acceptable.
-
-With SQL dumps of databases, there are often small changes at
-the beginning of the file, or in the middle of the file, which
-make Obnam's de-duplication work very badly, even if the data as
-such has only changed a tiny bit.
-
-Unfortunately, I don't know of a trick that would make the SQL
-dumps work better with Obnam. In any case, you should not have
-to munge your live data to suit Obnam: Obnam needs to be able
-to deal with whatever data you have. Until Obnam's de-duplication
-becomes better, though, perhaps someone would have a workaround?
-
-The best idea I have, untested, is to keep the first SQL dump
-in the live data, and then do a new dump before each backup, diff
-the two dumps, delete the new dump, and then run the backup. This
-way, each successive Obnam backup generation will have two files
-(the original SQL dump, and the diff), and you'll need to apply
-the diff to get the real dump you need to restore your database.
-Does that make sense to anyone?
-
-See also the mailing list thread:
-
-* [Start](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002008.html)
-* [Lars's explanation](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002009.html)
-* [Workaround with rdiff](https://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org/2013-January/002014.html)