[[!meta title="Checksum collisions and safety"]]

Obnam uses the MD5 checksum algorithm to recognise duplicate
data chunks. MD5 has a reputation for being unsafe: people have
constructed files that are different, but result in the same MD5
checksum. This is true.

Every checksum algorithm can have collisions. Changing Obnam to, say,
SHA1, SHA2, or SHA3 would not remove the chance
of collisions. It would reduce the chance of accidental collisions,
but the chance of those is already so small with MD5 that it can be
disregarded. Put another way, if you care about the chance of
accidental MD5 collisions, you should care about accidental SHA1,
SHA2, or SHA3 collisions as well.
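
To put rough numbers on "so small", the sketch below uses the standard
birthday-bound approximation for the chance that any two of `n`
distinct chunks share a checksum. It assumes the checksum behaves like
a uniformly random function, which says nothing about deliberately
constructed collisions. The function name and the chunk count of
10^12 are illustrative assumptions, not figures from Obnam.

```python
import math


def accidental_collision_probability(num_chunks: int, digest_bits: int) -> float:
    """Approximate chance that at least two of num_chunks distinct chunks
    share a digest, assuming digests are uniformly random.

    Birthday bound: p ~= 1 - exp(-n * (n - 1) / 2 / 2**bits).
    """
    pairs = num_chunks * (num_chunks - 1) / 2
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-pairs / 2.0 ** digest_bits)


if __name__ == "__main__":
    # Hypothetical repository holding 10**12 distinct chunks.
    for name, bits in [("MD5", 128), ("SHA1", 160), ("SHA2 (256-bit)", 256)]:
        p = accidental_collision_probability(10**12, bits)
        print(f"{name}: {p:.3e}")
```

Even at that scale the MD5 figure is far below the chance of hardware
failure, which is the sense in which accidental collisions can be
disregarded.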

Apart from accidental collisions, there are two cases where you should
worry about checksum collisions (regardless of algorithm).

First, if you're into researching checksum collisions, you're likely
to have files that cause checksum collisions, and in that case, if you
restore after a catastrophe, you probably want to get the files back
intact, rather than having Obnam confuse one with the other.

Second, if you have an enemy who wishes to corrupt your backed up
data, they may replace some of the backed up data with other data that
has the same checksum. This way, when you restore, your data is
corrupted without Obnam noticing.

For both of these cases, you can instruct Obnam to **verify** that
chunks of data with the same checksum actually are the same data,
instead of relying on the checksum alone. This is as safe as it can
be, but it has a big performance impact: Obnam has to read from the
repository (possibly downloading from your backup server) all the
data you are backing up. You'll still benefit from the
de-duplication, however, so your repository size will be smaller.
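
The difference between trusting the checksum and verifying the data
can be illustrated with a toy in-memory chunk store. This is a minimal
sketch of the general technique, not Obnam's actual code: the
`ChunkStore` class, its `verify` flag, and the `reads` counter are all
hypothetical names invented for this example. The demonstration uses a
deliberately weak checksum (the chunk length) so that collisions are
easy to produce.

```python
import hashlib


class ChunkStore:
    """Toy in-memory chunk store (hypothetical, not Obnam's repository code)."""

    def __init__(self, verify=False, checksum=None):
        self.verify = verify
        # Default to MD5; a weaker function can be injected for demonstration.
        self.checksum = checksum or (lambda data: hashlib.md5(data).hexdigest())
        self.chunks = {}       # chunk id -> stored bytes
        self.by_checksum = {}  # checksum -> list of chunk ids
        self.reads = 0         # how many stored chunks were read back

    def put(self, data):
        """Store data, reusing an existing chunk where possible; return its id."""
        digest = self.checksum(data)
        for chunk_id in self.by_checksum.get(digest, []):
            if not self.verify:
                return chunk_id  # trust the checksum alone
            self.reads += 1      # verifying means reading the stored chunk back
            if self.chunks[chunk_id] == data:
                return chunk_id  # genuinely the same data
        # No match (or a collision detected by verification): store a new chunk.
        chunk_id = len(self.chunks)
        self.chunks[chunk_id] = data
        self.by_checksum.setdefault(digest, []).append(chunk_id)
        return chunk_id


if __name__ == "__main__":
    weak = lambda data: str(len(data))  # deliberately collision-prone
    trusting = ChunkStore(verify=False, checksum=weak)
    careful = ChunkStore(verify=True, checksum=weak)
    for store in (trusting, careful):
        first = store.put(b"aaaa")
        second = store.put(b"bbbb")  # same length, so same weak checksum
        print(store.verify, first == second, store.reads)
```

Without verification the two different chunks are merged into one, so
a restore would return the wrong data for the second. With
verification they are stored separately, at the cost of reading the
existing chunk back for every checksum match, while identical chunks
are still stored only once.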