[[!meta title="Obnam 1.0 (backup software); a story in many words"]]
[[!meta date="Fri,  8 Jun 2012 07:09:33 +0000"]]

**tl;dr:** Version 1.0 of [Obnam](http://liw.fi/obnam/), my 
snapshotting, de-duplicating, encrypting backup program, is released.
See the end of this announcement for the details.

Where we see the hero in his formative years; parental influence
----------------------------------------------------------------

From the very beginning, my computing life has involved backups.

In 1984, when I was 14, 
[my father](http://www.kolumbus.fi/arnow/) was an independent 
telecommunications consultant, which meant he needed a personal computer
for writing reports. He bought a 
[Luxor ABC-802](http://en.wikipedia.org/wiki/ABC_800#ABC_802), 
a Swedish computer with a Z80 microprocessor and two floppy drives.

My father also taught me how to use it. When I needed to
save files, he gave me not one, but two floppies, and explained
that I should store my files on one, and then copy them to the
other one every now and then.

Later on, over the years, I've made backups from a hard disk 
(30 megabytes!) to
a stack of floppies, to a tape drive attached to
a floppy interface (400 megabytes!), to a DAT drive, and to various other media.
It was always a bit tedious.

The start of the quest; lengthy justification for NIH
-----------------------------------------------------

In 2004, I decided to do a full backup, by burning a copy of all my
files onto CD-R disks. It took me most of the day. Afterwards, I sat
admiring the large stack of disks, and realized that I would not ever
do that again. I'm too lazy for that. That I had done it once was an
aberration in the space-time continuum.

Switching to DVD-Rs instead of CD-Rs would reduce the number of disks to 
burn, but not by enough: it would still take a stack of them. 
I needed something much better.

I had a little experience with tape drives, and that was enough to convince
me that I didn't want them. Tape drives are expensive hardware,
and the tapes also cost money. If the drive goes bad, you have to get
a compatible one, or all your backups are toast. The price per gigabyte
was coming down fast for hard drives, and it was clear that they were
about to be very competitive with tapes for price.

I looked for backup programs that I could use for disk based backups.
`rsync`, of course, was the obvious choice, but there were others.
I ended up doing what many geeks do: I wrote my own wrapper around
`rsync`. There are hundreds, possibly thousands, of such wrappers around
the Internet.

I also got the idea that doing a startup to provide online backup
space would be a really cool thing. However, I didn't really do
anything about that until 2007. More on that later.

The `rsync` wrapper script I wrote used hardlinked directory trees
to provide a backup history, though not in the smart way that
[backuppc](http://backuppc.sourceforge.net/) does it. 
The hardlinks were wonderful, because they were
cheap, and provided de-duplication. They were also quite cumbersome
when I needed to move my backups to a new disk for the first time. It
turned out that a lot of tools deal very badly with directory trees
containing large numbers of hardlinks.
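
The general technique can be sketched roughly like this, using `rsync`'s
`--link-dest` option (a hypothetical minimal wrapper for illustration, not
the script I actually wrote, and not how backuppc does it):

    import datetime
    import os
    import subprocess

    def backup(source, repo):
        # Each generation is a directory named by timestamp. rsync's
        # --link-dest makes unchanged files hardlinks into the previous
        # generation, so only changed files consume new disk space.
        stamp = datetime.datetime.now().strftime('%Y-%m-%d_%H%M%S')
        dest = os.path.join(repo, stamp)
        latest = os.path.join(repo, 'latest')
        cmd = ['rsync', '-a', '--delete']
        if os.path.exists(latest):
            cmd.append('--link-dest=' + os.path.realpath(latest))
        cmd += [source + '/', dest]
        subprocess.check_call(cmd)
        if os.path.lexists(latest):
            os.remove(latest)
        os.symlink(dest, latest)  # remember the newest generation

Every generation then looks like a full copy of the source directory,
even though unchanged files are stored only once.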

I also decided I wanted encrypted backups. This led me to find
[duplicity](http://duplicity.nongnu.org/), which is a nice program
that does encrypted backups, but I had issues with some of its
limitations. To fix those limitations, I would have had to re-design
and possibly re-implement the entire program. The biggest limitation 
was that it treated backups as a full backup, plus a sequence of 
incremental backups, which were deltas against the previous backup.

Delta based incrementals make sense for tape drives. You run a full
backup once, then incremental deltas for every day. When enough time
has passed since the full backup, you do a new full backup, and then
future incrementals are based on that. Repeat forever.

I decided that this makes no sense for disk based backups. If I have
already backed up a file, there's no point in making me back it up again,
since it's already there on the same hard disk. It makes even less
sense for online backups, since doing a new full backup would require
me to transmit all the data all over again, even though it's already
on the server.

The first battle
----------------

I could not find a program that did what I wanted to do, and like
every good [NIHolic](http://en.wikipedia.org/wiki/Not_Invented_Here), 
I started writing my own.

After various aborted attempts, I started for real in 2006. Here is
the first commit message:

    revno: 1
    committer: Lars Wirzenius <liw@iki.fi>
    branch nick: wibbr
    timestamp: Wed 2006-09-06 18:35:52 +0300
    message:
      Initial commit.

`wibbr` was the placeholder name for Obnam until we came up with
something better. "We" being myself and Richard Braakman, who was going
to be doing the backup startup with me. We eventually founded the
company near the end of 2006, and started doing business in 2007.

However, we did not do very much business, and ran out of money in
September 2007. We ended the backup startup experiment.
That's when I took a job with Canonical, and Obnam became a hobby
project of mine: I still wanted a good backup tool. 

In September 2007, Obnam was working, but it was not very good. 
For example, it was quite slow and wasteful of backup space.

That version of Obnam used deltas, based on the `rsync` algorithm, to
back up only changes. It did not require the user to do full and
incremental backups manually, but essentially created an endless
sequence of incrementals. It was possible to remove any generation,
and Obnam would manage the deltas as necessary, keeping the ones
needed for the remaining generations, and removing the rest. 
Obnam made it look as if each generation were independent of the others.

The wasteful part was the way in which metadata about files was
stored: each generation stored the full list of filenames and their
permissions and other inode fields. This metadata turned out to be bigger
than my typical daily delta.

The lost years; getting lost in the forest
------------------------------------------

For the next two years, I did a little work on Obnam, but I did not
make progress very fast. I changed the way metadata was stored, for
example, but I picked another bad way of doing it: the new way was
essentially building a tree of directory and file nodes, and any
unchanged subtrees were shared between generations. This reduced the
space overhead per generation, but made it quite slow to look up
the metadata for any one file.
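
A toy sketch of the sharing idea (hypothetical code, not what Obnam
actually did): nodes are immutable, and a new generation rebuilds only
the nodes on the path to a change, pointing at the previous generation's
subtrees everywhere else.

    class Dir:
        # Immutable directory node: name -> subdirectory or file metadata.
        def __init__(self, entries):
            self.entries = dict(entries)

    def with_change(node, path, metadata):
        # Return a new tree with the file at `path` updated. Only the
        # directories along the path are copied; unchanged siblings and
        # subtrees are shared by reference with the old generation.
        name = path[0]
        entries = dict(node.entries)
        if len(path) > 1:
            entries[name] = with_change(entries[name], path[1:], metadata)
        else:
            entries[name] = metadata
        return Dir(entries)

    gen1 = Dir({'etc': Dir({'passwd': 'old metadata'}),
                'home': Dir({'liw': 'more metadata'})})
    gen2 = with_change(gen1, ['etc', 'passwd'], 'new metadata')
    assert gen2.entries['home'] is gen1.entries['home']  # shared, not copied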

The final battle; finding cows in the forest
--------------------------------------------

In 2009 I decided to leave Canonical, and after that my Obnam hobby 
picked up speed again. Below is a table of the number of commits
per year, from the very first commit (`bzr log -n0 | 
awk '/timestamp:/ { print $3}' | sed 's/-.*//' | uniq -c | 
awk '{ print $2, $1 }' | tac`):

    2006 466
    2007 353
    2008 402
    2009 467
    2010 616
    2011 790
    2012 282

During most of 2010 and 2011 I was unemployed, and happily hacking
Obnam, while moving to another country twice. I don't recommend that
as a way to hack on hobby projects, but it worked for me.

After Canonical, I decided to tackle the way Obnam stores data from
a new angle. Richard told me about the copy-on-write (or COW) B-trees that
btrfs uses, originally designed by Ohad Rodeh
(see [his paper](http://liw.fi/larch/ohad-btrees-shadowing-clones.pdf)
for details),
and I started reading about that. It turned out that
they're pretty ideal for backups: each B-tree stores data about
one generation. To start a new generation, you clone the previous
generation's B-tree, and make any modifications you need.
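
Very roughly, the copy-on-write idea looks like this (a toy sketch with
a plain binary tree standing in for a B-tree; this is not larch's or
btrfs's actual interface): cloning a generation is just re-using the old
root, and a modification shadows only the nodes on the path from root to
leaf.

    class Node:
        # Immutable tree node; a real implementation uses wide B-tree
        # nodes stored on disk, but the sharing works the same way.
        def __init__(self, key, value, left=None, right=None):
            self.key, self.value = key, value
            self.left, self.right = left, right

    def insert(node, key, value):
        # Shadow (copy) the nodes on the path to the change; every
        # node off that path stays shared between generations.
        if node is None:
            return Node(key, value)
        if key < node.key:
            return Node(node.key, node.value,
                        insert(node.left, key, value), node.right)
        if key > node.key:
            return Node(node.key, node.value,
                        node.left, insert(node.right, key, value))
        return Node(key, value, node.left, node.right)

    gen1 = insert(insert(None, '/bin', 'meta'), '/etc', 'meta')
    gen2 = insert(gen1, '/etc', 'changed meta')
    # gen1 is untouched and still describes the old generation.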

I implemented the B-tree library myself, in Python. 
I wanted something that
was flexible about how and where I stored data, which the btrfs
implementation did not seem to give me. (Also, I worship at the 
altar of NIH.)

With the B-trees, doing file deltas from the previous generation
no longer made any sense. I realized that it was, in any case, a
better idea to store file data in chunks, and re-use chunks in
different generations as needed. This makes it much easier to
manage changes to files: with deltas, you need to keep a long chain
of deltas and apply many deltas to reconstruct a particular version.
With lists of chunks, you just get the chunks you need.
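
A minimal sketch of the chunk idea (illustrative only; Obnam's real
chunking, indexing, and on-disk format are more involved, and the chunk
size and hash here are assumptions of mine):

    import hashlib

    CHUNK_SIZE = 64 * 1024  # assumed fixed-size chunks, for simplicity

    def backup_file(path, chunk_store):
        # Store each chunk under a hash of its contents. A chunk that is
        # already in the store gets re-used, no matter which file or
        # generation it first came from: that is the de-duplication.
        chunk_ids = []
        with open(path, 'rb') as f:
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                chunk_id = hashlib.sha256(data).hexdigest()
                chunk_store.setdefault(chunk_id, data)
                chunk_ids.append(chunk_id)
        return chunk_ids  # the generation records this list for the file

    def restore_file(chunk_ids, chunk_store, path):
        # No delta chains to replay: just fetch the listed chunks.
        with open(path, 'wb') as f:
            for chunk_id in chunk_ids:
                f.write(chunk_store[chunk_id])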

The spin-off franchise; lost in a maze of dependencies, all alike
-----------------------------------------------------------------

In the process of developing Obnam, I have split off a number of
helper programs and libraries:

* [genbackupdata](http://liw.fi/genbackupdata/) 
  generates reproducible test data for backups
* [seivot](http://liw.fi/seivot/) 
  runs benchmarks on backup software (although only Obnam for now)
* [cliapp](http://liw.fi/cliapp/) 
  is a Python framework for command line applications
* [cmdtest](http://liw.fi/cmdtest/) 
  runs black box tests for Unix command line applications
* [summain](http://liw.fi/summain/) 
  makes diff-able file manifests (`md5sum` on steroids),
  useful for verifying that files are restored correctly
* [tracing](http://liw.fi/tracing/) 
  provides run-time selectable debug log messages that are very
  cheap during normal production runs, when messages are not printed
  
I have found it convenient to keep these split off, since I've been
able to use them in other projects as well. However, it turns out that
those installing Obnam don't like this: it would probably make sense to
have a fat release with Obnam and all dependencies, but I haven't bothered
to do that yet.

The blurb; readers advised about blatant marketing
--------------------------------------------------

The strong points of Obnam are, I think:

* **Snapshot** backups, similar to btrfs snapshot subvolumes. 
  Every generation looks like a complete snapshot,
  so you don't need to care about full versus incremental backups, or
  about rotating real or virtual tapes.
  The generations share data as much as possible,
  so only changes are backed up each time.
* Data **de-duplication**, across files, and backup generations. If the
  backup repository already contains a particular chunk of data, it will
  be re-used, even if it was in another file in an older backup
  generation. This way, you don't need to worry about moving around large
  files, or modifying them.
* **Encrypted** backups, using GnuPG (a sketch follows this list).
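
For instance, encrypting repository data with GnuPG can be sketched like
this (a hypothetical helper of mine, not Obnam's actual code; `keyid`
names a public key in your GnuPG keyring, and `data` is bytes):

    import subprocess

    def encrypt_chunk(data, keyid):
        # Pipe a chunk of backup data through gpg before storing it.
        p = subprocess.Popen(['gpg', '--encrypt', '--recipient', keyid],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = p.communicate(data)
        if p.returncode != 0:
            raise RuntimeError('gpg --encrypt failed')
        return out

    def decrypt_chunk(data):
        # gpg finds the right secret key from the encrypted data itself.
        p = subprocess.Popen(['gpg', '--decrypt'],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        out, _ = p.communicate(data)
        if p.returncode != 0:
            raise RuntimeError('gpg --decrypt failed')
        return out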

Backups may be stored on local hard disks (e.g., USB drives), on any
locally mounted network file share (NFS, SMB, almost anything with
remotely POSIX-like semantics), or on any SFTP server you have access to.

What's not so strong is backing up online over SFTP, particularly with
long round trip times to the server, or many small files to back up. 
That performance is Obnam's weakest part. I hope to fix that in the future,
but I don't want to delay 1.0 for it.

The big news; readers sighing in relief
---------------------------------------

I am now ready to release version 1.0 of Obnam. Finally. It's been
a long project, much longer than I expected, and much longer than
was really sensible. However, it's ready now. It's not bug free, and
it's not as fast as I would like, but it's time to declare it ready
for general use. If nothing else, this will get more people to use
it, and they'll find the remaining problems faster than I can on
my own.

I have packaged Obnam for Debian, and it is in `unstable`, and will
hopefully get into `wheezy` before the Debian freeze. I provide
packages built for `squeeze` in my own repository;
see the [download](http://liw.fi/obnam/download/) page.

The changes in the 1.0 release compared to the previous one:

* Fixed bug in finding duplicate files during a backup generation.
  Thanks to Saint Germain for reporting the problem.
* Changed version number to 1.0.

The future; not including winning lottery numbers
-------------------------------------------------

I expect to get a flurry of bug reports in the near future as new people
try Obnam. It will take a bit of effort dealing with that. Help is, of
course, welcome!

After that, I expect to be mainly working on Obnam performance for the
foreseeable future. There may also be a FUSE filesystem interface for
restoring from backups, and a continuous backup version of Obnam. Plus
other features, too. 

I make no promises about how fast new features
and optimizations will happen: Obnam is a hobby project for me, and I
work on it only in my free time. Also, I have a bunch of things that
are on hold until I get Obnam into shape, and I may decide to do one
of those things before the next big Obnam push.

Where; the trail of an errant hacker
------------------------------------

I've developed Obnam in a number of physical locations, and I thought
it might be interesting to list them:
Espoo, Helsinki, Vantaa, Kotka, Raahe, Oulu, Tampere, Cambridge, Boston,
Plymouth, London, Los Angeles, Auckland, Wellington, Christchurch,
Portland, New York, Edinburgh, Manchester, San Giorgio di Piano.
I've also hacked on Obnam in trains, on planes, and once on a ship,
but only for a few minutes on the ship before I got seasick.

Thank you; sincerely
--------------------

* Richard Braakman, for helping me with ideas, feedback, and some
  code optimizations, and for doing the startup with me. Even though
  he has provided little code, he's Obnam's most significant contributor
  so far.
* [Chris Cormack](http://blog.bigballofwax.co.nz/), for helping to build
  Obnam for Ubuntu. I no longer use Ubuntu at all, so it's a big help to
  not have to worry about building and testing packages for it.
* [Daniel Silverstone](http://www.digital-scurf.org/), for spending a
  Saturday with me hacking Obnam, and rewriting the way repository file
  filters work (compression, encryption), thus making them not suck.
* [Tapani Tarvainen](http://tapani.tarvainen.info/) for running Obnam for
  serious amounts of real data, and for being patient while I fixed things.
* [Soile Mottisenkangas](http://docstory.fi/) for believing in me, and
  helping me overcome periods of despair.
* Everyone else who has tried Obnam and reported bugs or provided any
  other feedback. I apologize for not listing everyone.

SEE ALSO
--------

* [Obnam home page](http://liw.fi/obnam/)
  - [support](http://liw.fi/obnam/status/)
  - [NEWS](http://liw.fi/obnam/NEWS/)
  - [README](http://liw.fi/obnam/README/)
  - [manual page](http://liw.fi/obnam/obnam.1.txt)
  - [design documents](http://liw.fi/obnam/development/)
  - [Debian QA package page](http://packages.qa.debian.org/o/obnam.html)
  - [bugs](http://liw.fi/obnam/bugs/)
  - [bugs in Debian](http://bugs.debian.org/obnam)
* [Other projects of mine (many are dependencies of 
  Obnam)](http://liw.fi/tag/program/)