From fb313c774c1edaae5d014793086ebbe174aac567 Mon Sep 17 00:00:00 2001 From: Lars Wirzenius Date: Tue, 22 Apr 2014 18:49:48 +0100 Subject: Add preliminary content, from old site --- 1.0.mdwn | 328 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 328 insertions(+) create mode 100644 1.0.mdwn (limited to '1.0.mdwn') diff --git a/1.0.mdwn b/1.0.mdwn new file mode 100644 index 0000000..a49a246 --- /dev/null +++ b/1.0.mdwn @@ -0,0 +1,328 @@ +[[!meta title="Obnam 1.0 (backup software); a story in many words"]] + +**tl;dr**: Version 1.0 of [Obnam](http://liw.fi/obnam/), my +snapshotting, de-duplicating, encrypting backup program is released. +See the end of this announcement for the details. + +Where we see the hero in his formative years; parental influence +---------------------------------------------------------------- + +From the very beginning, my computing life has involved backups. + +In 1984, when I was 14, +[my father](http://www.kolumbus.fi/arnow/) was an independent +telecommunications consultant, which meant he needed a personal computer +for writing reports. He bought a +[Luxor ABC-802](http://en.wikipedia.org/wiki/ABC_800#ABC_802), +a Swedish computer with a Z80 microprocessor and two floppy drives. + +My father also taught me how to use it. When I needed to +save files, he gave me not one, but two floppies, and explained +that I should store my files one one, and then copy them to the +other one every now and then. + +Later on, over the years, I've made backups from a hard disk +(30 megabytes!) to +a stack of floppies, to a tape drive installed into +a floppy interface (400 megabytes!), to a DAT drive, and various other media. +It was always a bit tedious. + +The start of the quest; lengthy justification for NIH +----------------------------------------------------- + +In 2004, I decided to do a full backup, by burning a copy of all my +files onto CD-R disks. It took me most of the day. Afterwards, I sat +admiring the large stack of disks, and realized that I would not ever +do that again. I'm too lazy for that. That I had done it once was an +aberration in the space-time continuum. + +Switching to DVD-Rs instead CD-Rs would reduce to the number of disks to +burn, but not enough: it would still take a stack of them. +I needed something much better. + +I had a little experience with tape drives, and that was enough to convince +me that I didn't want them. Tape drives are expensive hardware, +and the tapes also cost money. If the drive goes bad, you have to get +a compatible one, or all your backups are toast. The price per gigabyte +was coming down fast for hard drives, and it was clear that they were +about to be very competitive with tapes for price. + +I looked for backup programs that I could use for disk based backups. +`rsync`, of course, was the obvious choice, but there were others. +I ended up doing what many geeks do: I wrote my own wrapper around +`rsync`. There's hundred, possibly thousands, of such wrappers around +the Internet. + +I also got the idea that doing a startup to provide online backup +space would be a really cool thing. However, I didn't really do +anything about that until 2007. More on that later. + +The `rsync` wrapper script I wrote used hardlinked directory trees +to provide a backup history, though not in the smart way that +[backuppc](http://backuppc.sourceforge.net/) does it. +The hardlinks were wonderful, because they were +cheap, and provided de-duplication. They were also quite cumbersome, +when I needed to move my backups to a new disk the first time. It +turned out that a lot of tools deal very badly with directory trees +with large numbers of hardlinks. + +I also decided I wanted encrypted backups. This led me to find +[duplicity](http://duplicity.nongnu.org/), which is a nice program +that does encrypted backups, but I had issues with some of its +limitations. To fix those limitations, I would have had to re-design +and possibly re-implement the entire program. The biggest limitation +was that it treated backups as full backup, plus a sequence of +incremental backups, which were deltas against the previous backup. + +Delta based incrementals make sense for tape drives. You run a full +backup once, then incremental deltas for every day. When enough time +has passed since the full backup, you do a new full backup, and then +future incrementals are based on that. Repeat forever. + +I decided that this makes no sense for disk based backups. If I already +have backed up a file, there's no point in making me backup it again, +since it's already there on the same hard disk. It makes even less +sense for online backups, since doing a new full backup would require +me to transmit all the data all over again, even though it's already +on the server. + +The first battle +---------------- + +I could not find a program that did what I wanted to do, and like +every good [NIHolic](http://en.wikipedia.org/wiki/Not_Invented_Here), +I started writing my own. + +After various aborted attempts, I started for real in 2006. Here is +the first commit message: + + revno: 1 + committer: Lars Wirzenius + branch nick: wibbr + timestamp: Wed 2006-09-06 18:35:52 +0300 + message: + Initial commit. + +`wibbr` was the placeholder name for Obnam until we came up with +something better. We was myself and Richard Braakman, who was going +to be doing the backup startup with me. We eventually founded the +company near the end of 2006, and started doing business in 2007. + +However, we did not do very much business, and ran out of money in +September 2007. We ended the backup startup experiment. +That's when I took a job with Canonical, and Obnam became a hobby +project of mine: I still wanted a good backup tool. + +In September 2007, Obnam was working, but it was not very good. +For example, it was quite slow and wasteful of backup space. + +That version of Obnam used deltas, based on the `rsync` algorithm, to +backup only changes. It did not require the user to do full and +incremental backups manually, but essentially created an endless +sequence of incrementals. It was possible to remove any generation, +and Obnam would manage the deltas as necessary, keeping the ones +needed for the remaining generations, and removing the rest. +Obnam made it look as if each generation was independent of each other. + +The wasteful part was the way in which metadata about files was +stored: each generation stored the full list of filenames and their +permissions and other inode fields. This turned out to be bigger +than my daily delta. + +The lost years; getting lost in the forest +------------------------------------------ + +For the next two years, I did a little work on Obnam, but I did not +make progress very fast. I changed the way metadata was stored, for +example, but I picked another bad way of doing it: the new way was +essentially building a tree of directory and file nodes, and any +unchanged subtrees were shared between generations. This reduced the +space overhead per generation, but made it quite slow to look up +the metadata for any one file. + +The final battle; finding cows in the forest +-------------------------------------------- + +In 2009 I decided to leave Canonical and after that, my Obnam hobby +picked up in speed again. Below is a table of the number of commits +per year, from the very first commit (`bzr log -n0 | +awk '/timestamp:/ { print $3}' | sed 's/-.*//' | uniq -c | +awk '{ print $2, $1 }' | tac`): + + 2006 466 + 2007 353 + 2008 402 + 2009 467 + 2010 616 + 2011 790 + 2012 282 + +During most of 2010 and 2011 I was unemployed, and happily hacking +Obnam, while moving to another country twice. I don't recommend that +as a way to hack on hobby projects, but it worked for me. + +After Canonical, I decided to tackle the way Obnam stores data from +a new angle. Richard told me about the copy-on-write (or COW) B-trees that +btrfs uses, originally designed by Ohad Rodeh +(see [his paper](http://liw.fi/larch/ohad-btrees-shadowing-clones.pdf) +for details), +and I started reading about that. It turned out that +they're pretty ideal for backups: each B-tree stores data about +one generation. To start a new generation, you clone the previous +generation's B-tree, and make any modifications you need. + +I implemented the B-tree library myself, in Python. +I wanted something that +was flexible about how and where I stored data, which the btrfs +implementation did not seem to give me. (Also, I worship at the +altar of NIH.) + +With the B-trees, doing file deltas from the previous generation +no longer made any sense. I realized that it was, in any case, a +better idea to store file data in chunks, and re-use chunks in +different generations as needed. This makes it much easier to +manage changes to files: with deltas, you need to keep a long chain +of deltas and apply many deltas to reconstruct a particular version. +With lists of chunks, you just get the chunks you need. + +The spin-off franchise; lost in a maze of dependencies, all alike +----------------------------------------------------------------- + +In the process of developing Obnam, I have split off a number of +helper programs and libraries: + +* [genbackupdata](http://liw.fi/genbackupdata/) + generates reproducible test data for backups +* [seivot](http://liw.fi/seivot/) + runs benchmarks on backup software (although only Obnam for now) +* [cliapp](http://liw.fi/cliapp/) + is a Python framework for command line applications +* [cmdtest](http://liw.fi/cmdtest/) + runs black box tests for Unix command line applications +* [summain](http://liw.fi/summain/) + makes diff-able file manifests (`md5sum` on steroids), + useful for verifying that files are restored correctly +* [tracing](http://liw.fi/tracing/) + allows run-time selectable debug log messages that is really + fast during normal production runs when messages are not printed + +I have found it convenient to keep these split off, since I've been +able to use them in other projects as well. However, it turns out that +those installing Obnam don't like this: it would probably make sense to +have a fat release with Obnam and all dependencies, but I haven't bothered +to do that yet. + +The blurb; readers advised about blatant marketing +-------------------------------------------------- + +The strong points of Obnam are, I think: + +* **Snapshot** backups, similar to btrfs snapshot subvolumes. + Every generation looks like a complete snapshot, + so you don't need to care about full versus incremental backups, or + rotate real or virtual tapes. + The generations share data as much as possible, + so only changes are backed up each time. +* Data **de-duplication**, across files, and backup generations. If the + backup repository already contains a particular chunk of data, it will + be re-used, even if it was in another file in an older backup + generation. This way, you don't need to worry about moving around large + files, or modifying them. +* **Encrypted** backups, using GnuPG. + +Backups may be stored on local hard disks (e.g., USB drives), any +locally mounted network file shares (NFS, SMB, almost anything with +remotely Posix-like semantics), or on any SFTP server you have access to. + +What's not so strong is backing up online over SFTP, particularly with +long round trip times to the server, or many small files to back up. +That performance is Obnam's weakest part. I hope to fix that in the future, +but I don't want to delay 1.0 for it. + +The big news; readers sighing in relief +--------------------------------------- + +I am now ready to release version 1.0 of Obnam. Finally. It's been +a long project, much longer than I expected, and much longer than +was really sensible. However, it's ready now. It's not bug free, and +it's not as fast as I would like, but it's time to declare it ready +for general use. If nothing else, this will get more people to use +it, and they'll find the remaining problems faster than I can do on +my own. + +I have packaged Obnam for Debian, and it is in `unstable`, and will +hopefully get into `wheezy` before the Debian freeze. I provide +packages built for `squeeze` on my own repository, +see the [download](http://liw.fi/obnam/download/) page. + +The changes in the 1.0 release compared to the previous one: + +* Fixed bug in finding duplicate files during a backup generation. + Thanks to Saint Germain for reporting the problem. +* Changed version number to 1.0. + +The future; not including winning lottery numbers +------------------------------------------------- + +I expect to get a flurry of bug reports in the near future as new people +try Obnam. It will take a bit of effort dealing with that. Help is, of +course, welcome! + +After that, I expect to be mainly working on Obnam performance for the +foreseeable future. There may also be a FUSE filesystem interface for +restoring from backups, and a continous backup version of Obnam. Plus +other features, too. + +I make no promises about how fast new features +and optimizations will happen: Obnam is a hobby project for me, and I +work on it only in my free time. Also, I have a bunch of things that +are on hold until I get Obnam into shape, and I may decide to do one +of those things before the next big Obnam push. + +Where; the trail of an errant hacker +------------------------------------ + +I've developed Obnam in a number of physical locations, and I thought +it might be interesting to list them: +Espoo, Helsinki, Vantaa, Kotka, Raahe, Oulu, Tampere, Cambridge, Boston, +Plymouth, London, Los Angeles, Auckland, Wellington, Christchurch, +Portland, New York, Edinburgh, Manchester, San Giorgio di Piano. +I've also hacked on Obnam in trains, on planes, and once on a ship, +but only for a few minutes on the ship before I got seasick. + +Thank you; sincerely +-------------------- + +* Richard Braakman, for helping me with ideas, feedback, and some + code optimizations, and for doing the startup with me. Even though + he has provided little code, he's Obnam's most significant contributor + so far. +* [Chris Cormack](http://blog.bigballofwax.co.nz/), for helping to build + Obnam for Ubuntu. I no longer use Ubuntu at all, so it's a big help to + not have to worry about building and testing packages for it. +* [Daniel Silverstone](http://www.digital-scurf.org/), for spending a + Saturday with me hacking Obnam, and rewriting the way repository file + filters work (compression, encryption), thus making them not suck. +* [Tapani Tarvainen](http://tapani.tarvainen.info/) for running Obnam for + serious amounts of real data, and for being patient while I fixed things. +* [Soile Mottisenkangas](http://docstory.fi/) for believing in me, and + helping me overcome periods of despair. +* Everyone else who has tried Obnam and reported bugs or provided any + other feedback. I apologize for not listing everyone. + +SEE ALSO +-------- + +* [Obnam home page](http://liw.fi/obnam/) + - [tutorial](http://liw.fi/obnam/tutorial/) + - [support](http://liw.fi/obnam/status/) + - [NEWS](http://liw.fi/obnam/NEWS/) + - [README](http://liw.fi/obnam/README/) + - [manual page](http://liw.fi/obnam/obnam.1.txt) + - [design documents](http://liw.fi/obnam/development/) + - [Debian QA package page](http://packages.qa.debian.org/o/obnam.html) + - [bugs](http://liw.fi/obnam/bugs/) + - [bugs in Debian](http://bugs.debian.org/obnam) +* [Other projects of mine (many are dependencies of + Obnam)](http://liw.fi/tag/program/) + -- cgit v1.2.1