Backing up
==========

This chapter discusses the various aspects of making backups with
Obnam.

Your first backup
-----------------

Let's make a backup! To walk through the examples in this chapter,
you need to have some live data to back up. The examples use specific
filenames for this. You'll need to adapt the examples to your own
files. The examples assume your home directory is `/home/tomjon`, and
that you have a directory called `Documents` in your home directory
for your documents. Further, they assume you have a USB drive mounted
at `/media/backups`, and that you will be using a directory
`tomjon-repo` on that drive as the backup repository.

With those assumptions, here's how you would backup your documents:

    obnam backup -r /media/backups/tomjon-repo ~/Documents
    
That's all. It will take a little while, if you have a lot of
documents, but eventually it'll look something like this:

    Backed up 11 files (of 11 found),
    uploaded 97.7 KiB in 0s at 647.2 KiB/s average speed       

(In reality, the above text will be all on one line, but that didn't
fit in this manual's line width.)

This tells you that Obnam found a total of eleven files, and that it
backed up all eleven. The files contained a total of about a hundred
kibibytes of data, and the upload speed for that data was over six
hundred kibibytes per second. The units use IEC prefixes, which are
base-2, to avoid ambiguity. See
[Wikipedia on kibibytes] for more information.

[Wikipedia on kibibytes]: https://en.wikipedia.org/wiki/Kibibyte

Your first backup run should probably be quite small, so that you can
see that all settings are right without having to wait a long time.
You may want to choose a small directory to start with, instead of
your entire home directory.

Your second backup
------------------

Once you've run your first backup, you'll want to run a second one.
It's done the same way:

    obnam backup -r /media/backups/tomjon-repo ~/Documents

Note that you don't need to tell Obnam whether you want a full backup
or an incremental backup. Each Obnam backup generation is a snapshot
of the data at the time of the backup; Obnam does not distinguish
between full and incremental backups, and every generation is equal
to every other. This doesn't mean that each generation stores all the
data separately. Obnam makes sure each new generation only backs up
data that isn't already in the repository, and it will find that data
in any file in any previous generation, amongst all the clients
sharing the same repository.

We'll later cover how to remove backup generations, and you'll learn
that Obnam can remove any generation, even if it shares some of the
data with other generations, without those other generations losing
any data.

After you've made your second backup generation, you'll want to see
the generations you have:

    $ obnam generations -r /media/backups/tomjon-repo
    2	2014-02-05 23:13:50 .. 2014-02-05 23:13:50 (14 files, 100000 bytes) 
    5	2014-02-05 23:42:08 .. 2014-02-05 23:42:08 (14 files, 100000 bytes) 

This lists two generations, which have the identifiers 2 and 5. Note
that generation identifiers are not necessarily a simple sequence like
1, 2, 3. This is due to how some of the internal data structures of
Obnam are implemented, and not because its author in any way thinks
it's fun to confuse people.

The two time stamps for each generation are when the backup run
started and when it ended. In addition, each generation shows a count
of the files in that generation (total, not just new or changed
files), and the total number of bytes of file content data in them.

Choosing what to backup, and what not to backup
-----------------------------------------------

Obnam needs to be told what to back up, by giving it a list of
directories, known as backup roots. In the examples in this chapter so
far, we've used the directory `~/Documents` (that is, the directory
`Documents` in your home directory) as the backup root. There can be
multiple backup roots:

    obnam backup -r /media/backups/tomjon-repo ~/Documents ~/Photos

Everything in the backup root directories gets backed up -- unless it's
explicitly excluded. There are several ways to exclude things from
backups:

* The `--exclude` setting uses regular expressions that match the full
  pathname of each file or directory: if the pathname matches, the
  file or directory is not backed up. In fact, Obnam pretends it
  doesn't exist. If a directory matches, then any files and
  sub-directories also get excluded. This can be used, for example, to
  exclude all MP3 files (`--exclude='\.mp3$'`).
* The `--exclude-caches` setting excludes directories that contain a
  special "cache tag" file called `CACHEDIR.TAG`, which starts with a
  specific sequence of bytes. Such a tag file can be created in, for
  example, a Firefox or other web browser cache directory. Those files
  are usually not important to back up, and tagging the directory
  can be easier than constructing a regular expression for
  `--exclude` (see the example after this list).
* The `--one-file-system` setting excludes any mount points and the
  contents of the mounted filesystem. This is useful for skipping,
  for example, virtual filesystems such as `/proc`, remote filesystems
  mounted over NFS, and Obnam repositories mounted with `obnam mount`
  (which we'll cover in the next chapter).
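
To tag a cache directory yourself, you can create the tag file with
the signature defined by the Cache Directory Tagging Standard. A
minimal sketch, where the directory name is just an example:

    printf 'Signature: 8a477f597d28d172789f06886806bc55' \
        > ~/.cache/example/CACHEDIR.TAG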

In general it is better to back up too much rather than too little.
You should also make sure you know what is and isn't backed up. The
`--pretend` option tells Obnam to run a backup, except it doesn't
change anything in the backup repository, so it's quite fast. This way
you can see what would be backed up, and tweak exclusions as needed.
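
For example, to see what a backup of your home directory would
include, without writing anything to the repository:

    obnam backup -r /media/backups/tomjon-repo ~ --pretend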

Storing backups remotely
------------------------

You probably want to store at least one backup remotely, or "off
site". Obnam can make backups over a network, using the SFTP
protocol (part of SSH). You need the following to achieve this:

* A **server** that you can access over SFTP. This can be a server you
  own, a virtual machine ("VPS") you rent, or some other arrangement.
  You could, for example, have a machine at a friend's place, in
  exchange for having one of their machines at your place, so that you
  both can back up remotely.
  
* An **ssh key** for logging into the server. Obnam does not currently
  support logging in via passwords.

* Enough disk space on the server to hold your backups.

Obnam only uses the SFTP part of the SSH connection to access the
server. To test whether it will work, you can run this command:

    sftp USER@SERVER
    
Change `USER@SERVER` to your actual username and the address of your
server. This should say something like `Connected to localhost.`, and
you should be able to run the `ls -la` command to see a listing of
the files on the remote side.
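
A successful test might look something like this (the exact output
varies between SSH versions; the user and server names are made up):

    $ sftp tomjon@server.example.com
    Connected to server.example.com.
    sftp> ls -la
    drwxr-xr-x    2 tomjon tomjon   4096 Feb  5 23:13 .
    drwxr-xr-x    5 root   root     4096 Jan 10 09:00 ..
    sftp> quit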

Once all of that is set up correctly, to get Obnam to use the server
as a backup repository, you only need to give an SFTP URL:

    obnam backup -r sftp://USER@SERVER/~/my-precious-backups
    
For details on SFTP URLs, see the next section.

URL syntax
----------

Whenever Obnam accepts a URL, it can be either a local pathname or an
SFTP URL. An SFTP URL has the following form:

    sftp://[user@]domain[:port]/path

where `domain` is a normal Internet domain name for a server, `user`
is your username on that server, `port` is an optional numeric port
number, and `path` is a pathname on the server side. Like **bzr**(1),
but unlike the SFTP URL standard, the pathname is absolute, unless it
starts with `/~/`, in which case it is relative to the user's home
directory on the server.

Examples:

* `sftp://liw@backups.pieni.net/~/backup-repo` is the directory
  `backup-repo` in the home directory of the user `liw` on the server
  `backups.pieni.net`. Note that we don't need to know the absolute
  path of the home directory.

* `sftp://root@my.server.example.com/home` is the directory `/home`
  (note absolute path) on the server `my.server.example.com`, and the
  `root` user is used to access the server.

* `sftp://foo.example.com:12765/anti-clown-society` is the directory
  `/anti-clown-society` on the server `foo.example.com`, accessed via
  port 12765.

You can use SFTP URLs for the repository, or for the live data
(`--root`), but note that due to limitations in the protocol, and in
its implementation in the paramiko library, some things will not work
very well for accessing live data over SFTP. Most importantly, the
handling of hardlinks is rather suboptimal. Also, a live data URL
should not end with `/~/`; in this special case, append a dot at the
end.
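
For example, to use the home directory itself as the live data
location, the URL from the earlier examples would become:

    sftp://liw@backups.pieni.net/~/.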

Pull backups
------------

Obnam can also access the live data over SFTP, instead of via the
local filesystem. This means you can run Obnam on, say, your desktop
machine to back up your server, or on your laptop to back up your
phone (assuming you can get an SSH server installed on your phone).
Sometimes it is not possible to install Obnam on the machine where the
live data resides, and then it is useful to do a **pull backup**
instead: you run Obnam on a different machine, and read the live data
over the SFTP protocol.

To do this, you specify the live data location (the `root` setting,
or a command line argument to `obnam backup`) using an SFTP URL. You
should also set the client name explicitly. Otherwise, Obnam will use
the hostname of the machine on which it runs as the client name, and
this can be highly confusing: if your laptop is called `my-laptop` and
the server is called `down-with-clowns`, Obnam will store the server's
backups as if the data belonged to `my-laptop`.

It gets worse if you back up your laptop as well to the same backup
repository. Then Obnam will store both the server and the laptop
backups using the same client name, resulting in much confusion for
everyone.

Example:

    obnam backup -r /mnt/backups sftp://server.example.com/home \
        --client-name=server.example.com

Configuration files: a quick intro
----------------------------------

By this time you may have noticed that Obnam has a number of
configurable settings you can tweak in a number of ways. Doing it on
the command line is always possible, but the command lines get quite
long. You can also put the settings into a configuration file.

Every command line option Obnam knows can be set in a configuration
file. Later in this manual there is a whole chapter that covers all
the details of configuration files, and all the various settings you
can use. For now, we'll give a quick introduction.

An Obnam configuration file looks like this:

    [config]
    repository = /media/backups/tomjon-repo
    root = /home/tomjon/Documents, /home/tomjon/Photos
    exclude = \.mp3$
    exclude-caches = yes
    one-file-system = no

This form of configuration file is commonly known as an "INI file",
after Microsoft Windows `.INI` files. All the Obnam settings go into a
section titled `[config]`, and each setting has the same name as the
corresponding command line option, but without the double dash prefix.
Thus, it's `--exclude` on the command line and `exclude` in the
configuration file.

Some settings can have multiple values, such as `exclude` and `root`.
The values are comma separated. If there are a lot of values, you can
split them over multiple lines, where the second and later lines are
indented by space or TAB characters.
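
For example, a `root` setting with many values could be split like
this (the directory names are, again, just examples):

    [config]
    root = /home/tomjon/Documents,
        /home/tomjon/Photos,
        /home/tomjon/Mail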

That should get you started; you can refer to the "Obnam
configuration files and settings" chapter for all the details.
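
For example, if you save the settings above into a file that Obnam
reads by default, such as `~/.obnam.conf`, the backup command shrinks
to just:

    obnam backup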

When your precious data is very large
-------------------------------------

When your precious data is very large, the first backup may take a
very long time. The same is true if a later backup includes a lot of
new precious data. In these cases, you may need to be very patient,
and just let the backup take its time, or you may choose to start
small and add to the backups a bit at a time.

The patient option is easy: you tell Obnam to back up everything, set
it running, and wait until it's done, even if it takes hours or days.
If the backup terminates prematurely, e.g., because a network link
goes down, you won't have to start from scratch, thanks to Obnam's
checkpoint support. Every gigabyte or so (by default) Obnam stops a
backup run to create a checkpoint generation. If the backup later
crashes, you can just re-run Obnam and it will pick up from the latest
checkpoint. This is all fully automatic; you don't need to do anything
for it to happen. See the `--checkpoint` setting for choosing how
often the checkpoints should happen.
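
For example, to create a checkpoint after every two gigabytes instead
of the default one gigabyte (assuming your version of Obnam accepts a
size with a unit suffix here):

    obnam backup -r /media/backups/tomjon-repo ~ --checkpoint=2g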

Note that if Obnam doesn't get to finish, and you have to re-start it,
the scanning starts over from the beginning. The checkpoint generation
does not contain enough state for Obnam to continue scanning from the
latest file in the checkpoint: it'd be very complicated state, and
easily invalidated by filesystem changes. Instead, Obnam scans all
files, but most files will hopefully not have changed since the
checkpoint was made, so the scanning should be relatively fast.

The only problem with the patient option is that your most precious
data isn't backed up until all your large, but less precious, data
has been. For example, you may have a large collection of downloaded
videos of conference presentations, which are nice, but not hugely
important. While those are being backed up, your own documents are
not.

You can work around this by initially excluding everything except the
most precious data. When that is backed up, you gradually reduce the
excludes, re-running the backup, until you've backed up everything.
As an example, your first backup might have the following
configuration:

    obnam backup -r /media/backups/tomjon-repo ~ \
        --exclude ~/Downloads

This would exclude all downloaded files. The next backup run might
exclude only video files:

    obnam backup -r /media/backups/tomjon-repo ~ \
        --exclude ~/Downloads/'.*\.mp4$'

After this, you might reduce excludes to allow a few videos, such as
those whose name starts with a specific letter:

    obnam backup -r /media/backups/tomjon-repo ~ \
        --exclude ~/Downloads/'[^b-zB-Z].*\.mp4$'

Continue allowing more and more videos until they've all been backed
up.

De-duplication
--------------

Obnam de-duplicates the data it backs up, across all files in all
generations for all clients sharing the repository. It does this by
breaking up all file data into bits called chunks. Every time Obnam
reads a file and gets a chunk together, it looks into the backup
repository to see if an identical chunk already exists. If it does,
Obnam doesn't need to upload the chunk, saving space, bandwidth, and
time.

De-duplication in Obnam is useful in several situations:

* When you have two identical files, obviously. They might have
  different names, and be in different directories, but contain the
  same data.
* When a file keeps growing, but all the new data is added at the end.
  This is typical for log files, for example. If the leading chunks
  are unmodified, only the new data needs to be backed up.
* When a file or directory is renamed or moved. If you decide that the
  English name for the `Photos` directory is annoying and you want to
  use the Finnish `Valokuvat` instead, you can rename it in an
  instant. However, without de-duplication, you would then have to
  back up all your photos again.
* When a whole team works on the same things, and everyone has copies
  of the same files, the backup repository only needs one copy of each
  file, rather than one per team member.

De-duplication in Obnam isn't perfect. The granularity of finding
duplicate data is quite coarse (see the `--chunk-size` setting), so
Obnam often fails to find duplication when the changes are small.

De-duplication isn't useful in the following scenarios:

* A file changes such that things move around within the file. The
  (current) Obnam de-duplication is based on non-overlapping chunks
  from the beginning of a file. If some data is inserted, Obnam won't
  notice that the chunks have shifted around. This can happen, for
  example, for disk or ISO images.

* Files with duplicate data that is not on a chunk boundary. For
  example, emails with large attachments. Each email recipient gets
  different `Received` headers, which shifts the body and attachments
  by different amounts. As a result, Obnam won't notice the
  duplication.

* Data in compressed files, such as `.zip` or `.tar.xz` files. Obnam
  doesn't know about the file compression, and only sees the
  compressed version of the data. Thus, Obnam won't de-duplicate it.

A future version of Obnam will hopefully improve the de-duplication
algorithms. If you see this optimistic paragraph in a version of Obnam
released in 2017 or later, please notify the maintainers. Thank you.


De-duplication and safety against checksum collisions
-----------------------------------------------------

This is a bit of a scary topic, but it would be dishonest to not
discuss it at all. Feel free to come back to this section later.

Obnam uses the MD5 checksum algorithm for recognising duplicate data
chunks. MD5 has a reputation for being unsafe: people have constructed
files that are different, but result in the same MD5 checksum. This is
true. MD5 is not considered safe for security critical applications.

Every checksum algorithm can have collisions. Changing Obnam to use,
say, SHA1, SHA2, or the newer SHA3 algorithm would not remove the
chance of collisions. It would reduce the chance of accidental
collisions, but the chance of those is already so small with MD5 that
it can be disregarded. Or, put another way, if you care about the
chance of accidental MD5 collisions, you should care about accidental
SHA1, SHA2, or SHA3 collisions as well.

Apart from accidental collisions, there are two cases where you should
worry about checksum collisions (regardless of algorithm).

First, if you have an enemy who wishes to corrupt your backed up data,
they may replace some of the backed up data with other data that has
the same checksum. This way, when you restore, your data is corrupted
without Obnam noticing.

Second, if you're into researching checksum collisions, you're likely
to have files that cause checksum collisions, and in that case, if you
restore after a catastrophe, you probably want to get the files back
intact, rather than having Obnam confuse one with the other.

To deal with these situations, Obnam has three de-duplication modes,
set using the `--deduplicate` setting:

* The default mode, `fatalist`, assumes checksum collisions do not
  happen. This is a reasonable compromise between performance, safety,
  and security for most people.
* The `verify` mode assumes checksum collisions do happen, and
  verifies that the already backed up chunk is identical to the chunk
  to be backed up, by comparing the actual data. This requires
  downloading the chunk from the backup repository, and since
  checksums will often match, this can be quite slow. It is a useful
  mode if you have very fast access to the backup repository and want
  to de-duplicate, such as when the repository is on a locally
  connected hard drive.
* The `never` mode turns off de-duplication completely. This is 
  useful if you're worried about checksum collisions, and do not
  require de-duplication.
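
The mode is set on the command line or in the configuration file,
like any other setting. For example:

    [config]
    deduplicate = verify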

There is, unfortunately, no way to get de-duplication that is both
invulnerable to checksum collisions and fast when access to the
backup repository is slow. The only way to be invulnerable is to
compare the actual data, and if downloading the data from the
repository is slow, then the comparison will take significant time.

Locking
-------

Multiple clients can share a repository, and to prevent them from
trampling on each other, they lock parts of the repository while
working. The "Sharing a repository between multiple clients" chapter
will discuss this in more detail.

If Obnam terminates abruptly, the lock may stay around even if
there's only one client ever using the repository, and prevent that
one client from making new backups. The termination may be due to the
network connection breaking, or due to a bug in Obnam. It can also
happen if Obnam is interrupted by the user before it's finished.

The Obnam command `force-lock` deals with this situation. It is
dangerous, though. If you force open a lock that is in active use by
any running Obnam instance, on any client machine using that
repository, there is likely to be some stepping on toes. In extreme
cases, this may even corrupt the repository. So be careful.

If you've decided you can safely do it, this is an example of how to
do it:

    obnam -r /media/backups/tomjon-repo force-lock

It is not currently possible to break only the locks related to one
client.

Consistency of live data
------------------------

Making a backup can take a good while. While the backup is running,
the filesystem may change. This leads to the snapshot of data Obnam
presents as a backup generation being internally inconsistent. For
example, before a backup you might have two files, A and B, which need
to be kept in sync. You run a backup, and while it's happening, you
change A, and then B. However, you're unlucky, and Obnam manages to
back up A before you save your changes, and B after you save your
changes to it. The backup generation now has versions of A and B that
are not synchronised. This is bad.

This can be dealt with in various ways, depending on the
circumstances. Here's a few:

* The Logical Volume Manager (LVM) provides snapshots. You can set up
  your backups so that they first create a snapshot of each logical
  volume to be backed up, run the backup, and delete the snapshot
  afterwards (see the sketch after this list). This prevents anyone
  from modifying the files in the snapshot, but allows normal use to
  continue while the backup happens.
* A similar thing can be done using the btrfs filesystem and its
  subvolumes.
* You can shut down the system, reboot it into single user mode, and
  run the backup, before rebooting back into normal mode. This is not
  a good way to do it, but it is the safest way to get a consistent
  snapshot of the filesystem.
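
Here is a rough sketch of the LVM approach. The volume group and
logical volume names are made up, and the snapshot size depends on
how much the volume changes while the backup runs:

    lvcreate --size 1G --snapshot --name documents-snap /dev/vg0/documents
    mount /dev/vg0/documents-snap /mnt/snapshot
    obnam backup -r /media/backups/tomjon-repo /mnt/snapshot
    umount /mnt/snapshot
    lvremove -f /dev/vg0/documents-snap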

Note that filesystem level snapshots can't really guarantee a
consistent view of the live data. An application may be in the middle
of writing a file, or set of files, when the snapshot is being made.
To some extent this indicates an application bug, but knowing that
doesn't let you sleep better.

Usually, though, most systems have enough idle time that a consistent
backup snapshot can happen during that time. For a laptop, for
example, a backup can be run while the user is elsewhere, instead of
actively using the machine.

Part of your backup verification suite should check that the data in a
backup generation is internally consistent, if that can be done.
Otherwise, you'll either have to analyse the applications you use, or
trust they're not too buggy.

If you didn't understand this section, don't worry and be happy and
sleep well.