![](Box.jpg)

Broc, [`commons: Cat_into_the_box.jpg`](https://commons.wikimedia.org/wiki/File:Cat_into_the_box.jpg)


-----------------------------------------------------------------------------

Backups are easy.

~~~dot
digraph "" {

data [label="Data" shape=cylinder];
backup [label="Backup" shape=cylinder];

data -> backup [label="copy"];

}
~~~

-----------------------------------------------------------------------------


The end. That's all you need to know about backups.

-----------------------------------------------------------------------------

OK, so there's a little more to it, sometimes.

-----------------------------------------------------------------------------

# Challenges

* Full copy takes too long if you do it every time. **Incremental**
  backups only copy what's changed.

* A sequence of backups over time would be nice, but takes up a lot of
  space without **de-duplication**.
  
* Storage is expensive, so **compression** would be nice.

* Backups can be stolen, so it'd be nice for them to be **encrypted**.

* Backups can be tampered with, so it'd be nice for them to be
  **authenticated** (digitally signed). Otherwise you don't know if
  you can trust the data you get when restoring.

-----------------------------------------------------------------------------

Everything is too difficult, and takes too much effort.

-----------------------------------------------------------------------------

# Incremental backups

Compare live data against the most recent backup. If a file hasn't
changed, assume the previous copy is still good and use that instead
of making a new copy.

- some assumptions are necessary here to get performance. Simplified:
  if file size is the same and file modification time is the same, the
  file content probably hasn't changed
  
- not having to read and store every file every time is a huge time
  saver
  
- not having to store every file every time saves space, but that's
  less important than time savings
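
A minimal sketch of the size-and-modification-time heuristic above. The
names `fingerprint` and `needs_backup` are made up for this example, and
a real tool would keep the previous fingerprints in its backup metadata.

```python
import os
from typing import Optional, Tuple

# (size in bytes, modification time in nanoseconds)
Fingerprint = Tuple[int, int]


def fingerprint(path: str) -> Fingerprint:
    """Collect the cheap metadata used to guess whether a file changed."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def needs_backup(path: str, previous: Optional[Fingerprint]) -> bool:
    """Return True if the file is new or looks changed since last backup.

    This never reads the file content: a change that keeps both size
    and mtime the same goes unnoticed. That is the assumption traded
    for speed.
    """
    return previous is None or fingerprint(path) != previous
```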

-----------------------------------------------------------------------------

# De-duplication, chunking

Store each distinct bit of data only once, across all backups.

De-duplicating whole files is easy, but unsatisfactory: a small change
to a big file means storing the whole file again.

Split files into smaller chunks for better performance, at the cost of
more bookkeeping overhead. Store each distinct chunk separately.

Splitting at fixed sizes (8 KiB?) is easy, but inserted data shifts
every later chunk boundary. Splitting using a rolling checksum handles
inserted data, e.g., identical attachments in email spools. A chunk
ends when the checksum has its N lowest bits zero, or when the chunk
reaches a certain size.
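
A rough sketch of content-defined chunking with a rolling checksum. The
polynomial checksum, the window size, and the constants are all
assumptions picked for illustration, not what any particular backup
tool uses.

```python
from typing import Iterator

WINDOW = 48             # bytes covered by the rolling checksum (assumed)
MASK = (1 << 12) - 1    # cut when the 12 lowest bits are zero
MAX_SIZE = 64 * 1024    # hard upper limit on chunk size (assumed)
BASE = 257
MOD = (1 << 31) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def chunks(data: bytes) -> Iterator[bytes]:
    """Split data where the checksum of the last WINDOW bytes says so."""
    start = 0   # index where the current chunk begins
    h = 0       # checksum of up to WINDOW most recent bytes of the chunk
    for i, byte in enumerate(data):
        size = i - start + 1
        if size > WINDOW:
            # Window is full: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        if (size >= WINDOW and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # whatever is left over at the end
```

Because a cut depends only on the bytes inside the window, data inserted
early in a file moves the nearby boundaries but later chunks tend to
come out the same as before, so they de-duplicate.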

-----------------------------------------------------------------------------

# Compare chunks

Strong cryptographic checksum: SHA-256?

- when do collisions matter?

Bit by bit.

- may have bad performance
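
Both approaches can be combined. The sketch below uses a hypothetical
in-memory `ChunkStore`: it finds candidate duplicates by SHA-256 and,
if asked, also compares the bytes before trusting the match.

```python
import hashlib
from typing import Dict


class ChunkStore:
    """Store each distinct chunk once, keyed by its SHA-256 checksum."""

    def __init__(self, verify: bool = False) -> None:
        self.chunks: Dict[str, bytes] = {}
        self.verify = verify  # also compare content when checksums match

    def put(self, chunk: bytes) -> str:
        """Store the chunk unless already present; return its key."""
        key = hashlib.sha256(chunk).hexdigest()
        existing = self.chunks.get(key)
        if existing is None:
            self.chunks[key] = chunk
        elif self.verify and existing != chunk:
            # Same checksum, different content: a collision.
            raise ValueError("SHA-256 collision detected")
        return key
```

With `verify=False` a collision would silently replace one chunk's
content with another's on restore. With `verify=True` the stored chunk
has to be read back for every duplicate, which is where the performance
cost comes from.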

-----------------------------------------------------------------------------

# Compression

Lossless compression is easy. 

Low compression → really fast (faster than disk I/O).

Higher compression → more CPU, takes more time.

Lossy compression → replace every picture with the "Cat in a box"
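
A small sketch of the speed versus size trade-off, using Python's
`zlib` as a stand-in for whatever compressor a real tool would pick.
The helper names are made up for this example.

```python
import zlib


def compress_chunk(chunk: bytes, level: int = 1) -> bytes:
    """Compress losslessly; level 1 is fastest, level 9 tries hardest."""
    return zlib.compress(chunk, level)


def decompress_chunk(blob: bytes) -> bytes:
    """Recover exactly the bytes that were compressed."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    sample = b"backups are easy. " * 10_000
    for level in (1, 6, 9):
        blob = compress_chunk(sample, level)
        print(level, len(sample), "->", len(blob))
```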

-----------------------------------------------------------------------------

# Encryption and authentication

Authenticated encryption. Roughly: encrypt data with one key, then
compute a strong hash of the ciphertext, encrypt that hash with a
second key, and store both ciphertexts.

When restoring, decrypt the encrypted hash, compare it to a freshly
computed hash of the encrypted data, and only if they match, decrypt
the data.

If a stored chunk has been modified in any way, that is detected.

(Read details: 
[`en.wikipedia.org/wiki/Authenticated_encryption`](https://en.wikipedia.org/wiki/Authenticated_encryption))
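
A sketch of the "check before decrypting" step only. It swaps the
encrypted hash described above for HMAC-SHA256, a keyed hash that does
the same job. The encryption itself is left out, since Python's
standard library has no cipher: `ciphertext` is assumed to come from
some real cipher and is treated as opaque bytes. The function names are
made up for this example.

```python
import hashlib
import hmac

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def seal(ciphertext: bytes, mac_key: bytes) -> bytes:
    """Prefix already-encrypted data with an authentication tag."""
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return tag + ciphertext


def open_sealed(blob: bytes, mac_key: bytes) -> bytes:
    """Return the ciphertext only if its tag checks out.

    The caller decrypts afterwards; tampered data is never decrypted.
    """
    tag, ciphertext = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("stored chunk failed authentication")
    return ciphertext
```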

-----------------------------------------------------------------------------

# Error correction

Detecting errors isn't enough. Sometimes backups deteriorate.
Error-correcting codes would help. Or store each backup multiple times
à la RAID.
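
One of the simplest error-correcting schemes is XOR parity, as used by
some RAID levels. The sketch below is illustrative only: it assumes
equal-length blocks and that it is already known which block is bad,
for example because it failed an integrity check.

```python
from typing import List, Optional


def xor_parity(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together into one parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(parity):
            raise ValueError("all blocks must have the same length")
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)


def recover(blocks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one lost block (marked None) from the parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost block")
    result = list(blocks)
    if missing:
        present = [b for b in blocks if b is not None]
        # XOR of the surviving blocks and the parity is the lost block.
        result[missing[0]] = xor_parity(present + [parity])
    return result
```

Parity costs one extra block per group, much less than storing every
block twice, but it survives the loss of only one block per group.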


-----------------------------------------------------------------------------

~~~dot
digraph "" {

file1 [label="File 1" shape=tab];
file2 [label="File 2" shape=tab];
file3 [label="File 3" shape=tab];

chunk1 [label="Chunk 1" shape=box];
chunk2 [label="Chunk 2" shape=box];

backup [label="Backup process"];

backup1 [label="Backup 1: \n file 1, file 2, file 3" shape=note];
backup2 [label="Backup 2: \n file 1, file 2, file 3" shape=note];

file1 -> backup;
file2 -> backup;
file3 -> backup;

backup -> backup1;
backup -> backup2;

backup1 -> chunk1;
backup1 -> chunk2;

backup2 -> chunk1;
backup2 -> chunk2;

}
~~~


-----------------------------------------------------------------------------

# Summary

Backups are easy.

~~~dot
digraph "" {

data [label="Data" shape=cylinder];
backup [label="Backup" shape=cylinder];

data -> backup [label="run tool"];

}
~~~

The end. That's all you **should** need to know about backups.

-----------------------------------------------------------------------------

![](danger.jpg)

Dano, [`commons: Danger_thin_ice_keep_off`](https://commons.wikimedia.org/wiki/File:Danger_thin_ice_keep_off_(287326530).jpg)

-----------------------------------------------------------------------------

# Backups could be easier

Backup tools need to be installed, configured, and used. Backup
storage needs to be arranged.

What if backups just happen without you needing to do anything?

~~~dot
digraph "" {

data [label="Data" shape=cylinder];
backup [label="Backup" shape=cylinder];

data -> backup [label="just happens"];

}
~~~

-----------------------------------------------------------------------------

# My current crazy idea: Peer to peer backups


Safe, secure, without having to provide separate backup space, run a
server, buy a disk, or pay someone for a service.

Backup software comes as part of the operating system, and
runs automatically, when data changes, in a way that doesn't bother
the user.

You allow others to store some of their backups on your computer, and
they let you store some of yours on theirs. Quid pro quo.

Network effect: this works better the more people use it.

(Some details need to be sorted out.)

-----------------------------------------------------------------------------

~~~dot
digraph "" {

computer1 [label="Alice" shape=cylinder];
computer2 [label="Bob" shape=cylinder];
computer3 [label="Charlie" shape=cylinder];

computer1 -> computer2;
computer1 -> computer3;

computer2 -> computer1;
computer2 -> computer3;

computer3 -> computer2;


}
~~~


-----------------------------------------------------------------------------

# Legalese

Copyright 2021 Lars Wirzenius

This content is licensed under the Creative Commons
Attribution-ShareAlike 4.0 International ([CC BY-SA 4.0][]) licence.

[CC BY-SA 4.0]: https://creativecommons.org/licenses/by-sa/4.0/


---
title: "Introduction to backup technology"
subtitle: "An opinionated overview"
author: "Lars Wirzenius"
date: "2021-05-24"
...