summaryrefslogtreecommitdiff
path: root/tickets/11c32688f6ae4c039e5fef65b5007a88/Maildir/new/1500571585.M471479P20604Q1.koom
blob: 9d05d074b839df392da2632f7c5b3a0c3545495c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
Return-Path: <obnam-support-bounces@obnam.org>
X-Original-To: distix@pieni.net
Delivered-To: distix@pieni.net
Received: from yaffle.pepperfish.net (yaffle.pepperfish.net [88.99.213.221])
	by pieni.net (Postfix) with ESMTPS id 3232A4378C
	for <distix@pieni.net>; Thu, 20 Jul 2017 17:23:53 +0000 (UTC)
Received: from platypus.pepperfish.net (unknown [10.112.101.20])
	by yaffle.pepperfish.net (Postfix) with ESMTP id AFA25417A5;
	Thu, 20 Jul 2017 18:23:52 +0100 (BST)
Received: from ip6-localhost.nat ([::1] helo=platypus.pepperfish.net)
	by platypus.pepperfish.net with esmtp (Exim 4.80 #2 (Debian))
	id 1dYFB2-0003LW-MF; Thu, 20 Jul 2017 18:23:52 +0100
Received: from [10.112.101.21] (helo=mx3.pepperfish.net)
 by platypus.pepperfish.net with esmtps (Exim 4.80 #2 (Debian))
 id 1dYFB1-0003LF-LY
 for <obnam-support@obnam.org>; Thu, 20 Jul 2017 18:23:51 +0100
Received: from barracuda.pco-inc.com ([71.4.36.131])
 by mx3.pepperfish.net with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.89) (envelope-from <lperkins@openeye.net>)
 id 1dYFAz-0003VG-Fi
 for obnam-support@obnam.org; Thu, 20 Jul 2017 18:23:51 +0100
X-ASG-Debug-ID: 1500571418-0573a210922c6380001-phrF5L
Received: from Loki.pcopen.net ([10.0.0.65]) by barracuda.pco-inc.com with
 ESMTP id O2txcpxbBEzlTh7P (version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384
 bits=256 verify=NO); Thu, 20 Jul 2017 10:23:38 -0700 (PDT)
X-Barracuda-Envelope-From: lperkins@openeye.net
Received: from LOKI.pcopen.net ([fe80::39f5:aaff:14af:6002]) by
 Loki.pcopen.net ([fe80::39f5:aaff:14af:6002%10]) with mapi id 14.03.0351.000; 
 Thu, 20 Jul 2017 10:23:37 -0700
From: "Laurence Perkins (OE)" <lperkins@openeye.net>
To: "liw@liw.fi" <liw@liw.fi>
Thread-Topic: Variable Chunksize
X-ASG-Orig-Subj: Re: Variable Chunksize
Thread-Index: AQHTALO1js8PRLWMaEGpkNs8UOGlYqJb6R8AgAGEm4A=
Date: Thu, 20 Jul 2017 17:23:37 +0000
Message-ID: <1500571405.13826.8.camel@openeye.net>
References: <1500484994.13826.5.camel@openeye.net>
 <20170719181232.sdqihqdqldsgzmtd@liw.fi>
In-Reply-To: <20170719181232.sdqihqdqldsgzmtd@liw.fi>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: yes
X-MS-TNEF-Correlator: 
x-originating-ip: [10.0.50.60]
MIME-Version: 1.0
X-Barracuda-Connect: UNKNOWN[10.0.0.65]
X-Barracuda-Start-Time: 1500571418
X-Barracuda-Encrypted: ECDHE-RSA-AES256-SHA384
X-Barracuda-URL: https://10.0.0.6:443/cgi-mod/mark.cgi
X-Barracuda-Scan-Msg-Size: 4950
X-Virus-Scanned: by bsmtpd at pco-inc.com
X-Barracuda-BRTS-Status: 1
X-Barracuda-BRTS-Evidence: aea-obnam.org
X-Barracuda-BRTS-Evidence: port-obnam.org
X-Barracuda-Spam-Score: 0.00
X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of TAG_LEVEL=1000.0
 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=5.0 tests=
X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.3.41130
 Rule breakdown below
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
X-Pepperfish-Transaction: 2871-25ff-cc85-139e
X-Spam-Score: -1.9
X-Spam-Score-int: -18
X-Spam-Bar: -
X-Scanned-By: pepperfish.net, Thu, 20 Jul 2017 18:23:51 +0100
X-Spam-Report: Content analysis details: (-1.9 points)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 -1.9 BAYES_00               BODY: Bayes spam probability is 0 to 1%
 [score: 0.0000]
X-ACL-Warn: message may be spam
X-Scan-Signature: ba526989c802b3f197cd2fc106aa2555
Cc: "obnam-support@obnam.org" <obnam-support@obnam.org>
Subject: Re: Variable Chunksize
X-BeenThere: obnam-support@obnam.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Obnam backup software discussion <obnam-support-obnam.org>
List-Unsubscribe: <http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org>,
 <mailto:obnam-support-request@obnam.org?subject=unsubscribe>
List-Archive: <http://listmaster.pepperfish.net/pipermail/obnam-support-obnam.org>
List-Post: <mailto:obnam-support@obnam.org>
List-Help: <mailto:obnam-support-request@obnam.org?subject=help>
List-Subscribe: <http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org>,
 <mailto:obnam-support-request@obnam.org?subject=subscribe>
Content-Type: multipart/mixed; boundary="===============0595421157618928757=="
Mime-version: 1.0
Sender: obnam-support-bounces@obnam.org
Errors-To: obnam-support-bounces@obnam.org

--===============0595421157618928757==
Content-Language: en-US
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature"; boundary="=-YIIC7W8/nGsEbIikuEZg"

--=-YIIC7W8/nGsEbIikuEZg
Content-Type: text/plain; charset="UTF-7"
Content-Transfer-Encoding: quoted-printable

Hmmm...  it shouldn't actually increase the number of chunks on
average.  The ones that come out a little smaller will get balanced out
by ones that come out a little bigger.  I suppose a really unlucky
dataset could end up with a whole pile of tiny chunks, but adding a
minimum and maximum chunk size wouldn't change the algorithm that much,
and would mitigate that problem.

To illustrate, say we're storing a string of random digits 0-9.=20
Statistically, if the set is large enough, splitting every ten
characters should yield approximately the same number of chunks as
splitting every time a 1 comes up. The bigger the repo, the more likely
it is to come out the same.

Since the data going in obviously won't be random, the key is to use a
hash function where the output is (at least statistically).

When I get some spare time I'll see if I can hack it into the chunking
algorithm and test what happens with some real data sets.  I suspect
the real challenge will be finding a hash algorithm with sufficient
entropy that can be computed that many times without a backup taking a
month or having to resort to loading the calculations into the GPU...



On Wed, 2017-07-19 at 21:12 +-0300, Lars Wirzenius wrote:
+AD4 This has, in fact, been discussed before, and it's my intention to
+AD4 look at implementing this once FORMAT GREEN ALBATROSS is finished.
+AD4 The
+AD4 issue is that variable-sized chunking may increase the number of
+AD4 chunks by a lot, and I'm not sure Obnam's data structures will deal
+AD4 with that sufficiently efficiently.
+AD4=20
+AD4 On Wed, Jul 19, 2017 at 05:23:16PM +-0000, Laurence Perkins (OE)
+AD4 wrote:
+AD4 +AD4 Stumbled across a couple of rsync/backup type programs that use
+AD4 +AD4 variable-sized chunks to improve backup performance.+AKAAoA-It mi=
ght be
+AD4 +AD4 worth
+AD4 +AD4 considering adding as an option.
+AD4 +AD4=20
+AD4 +AD4 The concept is actually relatively simple.+AKAAoA-Instead of spli=
tting
+AD4 +AD4 chunks
+AD4 +AD4 at fixed sizes, chunks are split when the data matches some
+AD4 +AD4 heuristic.+AKA
+AD4 +AD4 The common way to do it is to do a byte-by-byte hash of the strea=
m
+AD4 +AD4 and
+AD4 +AD4 put in a chunk boundary wherever the hash meets some
+AD4 +AD4 criteria.+AKAAoA-With a
+AD4 +AD4 bit of statistics you can set what the average chunksize will be
+AD4 +AD4 just
+AD4 +AD4 by tweaking the criteria slightly.+AKAAoA(So, read say the first =
1KB,
+AD4 +AD4 hash
+AD4 +AD4 it, then drop the first byte from the data to hash and add 1KB+-1=
B
+AD4 +AD4 to
+AD4 +AD4 the end of the data to hash and repeat.+AKAAoA-You walk over the =
data a
+AD4 +AD4 byte
+AD4 +AD4 at a time while still keeping a big enough amount of data going
+AD4 +AD4 into
+AD4 +AD4 the hash algorithm to produce interesting output.+AKAAoA-Split th=
e chunk
+AD4 +AD4 when
+AD4 +AD4 the first X bytes of the hash are zero or whatever's convenient.)
+AD4 +AD4=20
+AD4 +AD4 Other than that, the rest of the chunking and indexing routines
+AD4 +AD4 work
+AD4 +AD4 the same way they do now.
+AD4 +AD4=20
+AD4 +AD4 The advantage is that adding a single byte to the start of a file
+AD4 +AD4 doesn't cause the entire file to be re-uploaded.+AKAAoA-It will j=
ust
+AD4 +AD4 increase
+AD4 +AD4 the size of the first chunk by one byte and re-use the rest. (Or,
+AD4 +AD4 if
+AD4 +AD4 it's exactly in the chunk boundary, it will change the first two
+AD4 +AD4 chunks.+AKAAoA-But still, that's a lot less data to re-upload.)+A=
KAAoA-It will
+AD4 +AD4 also
+AD4 +AD4 increase the amount of deduplication between nearly-identical
+AD4 +AD4 files.
+AD4 +AD4=20
+AD4 +AD4 The downside is the overhead involved in walking a hash algorithm
+AD4 +AD4 across the data, but it's not like you need a particularly CPU
+AD4 +AD4 intensive hash, or even any significant collision resistance, you
+AD4 +AD4 just
+AD4 +AD4 need one that produces enough entropy to get your chunk sizes in
+AD4 +AD4 the
+AD4 +AD4 realm you want.
+AD4 +AD4=20
+AD4 +AD4 This feature would significantly increase Obnam's deduplication
+AD4 +AD4 capabilities, without affecting the complexity of the datastore (=
It
+AD4 +AD4 already supports multiple chunk sizes in the repo.)
+AD4 +AD4=20
+AD4 +AD4 I will take a look and see if I can figure out how to add this in=
,
+AD4 +AD4 but
+AD4 +AD4 I'm not a particularly tidy coder on the best of days, so someone
+AD4 +AD4 with
+AD4 +AD4 a bit more Python design experience than I have might be quicker =
at
+AD4 +AD4 it.
+AD4 +AD4 +AKA-Like...+AKAAoA-A lot quicker...
+AD4 +AD4=20
+AD4 +AD4 LMP
+AD4 +AD4 +AF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBf=
AF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXwBfAF8AXw
+AD4 +AD4 obnam-support mailing list
+AD4 +AD4 obnam-support+AEA-obnam.org
+AD4 +AD4 http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-s=
up
+AD4 +AD4 port-obnam.org
+AD4 +AD4=20
+AD4=20
+AD4=20
--=-YIIC7W8/nGsEbIikuEZg
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEFbYe3ereZkZxAoz7C4CSuysVUSAFAllw5w4ACgkQC4CSuysV
USDBDQ//YKidqOuyaO2pXyZzGZCTs/kkKoFarI15mefwTe25GNqhFuhtZjCnmRcV
hk8YbbymXhphQ9RbryWpGciTgWl9yomGxMWpnc5nyQKWUPYuR7pTFvT+8rb8/5Wl
m0K6UAKryoHzsPzbpJ7Cq5OSx1ZJdBt/kplRRUCUgbw1H4Aw+87g0jwXGvvoJLVw
rdeuny7dGL+IzyvRM3AXxkoqw5MBi3CJgzVLbjwA5wNVTn9mjQ+Wd1rdHBM1EhmF
BIMrI7jxJRrA3Oe2Z6Z5rp0CReZ0yfDgOTsiGnQIDQqE4RjW3kSipW0lWbK7z0/d
Ul/NVh2OzNhv8avm4pnGZpVv0mHrFmszKQxYpv5sxcvhEOscNDius4/lFbeCALro
NURR3wu3ap5g2bRoEfn/ZSlvB+FaN2dis1lHY6wqCSOySC1sEJIFGEKketpEhk8D
dfJuXhoUrBaARu94qtFdOHubLX51e4eor45NXpdFZEN8RSLlVa72mxK199D722Ng
Jz8LzMSI0t8kevScY714sNH7Vj4oDtPhMO2Rb0IWu9vKTf98P0OCUjFUo36lmbmu
Njx5VPYneVhsUM1mftG9a4+42qBaie2ApY6mYFYKGnQRPGe0BZQ6KZL+DzQxch67
othfc0C0y3E6ZSUWm+o1RO2Wp043kMQuvaJKK28CWy0Rodu3vx8=
=IZJ6
-----END PGP SIGNATURE-----

--=-YIIC7W8/nGsEbIikuEZg--


--===============0595421157618928757==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
obnam-support mailing list
obnam-support@obnam.org
http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org

--===============0595421157618928757==--