From: "Laurence Perkins (OE)"
To: "liw@liw.fi"
Cc: "obnam-support@obnam.org"
Subject: Re: Variable Chunksize
Date: Mon, 24 Jul 2017 17:12:15 +0000
Message-ID: <1500916329.13826.13.camel@openeye.net>
References: <1500484994.13826.5.camel@openeye.net> <20170719181232.sdqihqdqldsgzmtd@liw.fi> <1500571405.13826.8.camel@openeye.net> <20170722141756.yzxatuvogrdsh4jv@liw.fi>
In-Reply-To: <20170722141756.yzxatuvogrdsh4jv@liw.fi>
List-Id: Obnam backup software discussion
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0226912654442020829=="

--===============0226912654442020829==
Content-Language: en-US
Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="=-mkTgCnw/w4FzwjwkBM2+"

--=-mkTgCnw/w4FzwjwkBM2+
Content-Type: text/plain; charset="UTF-7"
Content-Transfer-Encoding: quoted-printable

On Sat, 2017-07-22 at 17:17 +0300, Lars Wirzenius wrote:
> For more precise deduplication, it's almost certainly necessary to use
> smaller chunks than the megabyte sized ones Obnam currently uses by
> default. Smaller chunks mean more chunks.

Smaller chunk size makes deduplication more precise regardless of the type of splitting, but variable-size chunking should generate some pretty big savings on similar data without reducing the chunk size, because it is better at finding identical chunks of data: it does not rely on them being at fixed offsets.

To illustrate, consider the following two datasets:

aaaaabbbbb
caaaaabbbbb

With a fixed chunk size of 5, the first dataset makes two chunks and the second makes three, with no chunks shared. With a variable chunk algorithm that splits when it sees "ab", they both make two chunks, and they share one.

As with all deduplication, the chunk size must be tuned to the type of data you're working with, but even with the current 1MB chunk size, files larger than that will benefit any time they get a value inserted into the middle, which is common for files such as sparse VM images.

Which just made me realise what kind of space savings I could get on one of my archives... I guess this just moved up the priority list a ways...

Any suggestions for a hash that's fast enough to crawl over large files without hurting performance too much while still giving reasonable control over the average chunk size?

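To make the aaaaabbbbb example above concrete, here is a rough Python sketch. It is not Obnam code; the function names (fixed_chunks, marker_chunks, rolling_chunks) and the window and divisor parameters are made up for illustration. The marker version cuts right after each "ab". The rolling version is only a toy stand-in for a real rolling hash: it cuts when the sum of the last few bytes is divisible by a chosen number, so a larger divisor should give larger chunks on average.

```python
def fixed_chunks(data, size):
    """Cut data into consecutive pieces of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def marker_chunks(data, marker):
    """Cut data immediately after every occurrence of `marker`."""
    chunks = []
    start = 0
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        chunks.append(data[start:end])
        start = end
        pos = data.find(marker, start)
    if start < len(data):
        chunks.append(data[start:])  # trailing data after the last marker
    return chunks


def rolling_chunks(data, window=4, divisor=8):
    """Toy content-defined chunker (an assumption, not a recommended hash).

    Cuts after byte i when the sum of the last `window` bytes is divisible
    by `divisor`. The cut points depend only on nearby bytes, not on offsets.
    """
    chunks = []
    start = 0
    total = 0
    for i, byte in enumerate(data):
        total += byte
        if i >= window:
            total -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and total % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


first = b"aaaaabbbbb"
second = b"caaaaabbbbb"

print(fixed_chunks(first, 5))        # [b'aaaaa', b'bbbbb']
print(fixed_chunks(second, 5))       # [b'caaaa', b'abbbb', b'b']
print(marker_chunks(first, b"ab"))   # [b'aaaaab', b'bbbb']
print(marker_chunks(second, b"ab"))  # [b'caaaaab', b'bbbb']
```

With fixed offsets the one inserted byte shifts every later chunk, so nothing matches. With content-defined cuts only the chunk containing the insertion differs, and b'bbbb' is stored once.
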
LMP

--=-mkTgCnw/w4FzwjwkBM2+
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: This is a digitally signed message part
Content-Transfer-Encoding: 7bit

-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEFbYe3ereZkZxAoz7C4CSuysVUSAFAll2KmkACgkQC4CSuysV
USCyJxAAoHCoSWxXpswfJCx2bNm4YoERGJEPYE+a1G3zbj1xnolpDZOzji6vPesB
HqIMYDzohU2fH1dqPZLDzQ8kRLrnHckrevEgdPHCRIX8ZSAS2O+tvXi9gq2CkDZ6
LrR8pbIJZ+1zWeoh6uFWmA3vN/502gyAy2aacXhJyyT7kl1u+wGpkgsyHKRSP0XJ
PU4Ecxp5nWiYsCEq7R2prowLTBGO0Of+uSIw4xlT8M4iAwq/CiBp1dLK0Tg6P1aE
97+GOxcv73fMSTERRxErfl+/Ti3KfORvp8kjYKE29TjFtsBFEw5tfedLfLk81Ae1
0aorAJM+Reg/6T8D7CYhMwcFmDG08vRQwy1OlKVaJCAn/CyOkUyaCMpEFZXZOewy
a380AvAPY6h9xYebl3Ot5MwQCLUHYVVVhKRrkOgr0cULrIuGZobUVFPHREUDwbvR
fTDFEQuoLv11NJhyA/CwGbUKLtjtiByG9EKNEUo24vj+sTNcFL87pdpQxREwgCul
Fehbaq2MRs7M6iMwDFBLPMlswMu/m66bnTQ+4e2zABiXtmlwGkWejJxY9EUgLG/Z
FOwY5qXuDF315rwHkmHUXtdVXI1zn9Ke+Rf+rB2XxpGUzv8JRHbBQbGGjVCObFxV
il40F4A6VPbiO2E7lhZ5RzytL1Smchois+43fvkahCJhiH07HsE=
=ZJgW
-----END PGP SIGNATURE-----

--=-mkTgCnw/w4FzwjwkBM2+--

--===============0226912654442020829==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
obnam-support mailing list
obnam-support@obnam.org
http://listmaster.pepperfish.net/cgi-bin/mailman/listinfo/obnam-support-obnam.org

--===============0226912654442020829==--