author     Lars Wirzenius <liw@liw.fi>  2016-02-12 11:44:10 +0200
committer  Lars Wirzenius <liw@liw.fi>  2016-02-12 12:56:13 +0200
commit     0a52be884af28f8564ecbe0caa544f732040eb10 (patch)
tree       311bf3bd52ce14bc41b2a510e13c667cddd16f57
parent     16372e61522686abfbbbb3aead8bef29b1e98769 (diff)
download   genbackupdata-0a52be884af28f8564ecbe0caa544f732040eb10.tar.gz
Add a manual and yarn test suite
-rw-r--r--  manual.yarn  174
1 files changed, 174 insertions, 0 deletions
diff --git a/manual.yarn b/manual.yarn
new file mode 100644
index 0000000..c117520
--- /dev/null
+++ b/manual.yarn
@@ -0,0 +1,174 @@
+---
+title: genbackupdata---generate data for backups
+author: Lars Wirzenius
+date: SEE GIT
+...
+
+
+# Introduction
+
+`genbackupdata` is a utility for generating data for testing backup
+software, specifically [Obnam][]. It is particularly intended for
+generating reproducible synthetic data for benchmarking. It is often
+desirable for a benchmark to be run by multiple parties, such that
+all variables but one are controlled for: for example, comparing the
+same version of the backup software on two different computers. This
+requires the benchmark data to be the same as well.
+
+[Obnam]: http://obnam.org/
+
+Any reasonable benchmark will require a lot of data, and sharing
+that is expensive. Thus, genbackupdata could be considered a
+specialised compression tool: if two parties run the same version of
+genbackupdata with the same arguments, the output should be bitwise
+identical. In this way, a program of a few tens of kilobytes of
+source code can replace a data set of any size.
+
+The data generated by genbackupdata is random binary junk, produced
+with the RC4 algorithm. It is meant to be uncompressible and
+non-repeating, so that it is a worst case for backup software like
+Obnam.
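+
+As a rough illustration (a sketch, not genbackupdata's actual code),
+an RC4 keystream seeded with a fixed key yields the same byte
+sequence on every run, which is what makes the output reproducible:
+
+    # Sketch only: a deterministic RC4 keystream from a fixed key.
+    def rc4_keystream(key):
+        key = [ord(c) for c in key]
+        # Key-scheduling algorithm (KSA): permute S using the key.
+        S = list(range(256))
+        j = 0
+        for i in range(256):
+            j = (j + S[i] + key[i % len(key)]) % 256
+            S[i], S[j] = S[j], S[i]
+        # Pseudo-random generation (PRGA): yield keystream bytes.
+        i = j = 0
+        while True:
+            i = (i + 1) % 256
+            j = (j + S[i]) % 256
+            S[i], S[j] = S[j], S[i]
+            yield S[(S[i] + S[j]) % 256]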
+
+In addition to just generating data, genbackupdata generates a
+directory tree, with files of a desired size. It turns out that backup
+software has a per-file cost, and thus backing up a single gigabyte
+file is likely to be less expensive than backing up a billion one-byte
+files.
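+
+A minimal sketch of splitting the generated bytes into files of at
+most a given size follows; the function and the file naming here are
+made up for illustration and are not genbackupdata's own:
+
+    import os
+
+    def write_files(dirname, total_bytes, file_size, keystream):
+        # Write total_bytes of keystream output into dirname, split
+        # into files of at most file_size bytes each.
+        if not os.path.exists(dirname):
+            os.makedirs(dirname)
+        written = 0
+        filenum = 0
+        while written < total_bytes:
+            chunk = min(file_size, total_bytes - written)
+            data = bytearray(next(keystream) for _ in range(chunk))
+            filename = os.path.join(dirname, 'file-%d' % filenum)
+            with open(filename, 'wb') as f:
+                f.write(data)
+            written += chunk
+            filenum += 1
+
+For example, `write_files('example', 100000, 4096, rc4_keystream('seed'))`
+would create a directory `example` holding 100000 bytes split across
+25 files of at most 4096 bytes each.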
+
+genbackupdata is about the simplest possible implementation of these
+ideas. It could be improved in many ways, such as by producing
+different kinds of data (text in various languages, completely or
+partially duplicated files) or file sizes of a suitable statistical
+distribution. However, it suffices for Obnam development, and thus
+the author has no incentive to develop it further. If someone wants to
+take over and make the software more versatile, they should feel free
+to do so.
+
+## About this manual
+
+This manual gives an overview of how genbackupdata can be used. For
+detailed usage information, please see the manual page or the output
+of `genbackupdata --help`.
+
+The other purpose of this manual is to act as an automated integration
+test suite for genbackupdata. Run this manual source code through the
+[yarn][] tool to run the tests.
+
+[yarn]: http://liw.fi/cmdtest/
+
+
+# Simple usage
+
+The simplest way to use genbackupdata is to tell it to generate the
+desired amount of data. The amount is given with the `--create`
+option, which takes an argument giving the size in bytes.
+
+ SCENARIO generate some data
+ WHEN user runs genbackupdata --create=100 foo
+ THEN directory foo contains 100 bytes in files
+
+The `--create` size may also be given in bigger units (kilobytes,
+megabytes, etc.), using suffixes such as `k` for kilobyte (1000
+bytes).
+
+ WHEN user runs genbackupdata --create=100k bar
+ THEN directory bar contains 100000 bytes in files
+
+Further, the data is mostly uncompressible.
+
+ AND directory bar is about 100000 bytes when compressed
+
+# Multiple runs
+
+Every run of genbackupdata produces the same sequence of random bytes,
+so running it twice with the same arguments will produce the same data
+twice. Since genbackupdata does not overwrite existing files, the
+second run adds a duplicate copy of every file, and the combined data
+set is now highly compressible.
+
+ SCENARIO run genbackupdata twice
+ WHEN user runs genbackupdata --create=100k foo
+ AND user runs genbackupdata --create=100k foo
+ THEN directory foo contains 200000 bytes in files
+ AND all files in foo are duplicates
+
+# Control file size
+
+The maximum size of output files can be specified. This allows the
+user to generate a single, very large file, or a large number of small
+files.
+
+ SCENARIO control file size
+ WHEN user runs genbackupdata --create=100k --file-size=1m bigfile
+ THEN directory bigfile contains 1 file
+
+ WHEN user runs genbackupdata --create=1000 --file-size=1 manyfiles
+ THEN directory manyfiles contains 1000 files
+
+# Appendix: scenario step implementations
+
+This chapter implements the various scenario steps used in this
+manual. yarn runs each IMPLEMENTS step with the regular expression
+captures available in the `MATCH_1`, `MATCH_2`, ... environment
+variables, the scenario's temporary data directory in `DATADIR`, and
+the source tree in `SRCDIR`.
+
+ IMPLEMENTS WHEN user runs genbackupdata --create=(\S+) (.+)
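+    # Run genbackupdata from the source tree: the captured size goes to
+    # --create, the last word of the second capture is the target
+    # directory, and any words before it are extra options.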
+ import os
+ import cliapp
+ size = os.environ['MATCH_1']
+ args = os.environ['MATCH_2'].split()
+ opts = args[:-1]
+ dirname = os.path.join(os.environ['DATADIR'], args[-1])
+ bin = os.path.join(os.environ['SRCDIR'], 'genbackupdata')
+ cliapp.runcmd([bin, '--create', size] + opts + [dirname])
+
+ IMPLEMENTS THEN directory (\S+) contains (\d+) bytes in files
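+    # Walk the whole directory tree, sum the sizes of all the files,
+    # and require the total to match the wanted number of bytes.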
+ import os
+ root = os.path.join(os.environ['DATADIR'], os.environ['MATCH_1'])
+ wanted_bytes = int(os.environ['MATCH_2'])
+ total_bytes = 0
+ for dirname, subdirs, filenames in os.walk(root):
+ for filename in filenames:
+ pathname = os.path.join(dirname, filename)
+ print pathname, os.path.getsize(pathname)
+ total_bytes += os.path.getsize(pathname)
+ assert wanted_bytes == total_bytes, \
+ '%s != %s' % (wanted_bytes, total_bytes)
+
+    IMPLEMENTS THEN directory (\S+) is about (\d+) bytes when compressed
+    import os
+    import zlib
+    root = os.path.join(os.environ['DATADIR'], os.environ['MATCH_1'])
+    wanted_bytes = int(os.environ['MATCH_2'])
+    data = ''
+    for dirname, subdirs, filenames in os.walk(root):
+        for filename in filenames:
+            pathname = os.path.join(dirname, filename)
+            with open(pathname) as f:
+                data += f.read()
+    compressed = zlib.compress(data)
+    # The data should be nearly uncompressible: the compressed size
+    # must stay within a small margin of the wanted size.
+    size_delta = len(compressed) - wanted_bytes
+    print 'data:', len(data)
+    print 'compressed:', len(compressed)
+    print 'size_delta:', size_delta
+    assert abs(size_delta) < 1000
+
+ IMPLEMENTS THEN all files in (\S+) are duplicates
+ import collections
+ import os
+ root = os.path.join(os.environ['DATADIR'], os.environ['MATCH_1'])
+ files = collections.Counter()
+ for dirname, subdirs, filenames in os.walk(root):
+ for filename in filenames:
+ pathname = os.path.join(dirname, filename)
+ with open(pathname) as f:
+ data = f.read()
+ files[data] += 1
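+    # Each distinct file content must occur exactly twice, since the
+    # scenario ran genbackupdata twice with identical arguments.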
+ for data in files:
+ assert files[data] == 2
+
+ IMPLEMENTS THEN directory (\S+) contains (\d+) files?
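+    # Count all the files in the directory tree, at any depth, and
+    # require the count to match the wanted number exactly.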
+ import os
+ root = os.path.join(os.environ['DATADIR'], os.environ['MATCH_1'])
+ wanted_count = int(os.environ['MATCH_2'])
+ file_count = 0
+ for dirname, subdirs, filenames in os.walk(root):
+ file_count += len(filenames)
+ assert file_count == wanted_count