---
title: genbackupdata---generate data for backups
author: Lars Wirzenius
date: SEE GIT
...
# Introduction

`genbackupdata` is a utility for generating data for testing backup
software, specifically [Obnam][]. It is particularly intended for
generating reproducible synthetic data for benchmarking. It is often
desirable for a benchmark to be run by multiple parties, such that
all but one variable is controlled for: for example, comparing the
same version of the backup software on two different computers. This
requires the benchmark data to be the same as well.

[Obnam]: http://obnam.org/

Any reasonable benchmark requires a lot of data, and sharing that much
data is expensive. Thus, genbackupdata can be considered a specialised
compression tool: if two parties run the same version of genbackupdata
with the same arguments, the output should be bitwise identical. A
program of a few tens of kilobytes of source code can therefore
replace a data set of any size.

The data generated by genbackupdata is random binary junk, produced
with the RC4 algorithm. It is meant to be incompressible and
non-repeating, a worst case for backup software like Obnam.

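The core idea is small enough to sketch in Python: seeding RC4 with a
fixed key yields a deterministic keystream, so every run emits
identical, effectively incompressible bytes. This is a sketch of the
concept only, not genbackupdata's actual code, and the key
`b'benchmark seed'` is made up for the example.

```python
import itertools

def rc4_stream(key):
    """Yield an endless RC4 keystream for the given key (bytes)."""
    # Key-scheduling algorithm: permute S under the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation: walk the permutation forever.
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        yield S[(S[i] + S[j]) % 256]

# The same key always yields the same bytes, so two runs agree exactly.
first = bytes(itertools.islice(rc4_stream(b'benchmark seed'), 1000))
second = bytes(itertools.islice(rc4_stream(b'benchmark seed'), 1000))
assert first == second
```

Because the keystream is a cryptographic cipher's output, compressing
it saves essentially nothing, which is exactly the worst case wanted
here.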
In addition to just generating data, genbackupdata generates a
directory tree, with files of a desired size. It turns out that backup
software has a per-file cost, so backing up a single gigabyte file is
likely to be less expensive than backing up a billion one-byte files.

genbackupdata is about the simplest possible implementation of these
ideas. It could be improved in many ways, such as by producing
different kinds of data (text in various languages, completely or
partially duplicated files) or file sizes drawn from a suitable
statistical distribution. However, it suffices for Obnam development,
and thus the author has no incentive to develop it further. If someone
wants to take over and make the software more versatile, they should
feel free to do so.

## About this manual

This manual gives an overview of how genbackupdata can be used. For
detailed usage information, please see the manual page or the output
of `genbackupdata --help`.

The other purpose of this manual is to act as an automated integration
test suite for genbackupdata: running this manual's source code
through the [yarn][] tool executes the tests.

[yarn]: http://liw.fi/cmdtest/

# Simple usage

The simplest way to use genbackupdata is to tell it to generate the
desired amount of data. The amount is given with the `--create`
option, which takes an argument giving the size in bytes.

    SCENARIO generate some data
    WHEN user runs genbackupdata --create=100 foo
    THEN directory foo contains 100 bytes in files

The `--create` size may also be given in bigger units (kilobytes,
megabytes, etc), using suffixes, such as `k` for kilobyte (1000
bytes).

    WHEN user runs genbackupdata --create=100k bar
    THEN directory bar contains 100000 bytes in files

Further, the data is mostly incompressible.

    AND directory bar is about 100000 bytes when compressed

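The suffix handling amounts to multiplying by a decimal (SI) unit, as
the scenario above implies. A sketch of such a parser follows;
`parse_size` is a hypothetical helper written for this manual, not a
function inside genbackupdata, and the real program may accept a
different set of suffixes.

```python
# Decimal (SI) multipliers, as suggested by "k for kilobyte (1000 bytes)".
SUFFIXES = {'k': 1000, 'm': 1000 ** 2, 'g': 1000 ** 3, 't': 1000 ** 4}

def parse_size(text):
    """Turn a --create style argument such as '100k' into a byte count."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1]]
    return int(text)

assert parse_size('100k') == 100000
assert parse_size('100') == 100
```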
# Multiple runs

Every run of genbackupdata produces the same sequence of random bytes.
Running it twice with the same arguments will produce the same data
twice. Since genbackupdata does not overwrite existing files, the
second run adds a duplicate of every file, making the combined data
set highly compressible.

    SCENARIO run genbackupdata twice
    WHEN user runs genbackupdata --create=100k foo
    AND user runs genbackupdata --create=100k foo
    THEN directory foo contains 200000 bytes in files
    AND all files in foo are duplicates

# Control file size

The maximum size of the output files can be specified. This allows the
user to generate a single, very large file, or a large number of small
files.

    SCENARIO control file size
    WHEN user runs genbackupdata --create=100k --file-size=1m bigfile
    THEN directory bigfile contains 1 file

    WHEN user runs genbackupdata --create=1000 --file-size=1 manyfiles
    THEN directory manyfiles contains 1000 files

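The interaction between `--create` and `--file-size` amounts to
splitting the total into chunks of at most the maximum file size. The
following sketch shows the arithmetic; `plan_file_sizes` is a
hypothetical helper for illustration, not genbackupdata's own code.

```python
def plan_file_sizes(total_bytes, max_file_size):
    """Split a total byte count into per-file sizes, each at most
    max_file_size; only the last file may be smaller."""
    sizes = []
    while total_bytes > 0:
        size = min(total_bytes, max_file_size)
        sizes.append(size)
        total_bytes -= size
    return sizes

# 100 kB of data with a 1 MB cap fits in a single file...
assert plan_file_sizes(100 * 1000, 1000 ** 2) == [100000]
# ...while 1000 bytes capped at 1 byte per file means 1000 files.
assert len(plan_file_sizes(1000, 1)) == 1000
```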
# Appendix: scenario step implementations

This chapter implements the various scenario steps used in this
manual.

    IMPLEMENTS WHEN user runs genbackupdata --create=(\S+) (.+)
    import cliapp
    import yarnstep
    size = yarnstep.get_next_match()
    args = yarnstep.get_next_match().split()
    opts = args[:-1]
    dirname = yarnstep.datadir(args[-1])
    bin = yarnstep.srcdir('genbackupdata')
    cliapp.runcmd([bin, '--create', size] + opts + [dirname])

    IMPLEMENTS THEN directory (\S+) contains (\d+) bytes in files
    import os
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_bytes = yarnstep.get_next_match_as_int()
    total_bytes = sum(
        os.path.getsize(x) for x in yarnstep.iter_over_files(root))
    assert wanted_bytes == total_bytes, \
        '%s != %s' % (wanted_bytes, total_bytes)

    IMPLEMENTS THEN directory (\S+) is about (\d+) bytes when compressed
    import zlib
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_bytes = yarnstep.get_next_match_as_int()
    data = ''.join(yarnstep.cat(x) for x in yarnstep.iter_over_files(root))
    compressed = zlib.compress(data)
    size_delta = len(compressed) - len(data)
    assert abs(size_delta) < 1000

    IMPLEMENTS THEN all files in (\S+) are duplicates
    import collections
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    files = collections.Counter()
    for pathname in yarnstep.iter_over_files(root):
        files[yarnstep.cat(pathname)] += 1
    for data in files:
        assert files[data] == 2

    IMPLEMENTS THEN directory (\S+) contains (\d+) files?
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_count = yarnstep.get_next_match_as_int()
    file_count = len(list(yarnstep.iter_over_files(root)))
    assert file_count == wanted_count