---
title: genbackupdata---generate data for backups
author: Lars Wirzenius
date: SEE GIT
...


# Introduction

`genbackupdata` is a utility for generating data for testing backup
software, specifically [Obnam][]. It is particularly intended for
generating reproducible synthetic data for benchmarking. It is often
desirable for a benchmark to be run by multiple parties, such that
all but one variable is controlled for: for example, comparing the
same version of the backup software on two different computers. This
requires the benchmark data to be the same as well.

[Obnam]: http://obnam.org/

Any reasonable benchmarking will require a lot of data, and sharing
that is expensive. Thus, genbackupdata could be considered a
specialised compression tool: if two parties run the same version of
genbackupdata with the same arguments, the output should be bitwise
identical. In this way, a program of a few tens of kilobytes of
source code can replace a data set of any size.

The data generated by genbackupdata is random binary junk, produced
using the RC4 algorithm. It is meant to be incompressible and
non-repeating, to be a worst case for backup software like Obnam.
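
As an illustration of why this is reproducible, here is a minimal
sketch of an RC4 keystream generator (illustrative only, not
genbackupdata's actual code; the key shown is an arbitrary example):
given the same key, it yields exactly the same byte sequence on every
run.

    def rc4_stream(key):
        # Key-scheduling algorithm (KSA): permute 0..255 using the key.
        key = bytearray(key)
        S = list(range(256))
        j = 0
        for i in range(256):
            j = (j + S[i] + key[i % len(key)]) % 256
            S[i], S[j] = S[j], S[i]
        # Pseudo-random generation algorithm (PRGA): emit the keystream.
        i = j = 0
        while True:
            i = (i + 1) % 256
            j = (j + S[i]) % 256
            S[i], S[j] = S[j], S[i]
            yield S[(S[i] + S[j]) % 256]

    def junk(nbytes, key=b'example key'):
        # Same key and nbytes give bitwise identical output every time.
        stream = rc4_stream(key)
        return bytes(next(stream) for _ in range(nbytes))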

In addition to just generating data, genbackupdata generates a
directory tree, with files of a desired size. It turns out that backup
software has a per-file cost, and thus backing up a single gigabyte
file is likely to be less expensive than backing up a billion one-byte
files.
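
To make the trade-off concrete, here is a sketch of how a total byte
budget could be split into files of at most a given size
(`file_sizes` is a hypothetical helper for illustration, not part of
genbackupdata):

    def file_sizes(total_bytes, max_file_size):
        # Full-sized files, plus a smaller final file for any remainder.
        full, rest = divmod(total_bytes, max_file_size)
        return [max_file_size] * full + ([rest] if rest else [])

    assert file_sizes(100 * 1000, 10**6) == [100000]  # one small file
    assert file_sizes(1000, 1) == [1] * 1000          # many tiny files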

genbackupdata is about the simplest possible implementation of these
ideas. It could be improved in many ways, such as by producing
different kinds of data (text in various languages, completely or
partially duplicated files) or file sizes of a suitable statistical
distribution. However, it suffices for Obnam development, and thus
the author has no incentive to develop it further. If someone wants to
take over and make the software more versatile, they should feel free
to do so.

## About this manual

This manual gives an overview of how genbackupdata can be used. For
detailed usage information, please see the manual page or the output
of `genbackupdata --help`.

The other purpose of this manual is to act as an automated integration
test suite for genbackupdata. Run this manual source code through the
[yarn][] tool to run the tests.

[yarn]: http://liw.fi/cmdtest/
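
For example, assuming yarn is installed and this manual is saved as
`manual.yarn`, the tests could be run with a command like:

    yarn manual.yarn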


# Simple usage

The simplest way to use genbackupdata is to tell it to generate the
desired amount of data. The amount is given with the `--create`
option, which takes an argument giving the size in bytes.

    SCENARIO generate some data
    WHEN user runs genbackupdata --create=100 foo
    THEN directory foo contains 100 bytes in files

The `--create` size may also be given in bigger units (kilobytes,
megabytes, etc.) using suffixes, such as `k` for kilobyte (1000
bytes).

    WHEN user runs genbackupdata --create=100k bar
    THEN directory bar contains 100000 bytes in files
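
Suffix handling like this takes only a few lines; the sketch below is
illustrative (the `parse_size` helper is hypothetical, not
genbackupdata's actual code) and assumes decimal SI units, matching
the `k` = 1000 bytes behaviour shown above:

    SUFFIXES = {'': 1, 'k': 10**3, 'm': 10**6, 'g': 10**9, 't': 10**12}

    def parse_size(text):
        # Split a trailing unit letter, if any, off the digits.
        text = text.strip().lower()
        suffix = text[-1] if text and text[-1].isalpha() else ''
        number = text[:-1] if suffix else text
        return int(number) * SUFFIXES[suffix]

    assert parse_size('100') == 100
    assert parse_size('100k') == 100000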

Further, the data is mostly incompressible.

    AND directory bar is about 100000 bytes when compressed

# Multiple runs

Every run of genbackupdata produces the same sequence of random bytes:
running it twice with the same arguments will produce the same data
twice. Since genbackupdata does not overwrite existing files, the
second run adds a duplicate of every file, and the resulting data set
is highly compressible.

    SCENARIO run genbackupdata twice
    WHEN user runs genbackupdata --create=100k foo
    AND user runs genbackupdata --create=100k foo
    THEN directory foo contains 200000 bytes in files
    AND all files in foo are duplicates

# Control file size

The maximum size of output files can be specified. This allows the
user to generate a single, very large file, or a large number of small
files.

    SCENARIO control file size
    WHEN user runs genbackupdata --create=100k --file-size=1m bigfile
    THEN directory bigfile contains 1 file

    WHEN user runs genbackupdata --create=1000 --file-size=1 manyfiles
    THEN directory manyfiles contains 1000 files

# Appendix: scenario step implementations

This chapter implements the various scenario steps used in this
manual.

    IMPLEMENTS WHEN user runs genbackupdata --create=(\S+) (.+)
    import cliapp
    import yarnstep
    size = yarnstep.get_next_match()
    # The rest of the command line may contain extra options (such as
    # --file-size) before the final output directory name.
    args = yarnstep.get_next_match().split()
    opts = args[:-1]
    dirname = yarnstep.datadir(args[-1])
    bin = yarnstep.srcdir('genbackupdata')
    cliapp.runcmd([bin, '--create', size] + opts + [dirname])

    IMPLEMENTS THEN directory (\S+) contains (\d+) bytes in files
    import os
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_bytes = yarnstep.get_next_match_as_int()
    total_bytes = sum(
        os.path.getsize(x) for x in yarnstep.iter_over_files(root))
    assert wanted_bytes == total_bytes, \
        '%s != %s' % (wanted_bytes, total_bytes)

    IMPLEMENTS THEN directory (\S+) is about (\d+) bytes when compressed
    import zlib
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_bytes = yarnstep.get_next_match_as_int()
    data = ''.join(yarnstep.cat(x) for x in yarnstep.iter_over_files(root))
    compressed = zlib.compress(data)
    # The junk data should barely compress at all, so the compressed
    # size should be close to the claimed size.
    size_delta = len(compressed) - wanted_bytes
    assert abs(size_delta) < 1000, \
        '%s is not close to %s' % (len(compressed), wanted_bytes)

    IMPLEMENTS THEN all files in (\S+) are duplicates
    import collections
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    files = collections.Counter()
    for pathname in yarnstep.iter_over_files(root):
        files[yarnstep.cat(pathname)] += 1
    # After two identical runs, every distinct file content must occur
    # exactly twice.
    for data in files:
        assert files[data] == 2

    IMPLEMENTS THEN directory (\S+) contains (\d+) files?
    import yarnstep
    root = yarnstep.get_next_match_as_datadir_path()
    wanted_count = yarnstep.get_next_match_as_int()
    file_count = len(list(yarnstep.iter_over_files(root)))
    assert file_count == wanted_count, \
        '%s != %s' % (file_count, wanted_count)