summaryrefslogtreecommitdiff
path: root/README
blob: c6786e5efda983d728cd609293b7649c761f0a32 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
Larch, a Python B-tree library
==============================

This is an implementation of particular kind of B-tree, based on research by 
Ohad Rodeh. See "B-trees, Shadowing, and Clones" (link below) for details on 
the data structure. This is the same data structure that btrfs uses. Note 
that my implementation probably differs from what the paper describes or 
what btrfs implements.

The distinctive feature of this B-tree implementation is that a node is 
never modified. Instead, all updates are done by copy-on-write. Among other 
things, this makes it easy to clone a tree, and modify only the clone, while 
other processes access the original tree. This is utterly wonderful for my 
backup application, and that's the reason I wrote larch in the first place.

(The previous paragraph contains a small lie: larch does modify nodes in
place, if they are not shared between trees. This is necessary for
performance.)

I have tried to keep the implementation generic and flexibile, so that you 
may use it in a variety of situations. For example, the tree itself does not 
decide where its nodes are stored: you provide a class that does that for it. 
I have two implementations of the NodeStore class, one for in-memory and one 
for on-disk storage.

Documentation is sparse. Docstrings and reading the code are your best hope.
There is also an example program. The speed-test benchmark program may
also be useful. (Help and feedback and prodding welcome!)

See the file example.py for an example.

* Homepage: <http://liw.fi/larch/>
* Version control: `git clone git://git.liw.fi/larch`
* Rodeh paper: <http://liw.fi/larch/ohad-btrees-shadowing-clones.pdf>


Stability
---------

The larch on-disk file format and data structures are frozen as of version
0.31. That means any B-trees stored on disk with that version, or any
later version, will be readable by future versions of larch.

This prompts a new version numbering scheme for Larch. In the future,
the version number is of the form `FORMAT.DATE`, where `FORMAT` is
the on-disk format version for Larch (currently 1), and the `DATE`
is the date of the release.


Build and install
-----------------

setup.py should work as usual for distutils. But it might not: I have
only used enough to make the Debian package build.

You can also use the Debian packaging, if on Debian or a derivative.
Just run "debuild -us -uc" to build the package. Or you can get packages
from my site (see <http://liw.fi/code/> for details).


Hacking
-------

The actual tree code is in the larch directory, laid out as a normal
Python package.

* `tree.py` is the actual tree implementation
* `nodes.py` has tree nodes
* `nodestore.py` defines the interface for storing nodes on disk or
  wherever; `nodestore_disk.py` and `nodestore_memory.py` are two 
  implementations of the interface
* `codec.py` handles encoding of nodes for on-disk storage, and decoding too
* `forest.py` handles creation of new trees and committing things to a node
  store

Run `make check` to run the test suite. You will need my CoverageTestRunner
and LRUCache, and extrautils packages:

* <http://liw.fi/lru/>
* <http://liw.fi/coverage-test-runner/>
* <http://liw.fi/extrautils/>
* <http://liw.fi/tracing/>

The unit test modules are paired with their corresponding code modules:
for code module `foo.py` there exists a unit test module `foo_tests.py`.
CoverageTestRunner makes sure each code module's unit test achieve
100% test coverage for that module, not counting explicitly excluded
parts of the code. You can also use `nosetests` to run the tests.

I have two scripts to run simplistic benchmarks:

* `codec-speed 1000 no` tests speed of the NodeCodec class, 
  for encoding tree nodes for on-disk storage.
* `speed-test` tests insert and lookup speeds in trees. Use a `--keys`
  settings that is reasonably large, e.g., a million or ten.

You may want to run either or both before and after making any changes,
to see if you're making things better or worse.

If you have any patches, please send them to me (<mailto:liw@liw.fi>).
I prefer to merge from git repositories, but `git format-patch` or
just plain diffs are always OK.


Bugs and other things to hack
-----------------------------

See <http://liw.fi/larch/bugs/> for the bug tracker.

If you're interested in hacking the larch code, speed improvements
are always interesting. See the bug tracker for things known to be
slow, or run `speed-test` on large numbers of keys, and use profiling 
to see where time is wasted.


Legalese
--------

Copyright 2010, 2011, 2012, 2013  Lars Wirzenius

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.