[[!tag obnam-wishlist]]

Obnam should do some processing in the background, for example,
uploading data to the backup repository. This would allow better use of
the bottleneck resource (the network). Below is a journal entry with my
thoughts on how to implement that. It may be out of date by now, but
we'll see. I have a Python module that simplifies using
`multiprocessing` to run jobs in the background (which avoids the Python
global interpreter lock, in case that matters). --liw

---

Here's a design for Obnam concurrency that came to me the other
day while walking.

The core of Obnam (and larch) is quite synchronous: read data from a
file, read B-tree nodes, push chunks and B-tree nodes into the repository.
Some of that can be parallelized, but not easily: it's already tricky
code, and making it even more tricky is going to require very strong
justification.

Things like encrypting and decrypting files need to be done in parallel
with other work, for speed. These things are not really in the
core, and indeed are provided by plugins.

So here's a way to do them in parallel (a runnable sketch of the queue
pattern follows the list):

* the core code stays synchronous, the way it is now
* whenever larch code needs to read a B-tree node,
  it blocks until it gets it
* the node is read, synchronously, from wherever, and put into
  a background processing queue (using Python's `multiprocessing`)
* the code that waits for the node to be processed polls the queue,
  handles any other background jobs that happen to finish while it
  waits, and returns the desired node when it arrives
* when larch writes a node (after it gets pushed out
  of the upload queue inside larch), it is put into a background
  processing queue
* at the same time, if there were any finished background jobs, they're
  handled (written to repo)
* at the end of the run, the main loop makes sure any pending background
  jobs finish and are handled
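
Here's a minimal, runnable sketch of that pattern using
`multiprocessing`; the job payloads and the `process` function are
stand-ins, not actual Obnam code:

    import multiprocessing

    def process(data):
        # Stand-in for the real background work (encryption, etc).
        return data.upper()

    def worker(jobs, results):
        # Each background process loops until it sees the None sentinel.
        for pathname, data in iter(jobs.get, None):
            results.put((pathname, process(data)))

    if __name__ == '__main__':
        jobs = multiprocessing.Queue()
        results = multiprocessing.Queue()
        workers = [multiprocessing.Process(target=worker,
                                           args=(jobs, results))
                   for i in range(2)]
        for w in workers:
            w.start()
        jobs.put(('nodes/1', b'node one'))
        jobs.put(('nodes/2', b'node two'))
        # The synchronous caller polls for finished jobs and handles
        # them in whatever order they complete.
        for i in range(2):
            print(results.get())
        for w in workers:
            jobs.put(None)
        for w in workers:
            w.join()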

There's a complication that the B-tree code may need a node that is
not yet written to the repository, since it is still going through
a background processing queue.

I'm going to need to restructure how hooks process files that
are written to or read from the repository. Writing should happen
asynchronously: files are put in a queue and processed in the
background, and then written to the actual repository when background
processing is finished. Reading needs to happen synchronously, since
there's a B-tree call waiting for the result. To handle the case of
needing a node that is still being processed in the background, we need
to keep track of which nodes are in the background queue, and wait for
them to be done before reading them.

Reading would thus be something like this, implemented in the
`Repository` class:

    while wanted file is in write queue:
        process a write queue result
        
    read file from repository
    process file data through hooks
    return file
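
In Python that might look roughly like the function below; `in_flight`,
`process_one_result`, and `read_hooks` are names made up for
illustration:

    def read_file(fs, pathname, in_flight, process_one_result, read_hooks):
        # Drain finished background jobs until the wanted file is no
        # longer sitting in the write queue.
        while pathname in in_flight:
            process_one_result()
        data = fs.cat(pathname)
        # Run the contents through the read hooks (e.g. decryption).
        for hook in read_hooks:
            data = hook(pathname, data)
        return data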

The write queue is more complicated (again handled somehow in the
`Repository` class); it needs the following, sketched in code below:

* a `multiprocessing.Queue` instance for holding pending jobs
  - a job is a (pathname, file contents) pair
* another `Queue` instance for holding unhandled results
  - (pathname, file contents) pair, where the contents may have changed
* a `set` for holding file identifiers (paths) that have been put into
  the pending jobs queue, but not yet processed from the results queue
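
The bookkeeping itself is small; in code (again with made-up names):

    import multiprocessing

    jobs = multiprocessing.Queue()     # pending (pathname, contents) pairs
    results = multiprocessing.Queue()  # processed pairs; contents may differ
    in_flight = set()                  # queued but not yet handled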

Each plugin can provide one or more Unix commands (filters) through
which the file contents get piped. The background processes run each
filter in turn, giving the output of the previous one as input to the
next one.
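
A sketch of that chaining with `subprocess`, assuming each filter reads
stdin and writes stdout:

    import subprocess

    def run_filters(data, filters):
        # Pipe the contents through each plugin-provided command,
        # feeding the output of one filter into the next.
        for argv in filters:
            p = subprocess.Popen(argv,
                                 stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE)
            data, _ = p.communicate(data)
        return data

For example, `run_filters(contents, [['gzip'], ['gpg', '--encrypt', '-r', key]])`
would compress and then encrypt.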

To handle a result from a background job, the following needs to be done:

* remove the pathname from the `set`
* write the filtered file contents into the repository
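
As a function (same made-up names as before):

    def handle_one_result(results, in_flight, fs):
        # Take one finished job off the results queue and finish it.
        pathname, contents = results.get()
        in_flight.remove(pathname)
        fs.write_file(pathname, contents)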

To implement this, I'll do the following:

* All changes should be in `HookedFS`
* `write_file` and `overwrite_file` put things into the pending jobs queue,
  and also call a new method `handle_background_results`
* `cat` gets changed to wait for files in the write queue, calling
  `handle_background_results`
* `handle_background_results` will do what is needed
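
Put together, the changed methods might look something like this sketch.
The real `HookedFS` wraps a VFS object and a hook manager, and this
assumes worker processes (as in the first sketch) are feeding the
results queue, so it's a guess at the shape rather than code to drop in:

    import multiprocessing

    class HookedFS(object):

        def __init__(self, fs):
            self.fs = fs  # the underlying VFS
            self.jobs = multiprocessing.Queue()
            self.results = multiprocessing.Queue()
            self.in_flight = set()

        def write_file(self, pathname, contents):
            # overwrite_file would do the same.
            self.in_flight.add(pathname)
            self.jobs.put((pathname, contents))
            self.handle_background_results(block=False)

        def cat(self, pathname):
            # Wait until the wanted file has left the write queue.
            while pathname in self.in_flight:
                self.handle_background_results(block=True)
            return self.fs.cat(pathname)  # plus read hooks, as above

        def handle_background_results(self, block=False):
            # Handle every finished job; if block is set, wait for at
            # least one result first. Queue.empty() is approximate, but
            # there is only one consumer here.
            while block or not self.results.empty():
                pathname, contents = self.results.get()
                self.in_flight.remove(pathname)
                self.fs.write_file(pathname, contents)
                block = False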

This design isn't optimal, since writing things to the repository
isn't being done in parallel with other things, but I'll tackle that
problem later.


[[done]] this clearly isn't happening, so closing the old wishlist
bug --liw