# Introduction

Software development is a security risk.

Building software from source code and running it is a core activity of software development. Software developers do it on the machine they work on. Continuous integration systems do it on a server. These are fundamentally the same. The process is roughly as follows:

* install any dependencies
* build the software
* run the software to perform automated tests on it

When the software is run, even if only a small unit of it, it can do anything that the person running the build can do. For example, it can do any and all of the following, unless constrained:

* delete files
* modify files
* log into remote hosts using SSH
* decrypt or sign files with PGP
* send email
* delete email
* commit to version control repositories
* do anything with the browser that the person could do
* run things as root using sudo
* in general, cause mayhem and chaos

Normally, a software developer can assume that the code they wrote themselves won't ever do any of that. They can even assume that the people they work with write code that won't do any of that. In both cases, they may be wrong: mistakes happen. It's a well-guarded secret among programmers that they sometimes, even if rarely, make catastrophic mistakes.

Accidents aside, mayhem and chaos may be intentional. Your own project may not have malware, you may have vetted all your dependencies, and you may trust them. But your dependencies have dependencies, which have further dependencies, which have dependencies of their own. You'd need to vet the whole dependency tree. Even decades ago, in the 1990s, this could easily be hundreds of thousands of lines of code, and modern systems are much larger. Note that build tools are themselves dependencies, as is the whole operating system. Any code that is invoked in the build process is a dependency.

How certain are you that you can spot malicious code that's intentionally hidden and obfuscated?
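To make this concrete, here is a deliberately harmless sketch (hypothetical code, not taken from any real package) of what an innocent-looking unit test run during a build could do with the developer's privileges:

~~~python
# Hypothetical illustration: a "test" run during a build has the full
# privileges of the person running it. This harmless version only
# checks whether the access would succeed; a real attack would read
# the key and send it somewhere over the network.
from pathlib import Path

def test_greeting():
    assert "hello" in "hello, world"       # the innocent-looking part
    key = Path.home() / ".ssh" / "id_rsa"  # the not-so-innocent part
    return key.exists()  # a malicious test could upload the file here

test_greeting()
~~~

Nothing in any test framework distinguishes this from a legitimate test; the only real defence is constraining what the build can access in the first place.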
Are you prepared to vet any changes to any transitive dependencies?

Does this really matter? Maybe it doesn't. If you can't ever do anything on your computer that would affect you or anyone else in a negative way, it probably doesn't matter. Most software developers are not in that position.

This risk affects every operating system and every programming language. The degree to which it exists varies a lot. Some programming language ecosystems seem more vulnerable than others: the nodejs/npm one, for example, values tiny and highly focused packages, which leads to immense dependency trees. The more direct or indirect dependencies there are, the higher the chance that one of them turns out to be bad.

The risk also exists for more traditional languages, such as C. Few C programs have no dependencies. They all need a C compiler, which in turn requires an operating system, at least.

The risk is there for both free software systems and non-free ones. As an example, the Debian system is entirely free software, but it's huge: the Debian 10 (buster) release has over 50 thousand packages, maintained by thousands of people. While it's probable that none of those packages contains actual malware, it's not certain. Even if everyone who helps maintain Debian is completely trustworthy, the amount of software in Debian is much too large for all the code to be comprehensively reviewed. This is true for all operating systems that are not mere toys.

The conclusion here is that to build software securely, we can't assume all code involved in the build to be secure. We need something more secure. The Contractor aims to be a possible solution.

## Links to attacks

* [Malicious npm package opens backdoors on programmers' computers](https://www.zdnet.com/article/malicious-npm-package-opens-backdoors-on-programmers-computers/)

## Threat model

This section collects a list of specific threats to consider.
* accessing or modifying files not part of the build
* excessive use of build host resources
  * e.g., CPU, GPU, RAM, disk, etc.
  * this might happen to make unauthorized use of the resources, or just to be wasteful
* excessive use of network bandwidth
* attack on a networked target via a denial of service attack
  * e.g., the build joins a DDoS swarm, or sends fabricated SYN packets to prevent the target from working
* attack on the build host, or another host, via network intrusion
  * e.g., port scanning, probing for known vulnerabilities
* attack on the build host directly, without the network
  * e.g., by breaching security isolation using build host kernel or hardware vulnerabilities, or CI engine vulnerabilities
  * this includes eavesdropping on the host, and stealing secrets

## Status of this document

Everything about the Contractor is in its early stages of thinking, sketching, experimentation, and planning. Nothing is nailed down yet. Pre-ALPHA. Don't trust anything. Anything you trust may be used against you. Anything may change.

# Requirements

This chapter discusses the requirements for the Contractor solution. The requirements are divided into two parts: one that's based on the threat model, and another for requirements that aren't about security.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119][].

[RFC 2119]: https://tools.ietf.org/html/rfc2119

## Security requirements

These requirements stem from the threat model above.

* **FilesystemIsolation**: The Contractor MUST prevent the build from accessing or modifying any files outside the build. Build tools and libraries outside the source tree MUST be usable.
* The Contractor MUST prevent the build from using more than the user-specified amount of CPU time (**HardCPULimit**), disk space (**HardDiskLimit**), or network bandwidth (**HardBandwidthLimit**).
  Any attempt by the build to use more should fail. The Contractor MUST fail the build if the limits are exceeded.
* **HardRAMLimit**: The Contractor MUST prevent the build from using more than the user-specified amount of RAM. The Contractor MAY fail the build if the limit is exceeded, but is not required to do so.
* **ConstrainNetworkAccess**: The Contractor MUST prevent the build from accessing the network outside the build environment in ways that haven't been specifically allowed by the user. The Contractor SHOULD fail the build if it makes an attempt at such access. The user MUST be able to specify which hosts to access, and using which protocols.
* **HostProtection**: The Contractor SHOULD attempt to protect the host it's running on from non-networked attacks performed by the build. This includes vulnerabilities of the host's operating system kernel, virtualisation solution, and hardware.

## Non-security requirements

* **AnyBuildOS**: Builds SHOULD be able to run in any operating system that can be run as a virtual machine guest of the host operating system.
* **NoRoot**: Running the Contractor SHOULD NOT require root privileges. It's OK to require sufficient privileges to use virtualisation.
* **DefaultBuilder**: The Contractor SHOULD be easy to set up and to use. It should not require extensive configuration. Running a build should be as easy as running **make**(1) on the command line. It should be feasible to expect developers to use the Contractor for their normal development work.

# Architecture

This chapter discusses the architecture of the solution, with particular emphasis on threat mitigation.

The overall solution is use of nested virtual machines, running on a developer's host. The outer VM runs the Contractor itself, and is called "the manager VM". The inner VM runs the build, and is called the "worker VM". The manager VM controls the worker VM, proxies its external access, and prevents it from doing anything nefarious.
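One way the manager VM might constrain the worker's traffic is with firewall rules that drop everything except connections to the manager's own proxy. The following nftables rules are purely illustrative: the subnet, the manager address, the proxy port, and indeed the choice of nftables are assumptions, not decisions.

~~~
# Illustrative only: force all worker VM traffic through the manager's
# proxy. The 10.0.2.0/24 worker subnet, the 10.0.1.1 manager address,
# and proxy port 3128 are hypothetical.
table inet contractor {
    chain forward {
        type filter hook forward priority 0; policy drop;
        ip saddr 10.0.2.0/24 ip daddr 10.0.1.1 tcp dport 3128 accept
        ip daddr 10.0.2.0/24 ct state established,related accept
    }
}
~~~

A default-drop policy like this also gives the manager a natural place to log, and thus to fail the build on, any disallowed access attempt.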
The manager VM is managed by a command line tool. Developers only interact directly with the command line tool.

~~~dot
digraph "arch" {
    labelloc=b;
    labeljust=l;
    dev [shape=octagon label="Developer"];
    img [shape=tab label="VM image"];
    src [shape=tab label="Source tree"];
    ws [shape=tab label="Exported workspace"];
    apt [shape=tab label="APT repository"];
    subgraph cluster_host {
        label="Host system \n (the vulnerable bit)";
        contractor [label="Contractor CLI"];
        subgraph cluster_contractor {
            label="Manager VM \n (defence force)";
            manager;
            libvirt;
            subgraph cluster_builder {
                label="Worker VM \n (here be dragons)";
                style=filled;
                fillcolor="#dd0000";
                guestos [label="Guest OS"];
            }
        }
    }
    dev -> contractor;
    contractor -> manager;
    contractor -> guestos;
    img -> contractor;
    ws -> contractor;
    src -> contractor;
    apt -> guestos;
    manager -> libvirt;
    libvirt -> guestos;
    contractor -> ws;
}
~~~

This high-level design is chosen for the following reasons:

* it allows the build to happen in any operating system (**AnyBuildOS**)
* the Contractor is a VM and running it doesn't require root privileges; it has root inside both VMs if needed; all the complexity of setting things up so the worker VM works correctly is contained in the manager VM, and the user need do only minimal configuration (**NoRoot**)
* the command line tool for using the Contractor can be made as easy to use as any build tool, so that developers actually use it by default (**DefaultBuilder**)
* the manager VM can monitor and control the build (**HardCPULimit**, **HardBandwidthLimit**)
* the manager can supply the worker VM with only a specified amount of RAM and disk space (**HardRAMLimit**, **HardDiskLimit**)
* the manager can set up routing and firewalls so that the worker VM cannot access the network, except via proxies provided by the outer VM (**ConstrainNetworkAccess**)
* the nested VMs provide a smaller attack surface than the Linux kernel API, and this protects the host better than Linux container technologies, although
  it doesn't do much to protect against virtualisation or hardware vulnerabilities (**HostProtection**)

## Build process

The architecture leads to a build process that would work roughly like this:

* the manager VM is already running
* the developer runs the command line tool to do a build: `contractor build foo.yaml`
* the command line tool copies the worker VM image into the manager VM
* the command line tool boots the worker VM
* the command line tool installs any build dependencies into the worker VM
* the command line tool copies a previously saved dump of the workspace into the worker VM
* the command line tool copies the source code and build recipe into the worker VM's workspace
* the command line tool runs build commands in the worker VM, in the source tree
* the command line tool copies out the workspace into a local directory
* the command line tool reports to the developer build success or failure, and where the build log and build artifacts are

## Implementation sketch (FIXME: update)

FIXME: write this

# Acceptance criteria

This chapter specifies acceptance criteria for the Contractor, as *scenarios*, which also define how the criteria are automatically verified.

## Local use of the Contractor

These scenarios use the Contractor locally, to make sure it can do things that don't require the VM.

### Parse build spec

Make sure the Contractor can read a build spec and dump it back out, as JSON. This exercises the parsing code. JSON output is chosen, instead of YAML, to make sure the program doesn't just copy input to output.

~~~scenario
given file dump.yaml
when I invoke contractor dump dump.yaml
then the JSON output matches dump.yaml
~~~

~~~{.file #dump.yaml .yaml .numberLines}
worker-image: worker.img
ansible:
  - hosts: worker
    remote_user: worker
    become: true
    tasks:
      - apt:
          name: build-essential
    vars:
      ansible_python_interpreter: /usr/bin/python3
source: .
workspace: workspace
build: |
  ./check
~~~

## Smoke tests

These scenarios build a simple "hello, world" C application on a variety of guest systems, and verify that the resulting binaries output the desired greeting. The goal of these scenarios is to ensure the various Contractor components fit together, at least in the very basic case.

### Debian smoke test

This scenario checks that the developer can build a simple C program in the Contractor.

~~~disabled-scenario
given a working contractor
and file hello.c
and file hello.yaml
and file worker.img from source directory
when I run contractor build hello.yaml
then exit code is 0
then file ws/src/hello exists
~~~

~~~{.file #hello.c .c .numberLines}
#include <stdio.h>

int main()
{
    printf("hello, world\n");
    return 0;
}
~~~

~~~{.file #hello.yaml .yaml .numberLines}
worker-image: worker.img
ansible:
  - hosts: worker
    remote_user: worker
    become: true
    tasks:
      - apt:
          name: build-essential
    vars:
      ansible_python_interpreter: /usr/bin/python3
source: .
workspace: ws
build: |
  gcc hello.c -o hello
  ./hello
~~~

---
title: "Contractor: build software securely"
author: "Lars Wirzenius"
bindings:
  - subplot/contractor.yaml
  - subplot/vendor/runcmd.yaml
  - subplot/files.yaml
functions:
  - subplot/contractor.py
  - subplot/vendor/runcmd.py
  - subplot/files.py
template: python
documentclass: report
classes:
  - c
  - disabled-scenario
abstract: |
  Building software typically requires running code downloaded from the
  Internet. Even when you're building your own software, you usually
  depend on libraries and tools, which in turn may depend on further
  things. It is becoming infeasible to vet the whole set of software
  running during a build. If a build includes running local tests (unit
  tests, some integration tests), the problem gets worse in magnitude,
  if not quality.
  Some software ecosystems are especially vulnerable to this (nodejs,
  Python, Ruby, Go, Rust), but it's true for anything that has
  dependencies on any code from outside its own code base, and even if
  all the dependencies come from a trusted source, such as the
  operating system vendor or a Linux distribution.

  The Contractor is an attempt to be able to build software securely,
  by leveraging virtual machine technology. It attempts to be secure,
  convenient, and reasonably efficient.
...