From 4596ce6314bfa88340466e7f405dcd392e0d4ebe Mon Sep 17 00:00:00 2001
From: Lars Wirzenius
Date: Mon, 20 May 2019 19:48:22 +0300
Subject: Drop: generated versions
---
 ci-arch.html | 318 -----------------------------------------------------------
 ci-arch.pdf  | Bin 289114 -> 0 bytes
 2 files changed, 318 deletions(-)
 delete mode 100644 ci-arch.html
 delete mode 100644 ci-arch.pdf

diff --git a/ci-arch.html b/ci-arch.html
deleted file mode 100644
index 9612eda..0000000
--- a/ci-arch.html
+++ /dev/null
@@ -1,318 +0,0 @@
-
Thoughts on WMF CI architecture
-

Thoughts on WMF CI architecture

-

Lars Wirzenius, Release Engineering

-

WORK IN PROGRESS

-
- -

Revisions of this document

- -

Introduction

-

The CI WG plans to replace the current WMF CI system with one of Argo, GitLab CI, or Zuul v3. These candidates were selected in the first phase of the CI WG's work.

-

We aim to do “continuous deployment”, not only “continuous integration” or “continuous delivery”. The goal is to deploy changes to production as often and as quickly as possible, without compromising on the safety and security of the production environment.

-

This document goes into more detail on how the new CI system should work, without (yet) discussing which replacement is chosen. It describes a meta-architecture, if you wish.

-

As of this writing, we assume that the future CI system will build on and deploy to containers orchestrated by Kubernetes.

-

An important change is that, as far as possible, all software will be deployed to containers orchestrated by Kubernetes.

-

Vision for CI

-

This is Lars’s personal opinion, for now, but it’s based on discussions with various people while at WMF. It’s not expected to be new, radical, or controversial, compared to the status quo.

-

In the future, CI at WMF serves WMF, its developers, and the Wikipedia movement by making software development more productive, more confident, and faster. The cycle time of changes (the time from idea to running in production) is short: for a trivial change, as little as five minutes. At the same time, the safety and security of production is protected: malicious changes do not get deployed, mistakes are rare, and can easily be fixed or the problematic change reverted.

-

Production here means all the software needed to run all the sites (Wikipedias in different languages, Commons, etc), as well as supporting services, including tooling and services that support development.

-

Overall solution approach

-

The overall approach to the architecture of the CI system, and the workflow supported by it, is to keep all changes in version control (git), which includes code, configuration, and scripts for building and deploying. When a change to version control is pushed, CI builds and tests the change, humans review the change, and if all seems to be in order, CI deploys to production.

-

Stakeholders

-

Stakeholders in the WMF CI system include:

- -

Requirements

-

This chapter lists the requirements we have for the CI system, which the design aims to fulfil.

-

Each requirement is given a semi-mnemonic unique identifier, so it can be referred to easily.

-

The goal is to make requirements as clear and atomic as possible, so that the implementation can be more easily evaluated against each requirement: it’s better to split a big, complicated requirement into smaller ones so they can be considered separately. Requirements can be hierarchical: the original requirement can be a parent to all its parts.

-

FIXME: We may want to have a way to track which requirements are being fulfilled, or tested by automated acceptance tests. Need to add something for this, maybe a spreadsheet.

-

These requirements were originally written up in the WG wiki pages and have been changed a little compared to that (as of the 21 March 2019 version).

-

Very hard requirements

-

These are non-negotiable requirements that must all be fulfilled by our future CI system.

- -

Hard requirements

-

These are not absolute requirements, and can be negotiated, but only to a minor degree.

- -

Softer requirements

-

These requirements are even more easily negotiated.

- -

Would be nice

-

These are so soft they aren’t even requirements; they are more like wish list items.

- -

Architecture

-

The WMF development ecosystem

-
Figure: The WMF development ecosystem, roughly
-
-

The figure above is simplistic, but gives the general idea of what happens when a developer is finished with a change:

-
  1. A developer pushes a change to Gerrit, which triggers CI.
  2. CI builds and tests the change (commit stage).
  3. CI deploys to a test environment and runs tests against it (acceptance test stage); if everything is OK, Gerrit is notified and requests code reviews from relevant parties.
  4. Testers can request that CI deploy the change to an environment dedicated to manual testing.
  5. After a successful code review, CI merges the change to the master branch, runs all automated tests again, and deploys to the production environment.
-

The commit and acceptance stages are triggered as soon as a developer pushes changes to be reviewed. Human reviews won’t be requested automatically until the two stages pass, as there’s no point in spending human attention on things that are not going to be candidates for deployment to production. Developers may request reviews of work-in-progress changes whenever they want. The two stages may be re-run after code review, to make sure nothing unforeseen has changed while the review took place.
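
The review-gating rule above can be sketched as a small decision function. This is only an illustration, not part of any chosen CI system; the `Change` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical summary of a change as CI sees it."""
    commit_passed: bool
    acceptance_passed: bool
    # Set when the developer explicitly asks for review, for example
    # for a work-in-progress change.
    review_requested_by_developer: bool = False


def should_request_review(change: Change) -> bool:
    """Decide whether human reviewers should be asked to look at a change."""
    # A developer may always ask for a review explicitly.
    if change.review_requested_by_developer:
        return True
    # Otherwise reviews are requested only after both automated stages pass.
    return change.commit_passed and change.acceptance_passed
```

With this rule, a change that fails the acceptance stage never reaches reviewers unless its developer asks for feedback.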

-

Other stages may run in parallel with code review, and if they fail they may nullify the release candidacy of the change. Examples are stages for manual testing, capacity testing, and security testing or review; depending on the change and the component in question, some or all of these may be necessary.

-

Normal change to an individual component

- -

Interdependent changes

- -

Security embargoed change

- -

Log storage

- -

Artifact storage

- -

Credentials management and access control

- -

Credentials and other secrets are needed to allow access to servers, services, and files. They are highly security-sensitive data. The CI system needs to protect them, but allow controlled use of them.

-

Example: a CI job needs to deploy a Docker image with a tested and reviewed change as a container orchestrated by production Kubernetes. For this, it needs to authenticate itself to the Kubernetes API. This is typically done by a username/password combination, but might be an API token of some kind (though it doesn’t really matter; it’s all just secret bits at some level). How will the future CI system handle this?

-

Example: for tests, and in production, a MediaWiki container needs access to a MariaDB database, and MW needs to authenticate itself to the database. MW gets the necessary credentials for this from its configuration, which CI will install during deployment. The configuration will be specific to what the container is being used for: if it’s for testing a change, the configuration only allows access to a test database, but for production it provides access to the production database.
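
As a rough sketch of the MediaWiki example, CI could pick the database configuration from the purpose of the container. Everything named here is an assumption made for illustration: the `DatabaseConfig` class, the host names, and the secret names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseConfig:
    host: str
    # Name of a secret held by CI; the secret value itself is never stored here.
    credentials_ref: str


# Hypothetical mapping from container purpose to the configuration CI installs.
CONFIG_BY_PURPOSE = {
    "test": DatabaseConfig("db.test.example", "mw-test-db"),
    "production": DatabaseConfig("db.prod.example", "mw-prod-db"),
}


def config_for(purpose: str) -> DatabaseConfig:
    """Return the database configuration CI installs for a container."""
    try:
        return CONFIG_BY_PURPOSE[purpose]
    except KeyError:
        # Refuse unknown purposes rather than falling back to some default.
        raise ValueError(f"unknown deployment purpose: {purpose!r}") from None
```

The point of the sketch is that a container deployed for testing can only ever be handed the test database configuration.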

-

Builds are done in isolated containers. These containers have no credentials. Build artifacts are extracted from the containers and stored in an artifact storage system by the CI system, and this extraction is done in a controlled environment, where only vetted code is run, not code from the repository being tested. The build environment can’t push artifacts directly to the artifact store.

-

Deployments happen in controlled environments, with access to the credentials needed for deployment. The deployment retrieves artifacts from the artifact storage system. The deployments are to containers, and the deployed containers don’t have any credentials, unless CI has been configured to install them, in which case CI installs the credentials for the intended use of the container.

-

Note that credentials should not come directly from the source code of the deployed program. CI deploys configuration when it deploys the software. This way, the same software (build artifacts) can be deployed to different environments. (This may be complicated by the way MediaWiki is configured, using a PHP file in the source tree. This will need discussion.)

-

Tests run against software deployed to containers, and those containers only have access to the backing services needed for the test, and may even be firewalled to not have access to any other network locations.

-

Suggestion: Deployments will be done in dedicated deployment environments, which run a “pingee” service. When a pipeline executes a deployment stage, deploying to any environment, the stage runs in a suitable container, but doesn’t actually do the deployment itself. Instead, it “pings” a deployment service, with information about who is deploying, what, and where. The deployment service inspects the change and, if it looks acceptable, does the actual deployment to the desired environment. The deployment service has access to the credentials it needs for accessing the artifacts and doing the deployment. There may be several deployment services, for deploying to environments with different security needs.
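
A minimal sketch of the suggested “pingee” service follows, assuming a request that carries who is deploying, what, and where. The class names, fields, and the acceptance rule are hypothetical; a real service would also authenticate the caller and use its own credentials to fetch artifacts and deploy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRequest:
    who: str               # identity of the pipeline or person asking
    artifact: str          # identifier of the artifact in the artifact store
    environment: str       # target environment, e.g. "test" or "production"
    stages_passed: bool    # have the automated stages passed?
    review_approved: bool  # has code review approved the change?


class DeploymentService:
    """Receives pings and decides by itself whether to deploy."""

    def __init__(self, environments):
        self.environments = set(environments)
        self.deployed = []  # record of (environment, artifact) pairs

    def inspect(self, request):
        """Return a reason for refusal, or None if the request is acceptable."""
        if request.environment not in self.environments:
            return "this service does not deploy to that environment"
        if request.environment == "production" and not (
            request.stages_passed and request.review_approved
        ):
            return "production needs passed stages and an approved review"
        return None

    def ping(self, request) -> bool:
        """Handle a ping from a pipeline; return True if deployed."""
        if self.inspect(request) is not None:
            return False
        # Only the service, never the pipeline stage, would touch credentials.
        self.deployed.append((request.environment, request.artifact))
        return True
```

Running one service per security level, for example `DeploymentService(["test"])` and `DeploymentService(["production"])`, keeps production credentials out of reach of the service that handles test deployments.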

-

The CI pipeline

-
Figure: The default pipeline
-
-

CI will provide a default pipeline for all projects. Projects may use that or specify another one.

-

The pipeline will be divided into several stages. The mandatory stages for all changes and all projects are commit, acceptance, and deployment to production. Other stages may be added for specific changes or projects as needed.

-

The goal is that if the commit and acceptance stages pass, the change is a candidate that can be deployed to production, unless the project is such that it needs (say) manual testing or another human decision before production deployment. Likewise, if the component or the change is particularly security or performance sensitive, stages that check those aspects may be required. CI will have ways of indicating the required stages per component, and also per change. (It is unclear how this will be managed.)

-

If the commit or acceptance stage fails, there is no production candidate. The pipeline as a whole fails. Any artifacts built by the pipeline will not be deployable to production, but they may be deployable to test environments, or downloaded by developers for inspection.
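
The candidacy rule can be sketched as follows. The stage names come from the text above; the function names and the target labels are hypothetical, and code review is deliberately left out since it gates the deployment stage, not candidacy.

```python
from typing import Iterable, Mapping, Set

# Stages that must pass for every change in every project.
MANDATORY_STAGES = ("commit", "acceptance")


def is_production_candidate(
    results: Mapping[str, bool], required_extra: Iterable[str] = ()
) -> bool:
    """True if all mandatory and additionally required stages passed."""
    needed = MANDATORY_STAGES + tuple(required_extra)
    # A stage with no recorded result counts as not passed.
    return all(results.get(stage, False) for stage in needed)


def allowed_targets(
    results: Mapping[str, bool], required_extra: Iterable[str] = ()
) -> Set[str]:
    """Where artifacts built by this pipeline run may go."""
    # Artifacts of a failed run may still be inspected or tested.
    targets = {"test", "developer-download"}
    if is_production_candidate(results, required_extra):
        targets.add("production")
    return targets
```

A component that is security sensitive could pass `required_extra=["security"]`, so that a missing or failed security stage blocks candidacy even when the two mandatory stages pass.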

-

The commit stage

-

The commit stage builds any deployable artifacts, such as executable binaries, minified JavaScript, translation files, or Docker images. It is important that artifacts don’t get rebuilt by later stages, because rebuilding does not always result in bitwise identical output. The goal is to build once, test the artifacts, and deploy the tested artifacts, rather than rebuilding and possibly deploying something different from what was tested.
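
One way to make “build once, deploy what was tested” checkable is to identify each artifact by a cryptographic digest of its contents. The in-memory `ArtifactStore` below is a hypothetical stand-in for a real artifact storage system; only the digest idea is the point.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest used as the artifact's identity."""
    return hashlib.sha256(data).hexdigest()


class ArtifactStore:
    """Toy content-addressed store; a real one would be a separate service."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store an artifact once, after the commit stage, and return its key."""
        key = digest(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        """Fetch an artifact for testing or deployment, verifying its content."""
        data = self._blobs[key]
        if digest(data) != key:
            raise ValueError("artifact does not match its recorded digest")
        return data
```

Later stages would pass the key along, not the source code, so the bytes deployed to production are the same bytes the acceptance stage tested.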

-

The commit stage also runs unit tests, and any other tests that are quick and can be run in isolation from other parts of the system. The commit stage does not have access to backing services, such as databases or other components of the overall system. For example, when the pipeline processes a change to a MediaWiki extension, the commit stage doesn’t have access to MediaWiki core or the MariaDB database that MediaWiki uses. Integration or system tests should be done in the acceptance test stage.

-

The commit stage also runs code health checks.

-

The commit stage is expected to be fast, aiming at less than five minutes, so that we can expect developers to wait for it to pass successfully. This will be a new requirement on our developers.

-

The commands to build (compile) or run automated tests are stored in the repository, either explicitly, or by indicating the type of build needed. There might be a .pipeline/config.yaml file in the repository, which specifies that make is the command that builds the artifacts. Alternatively, the file may specify that it’s a Go project, and CI would know how to build a Go project. In this case we can change the commands to build a Go project by changing CI only, without having to change each git repository with a Go program.
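
The two styles could be resolved roughly as below. The configuration is shown as an already parsed dictionary; the keys `build` and `type`, and the default command for Go projects, are assumptions for illustration, as the format of the file has not been decided.

```python
# Build commands that CI knows for each declared project type (hypothetical).
DEFAULT_BUILD_COMMANDS = {
    "go": ["go", "build", "./..."],
}


def build_command(config: dict) -> list:
    """Pick the build command for a repository from its pipeline config."""
    # Explicit style: the repository names the command itself.
    if "build" in config:
        return list(config["build"])
    # Declarative style: the repository names its type, CI knows the rest.
    project_type = config.get("type")
    if project_type in DEFAULT_BUILD_COMMANDS:
        return list(DEFAULT_BUILD_COMMANDS[project_type])
    raise ValueError("config names neither a build command nor a known type")
```

For example, `build_command({"build": ["make"]})` yields the explicit command, while `build_command({"type": "go"})` yields whatever CI currently defines for Go projects, which is what lets the Go build be changed in one place.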

-

Only the declarative style will be possible for building Docker images, as we want control over how that is done (SECURE requirement).

-

CI may enforce specific additional commands to run, to build or test further things; this can be used by RelEng to enforce certain things. For example, we may enforce code health checks, or enable (or disable) debug symbols in all builds. Such enforcement will be done in collaboration with our developers.

-

Any build dependencies needed during the commit stage must be specified explicitly. For example, the minimum required version of Go that should be installed in the build environment would be a build dependency. If a project build-depends on another project, it needs to specify which project, and which artifacts it needs installed from the other project. Explicit build dependencies are more work, but result in fewer problems due to broken heuristics.
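
A sketch of checking explicit build dependencies against a build environment follows. It assumes dependencies are declared as minimum versions in plain dotted-number form; the function names and the message format are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as "1.12" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def unmet_dependencies(required: dict, installed: dict) -> list:
    """List the declared build dependencies the environment fails to meet.

    required maps a tool name to its minimum version; installed maps a
    tool name to the version present in the build environment.
    """
    problems = []
    for name, minimum in sorted(required.items()):
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not installed (need >= {minimum})")
        elif parse_version(have) < parse_version(minimum):
            problems.append(f"{name}: have {have}, need >= {minimum}")
    return problems
```

Because nothing is guessed, a missing declaration shows up as a clear failure in the commit stage instead of as a build that only works by accident.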

-

The acceptance stage

-

During the acceptance stage, CI deploys artifacts built in the commit stage to a production-like system that has the same versions of all software as production, except for the changes being processed by the pipeline. CI will then run automated acceptance tests, and other integration and system tests, against the deployed software. The test environment is clean, empty, and well-known, unless and until the test suite inserts data or makes changes.

-

The acceptance stage can take time. Developers are not expected to wait until it is finished before they move on to working on something else.

-

Deployment to production

-

If prior stages have passed successfully, and manual code review (“Gerrit CR:+2 vote”) has approved the change, this stage deploys the change to production.

-

Manual tests

-

Testers may instruct CI to deploy any recently built set of artifacts to a dedicated test environment, and can use the software in that environment, where it is isolated from others and won’t suddenly change underneath them. The details of how this will be implemented are to be determined later.

-

This feature of the CI can also be used to demonstrate upcoming features that are not yet ready to be deployed to or enabled in production.

-

Capacity tests, non-functional requirements

-

Capacity tests, and other tests for non-functional requirements, will also be done in dedicated, isolated production-like environments. RelEng will work with the performance team to sort out the details.

-

CI implementation

-

FIXME This needs to be written, but it needs a lot of thinking first

-

Automated acceptance tests

- -

Transitioning to new CI system

-

FIXME This needs to be written.

-
-
diff --git a/ci-arch.pdf b/ci-arch.pdf
deleted file mode 100644
index bac881f..0000000
Binary files a/ci-arch.pdf and /dev/null differ
--
cgit v1.2.1