TITLE: Future WMF CI architecture


# Introduction

* The CI WG plans to replace the current WMF CI system with one of
  Argo, GitLab CI, or Zuul v3.

* This document describes in more detail how the new CI system should
  work, without (yet) discussing which replacement is chosen. A
  meta-level architecture, if you wish.

* It is assumed as of the writing of this document that future CI will
  build on and deploy to containers orchestrated by Kubernetes.

# Requirements

* This chapter lists the requirements we have for the CI system and
  which we design the system to fulfil.

* Each requirement is given a semi-mnemonic unique identifier, so it
  can be referred to easily.

* The goal is to make requirements be as clear and atomic as possible,
  so that the implementation can be more easily evaluated against the
  requirement: it's better to split a big, complicated requirement
  into smaller ones so they can be considered separately. The original
  requirement can be a parent to all its parts.

* FIXME: We may want to have a way to track which requirements are
  being fulfilled, or tested by automated acceptance tests. Need to
  add something for this, maybe a spreadsheet.

* These requirements were originally written up at
  <https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG/Requirements>
  and have been changed a little compared to that (as of the 21 March
  2019 version).

## Very hard requirements

* These are non-negotiable requirements that must all be fulfilled by
  our future CI system.

* **(SELFHOSTABLE)** Must be hostable by the Foundation. It's not
  acceptable for WMF to rely on an outside service for this.

    * **(FREESOFTWARE)** Must be free software / open source. "Open
      core" like GitLab is good enough, as long as we only need the
      parts that provide software freedom.

      This is partly due to the **SELFHOSTABLE** requirement, but also
      because a WMF value is to prefer open source.

* **(GITSUPPORT)** Must support git. We're not switching version control
  systems for CI.

* **(UNDERSTANDABLE)** Must be understandable without too much effort to
  our developers so that they can use CI/CD productively.

* **(SELFSERVE)** Must support self-serve CI, meaning we don't block
  people if they want CI for a new repo. Due to **PROTECTPRODUCTION**,
  there will probably need to be some human approval requirement for
  new projects, but as much as possible, people should be allowed to
  do their work without having to ask permission.

    * **(SELFSERVE2)** Should allow the developers to define or
      declare at least parts of the pipeline jobs in the repository:
      what commands to run for building, testing, etc.

## Hard requirements

* These are not absolute requirements, and can be negotiated, but only
  to a minor degree.

* **(FAST)** Must be fast enough that it isn't perceived as a bottleneck
  by developers. We will need a metric for this.

    * **(SHORTCYCLETIME)** Must enable us to have a short cycle time
      (from idea to running in production). CI is not the only thing
      that affects this, but it is an important factor. We probably
      need a metric for this.

* **(TRANSPARENT)** Must make its status and what-is-going-on visible so
  that its operation can be monitored and so that our developers can
  check the status of their builds themselves. The overall status of
  CI must also be visible so that, for example, developers can see
  whether their build is blocked waiting on others.

* **(FEEDBACK)** Must provide feedback to the developers as early as
  possible for the various stages of a build, especially the early
  stages ("can get source from git", "can build", "can run unit
  tests", etc.).

  The goal is to give feedback as soon as possible, especially in the
  case of the build failing.

* **(FEEDBACK2)** Must support providing feedback via Gerrit, IRC, and
  Phabricator, at the very least. These are our current main feedback
  channels.

* **(SECURE)** Must be secure enough that we can open it to community
  developers to use without too much supervision.

* **(MAINTAINED)** Must be maintained and supported upstream. The CI
  system should not require substantial development from the
  Foundation. Some customization is expected to be necessary.

* **(MANYREPOS)** Must be able to handle the number of repositories,
  projects, builds, and deployments that we have, and will have in the
  foreseeable future.

* **(METRICS)** Must enable us to instrument it to get metrics for CI use
  and effectiveness as we need. Things like cycle times, build times,
  build failures, etc.

* **(GERRIT)** Must work with Gerrit as well as other self-hostable
  code-review systems (e.g., GitLab), if we decide to move to that
  later. This means, code review happens on Gerrit, after building and
  automated tests pass, and positive code review triggers deployment
  to production.

* **(NOREBUILDING)** Must promote (copy) Docker images and other build
  artifacts from "testing" to "staging" to "production", rather than
  rebuilding them, since rebuilding takes time and can fail. Once a
  binary, Docker image, or other build artifact has been built,
  exactly that artifact should be tested, and eventually deployed to
  production.

* **(LOCALTESTS)** Must allow developers to replicate locally the tests
  that CI runs. This is necessary to allow lower friction in
  development, as well as to aid debugging. For example, if CI builds
  and tests using a Docker container, a developer should be able to
  download the same image and run the tests locally.

* **(AUTOMATEDEPLOYMENT)** Must allow deployment to be fully automated.

    * **(AUTOMATEDSELFDEPLOYMENT)** Must be automatically deployable
      by us or SRE, onto a fresh server.

* **(HSCALABLE)** Must be horizontally scalable: we need to be able to add
  more hardware easily to get more capacity. This is particularly
  important for build workers, which are the most likely bottleneck.
  The same probably applies to environments used for testing.

* **(PROGLANGS)** Must be able to support all programming languages we
  currently support or are likely to support in the future. These
  include, at least, shell, Python, Ruby, Java, PHP, and Go. Some
  languages may be needed in several versions.

* **(OUTPUTLINKS)** Must support HTTP linking to build results for
  easier reference and discussion. This way a build log, or a build
  artifact, can be referenced using a simple HTTP (or HTTPS) link.

* **(ARTIFACTARCHIVE)** Should allow archiving build logs,
  executables, Docker images, and other build artifacts for a long
  period.

    * **(RETENTION)** The retention period should be configurable
      based on artifact type, and whether the build ended up being
      deployed to production.

* **(CONFIGVC)** Must keep configuration in version control. This is
  needed so that we can track changes over time.

* **(GATING)** Must support gating / pre-merge testing. FIXME: This
  needs to be explained.

* **(PERIODICBUILDS)** Must support periodic / scheduled testing. This
  is needed so that we can test that changes to the environment
  haven't broken anything. An example would be changes to Debian, upon
  which we base our container images.

* **(POSTMERGETESTS)** Must support post-merge testing. FIXME: This
  needs to be explained.

* **(CIMERGES)** Must support tooling to do the merging, instead of
  developers. We don't want developers merging by hand and pushing the
  merges. CI should test changes and merge only if tests pass, so that
  the branches for main lines of development are always releasable.

* **(TESTVC)** Must support storing tests in version control. This is
  probably best achieved by having tests be stored in the same git
  repository where the code is.

* **(BUILDDEPS)** Must have some way to declare dependent repositories /
  software needed for testing. FIXME: This needs to be explained.

* **(TESTSERVICES)** Must support services for tests; e.g., some
  PHPUnit tests require MySQL. These are most important for
  integration tests. Proper unit tests do not depend on anything
  external. However, integration tests may well need MediaWiki, some
  specific extensions, and backing services, such as databases, "oid"
  services, and possibly more. CI needs to be able to provide such
  environments for testing.

* **(OTHERGITORTICKETING)** Must allow changing git repository, code
  review, and ticketing systems from Gerrit and Phabricator. We are
  not currently looking at switching away from Gerrit and Phabricator,
  but the future CI solution should not lock us into specific code
  review or ticketing solutions.

* **(PROTECTPRODUCTION)** Must protect production by detecting problems
  before they're deployed, and must in general support a sensible
  CI/CD pipeline. This is necessary for the safety and security of
  our production systems, a higher speed of development, and higher
  productivity. The protection brings developer confidence, which
  tends to bring speed and productivity.

    * **(ENFORCETESTS)** Must allow the Release Engineering team to
      enforce tests on top of what a self-serving developer specifies,
      to allow us to set minimal technical standards.

* **(CACHEDEPS)** Must support dependency caching – we have castor, maybe
  we could do better? Maybe some CI systems have this figured out?
  This means, for example, caching npm and PyPI packages so that every
  build doesn't need to download them directly from the centralised
  package repositories. This is needed for speed.
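
To illustrate **NOREBUILDING**, here is a minimal Python sketch that models promotion as copying a reference to an already-built artifact from one environment to the next, identified by its content digest. The `ArtifactStore` class, its methods, and the artifact name are hypothetical and exist only for this illustration; a real setup would use a container registry or artifact store, and none of this is tied to any of the candidate CI systems.

```python
import hashlib


class ArtifactStore:
    """Hypothetical in-memory store: each environment maps names to digests."""

    ENVIRONMENTS = ("testing", "staging", "production")

    def __init__(self):
        self.envs = {env: {} for env in self.ENVIRONMENTS}

    def publish(self, name, data):
        """Record a freshly built artifact; it always enters via 'testing'."""
        digest = hashlib.sha256(data).hexdigest()
        self.envs["testing"][name] = digest
        return digest

    def promote(self, name, source, target):
        """Copy the digest one environment forward; nothing is rebuilt."""
        order = self.ENVIRONMENTS
        if order.index(target) != order.index(source) + 1:
            raise ValueError(f"cannot promote from {source} to {target}")
        digest = self.envs[source][name]
        self.envs[target][name] = digest
        return digest


store = ArtifactStore()
built = store.publish("service-image", b"image built by the commit stage")
store.promote("service-image", "testing", "staging")
deployed = store.promote("service-image", "staging", "production")
print(built == deployed)  # prints True
```

Because only the digest moves, what reaches production is by construction the same artifact that was tested, and skipping an environment is rejected.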

## Softer requirements

* These requirements are even more easily negotiated.

* **(HA)** Should be highly available - can restart any component without
  disrupting service.

* **(LIVELOG)** Should have live console output of build.

* **(MAXBUILDTIME)** Should have build timeouts so that a build may
  fail if it takes too long. Among other reasons, this is useful to
  automatically work around builds that get "stuck" indefinitely.

* **(CLEANWORKSPACE)** Should provide a clean workspace for each test
  run - either a clean VM or container.

* **(RATELIMIT)** Should have rate limiting - one user/project can not
  take over most/all resources.

* **(CHECKSIG)** Should support validation and creation of GPG/PGP-signed
  git commits.

* **(SECRETS)** Should support secure storage of credentials / secrets.
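
As a sketch of what **MAXBUILDTIME** could mean in practice, the following Python snippet runs a build step as a subprocess and treats exceeding a time limit as its own outcome. The function name `run_build_step` and the example commands are assumptions made for illustration; a real CI system would enforce the limit in its own job runner.

```python
import subprocess
import sys


def run_build_step(command, timeout_seconds):
    """Run one build command; report 'timeout' if it exceeds the limit."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "passed" if result.returncode == 0 else "failed"


# A step that sleeps far longer than its limit stands in for a stuck build.
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
quick = [sys.executable, "-c", "pass"]
print(run_build_step(quick, timeout_seconds=30))
print(run_build_step(stuck, timeout_seconds=1))
```

Reporting a timeout separately from an ordinary failure lets the build log show that the step was stuck rather than broken.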

## Would be nice

* These are so soft they aren't even requirements, and more wish list
  items.

* **(LIMITBOILERPLATE)** Would be nice for test abstractions to limit
  boiler-plate, i.e., all of our services are tested roughly the same
  way without having to copy instructions to every repository.

* **(PRIORITIZEJOBS)** Would be nice to prioritize jobs.

     * Use case: if there is a queue of jobs, there should be some
       mechanism of jumping that queue for jobs that have a higher
       priority.

     * We currently have a Gating queue that is a higher priority than
       periodic jobs that calculate Code Coverage.

* **(ISOLATION)** Would be nice to support isolation / sandboxing.

     * Jobs should be isolated from one another.

     * Jobs should be able to install apt-packages without affecting
       dependencies of other jobs.

* **(CONTROLAFFINITY)** Would be nice to have configurable job
  requirements/affinity.

     * Be able to schedule a job only on nodes that have at least X
       available disk space/ram/cpu/whatever OR try to schedule on
       nodes where a current build of this job isn't already running.

* **(POSTMERGEBISECT)** Would be nice to support post-merge git-bisect
  to find the patch that caused a particular problem with a Selenium
  test.

* **(DEPLOYWHEREVER)** Would be nice to have a mechanism for
  deployment to staging, production, pypi, packagist, toollabs. We
  could do with a way to deploy to any of several possible
  environments, for various use cases, such as bug reproduction, manual
  exploratory testing, capacity testing, and production. FIXME: what
  do pypi and packagist do in the list?

* **(MATRIXBUILDS)** Would be nice to have efficient matrix builds.

     * E.g., we currently run phpunit tests and browser tests for the
       Cartesian product of {PHP7, PHP7.1, PHP7.2, HHVM} x {MySQL,
       SQLite, PostgreSQL} x {Composer, MediaWiki vendor}, but we
       perform setup/git clone for all of those tests. Doing that in a
       space- and time-efficient way would be good.

* **(MOBILE)** Would be nice to support building and testing mobile
  applications (at minimum for iOS and Android).

* **(EMBARGO)** Would be nice to be able to run CI for secret/security
  patches. This means CI should be able to build and deploy changes
  that can't be made public yet, for security embargo reasons.
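
The **MATRIXBUILDS** example can be made concrete with a short Python sketch that enumerates the Cartesian product from the requirement and performs the shared checkout once instead of once per combination. The function names and the workspace path are hypothetical; how a real CI system shares a checkout between jobs is left open here.

```python
from itertools import product

RUNTIMES = ["PHP7", "PHP7.1", "PHP7.2", "HHVM"]
DATABASES = ["MySQL", "SQLite", "PostgreSQL"]
DEPENDENCIES = ["Composer", "MediaWiki vendor"]

checkouts = 0


def checkout_source():
    """Stand-in for the git clone and setup work shared by every cell."""
    global checkouts
    checkouts += 1
    return "/tmp/workspace"  # hypothetical path


def plan_matrix():
    workspace = checkout_source()  # done once, not once per combination
    return [
        {"workspace": workspace, "runtime": r, "database": d, "deps": v}
        for r, d, v in product(RUNTIMES, DATABASES, DEPENDENCIES)
    ]


jobs = plan_matrix()
print(len(jobs), checkouts)  # prints 24 1
```

The matrix has 4 x 3 x 2 = 24 cells, so sharing the setup avoids 23 of the 24 clones.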

# Important use cases

These are some of the important use cases for the CI system, and how
we plan CI to implement them.

## Normal change to an individual component

* a developer pushes a change to one program that runs in production

* the change is independent of other changes and no other component
  depends on the change

* e.g., bug fix, not a feature change

* commit stage and acceptance stage passing, plus a positive code
  review, is enough to deploy this to production

* developer pushes change, this triggers commit and acceptance stages,
  which pass, which triggers code review requests to be sent to
  reviewers

* reviewers vote +2, which triggers a deployment to production

* this is the simplest possible use case for CI
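
The flow of this use case can be sketched as a small state machine in Python. The state and event names are made up for this illustration and do not correspond to Gerrit or to any of the candidate CI systems.

```python
# Allowed transitions for a simple change; all names are illustrative only.
TRANSITIONS = {
    ("pushed", "commit_passed"): "acceptance_running",
    ("acceptance_running", "acceptance_passed"): "in_review",
    ("in_review", "review_plus_two"): "deployed",
}
FAILURES = {"commit_failed", "acceptance_failed"}


def advance(state, event):
    """Return the next state, or raise if the event is out of order."""
    if event in FAILURES:
        return "rejected"
    if (state, event) not in TRANSITIONS:
        raise ValueError(f"event {event!r} is not valid in state {state!r}")
    return TRANSITIONS[(state, event)]


def run(events):
    state = "pushed"
    for event in events:
        state = advance(state, event)
    return state


print(run(["commit_passed", "acceptance_passed", "review_plus_two"]))
# prints deployed
```

A +2 vote that arrives before both stages have passed raises an error, which reflects the rule that review only starts once the automated stages succeed.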

## Interdependent changes

* changes to two or more components that must all be applied at once
  or not at all, e.g., to mediawiki core and an extension

## Security embargoed change

* change can't be public until it's deployed or manually made public

# Design of specific aspects

## Log storage

## Artifact storage

* need to store arbitrary blobs for some time

* longer time for anything that gets deployed to production, shorter
  for everything else?

* de-duplicate to save on space?

* can these be publicly accessible? sometimes not?

* artifact storage must be secure, as everything that gets deployed to
  production goes via it
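
Two of the ideas above, retention that depends on whether an artifact was deployed and de-duplication of identical blobs, can be sketched in Python as follows. The retention periods, artifact type names, and the `BlobStore` class are assumptions chosen for illustration; actual values are a policy decision that this document does not make.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical retention periods, keyed by (type, deployed to production).
RETENTION = {
    ("docker-image", True): timedelta(days=365),
    ("docker-image", False): timedelta(days=14),
    ("build-log", True): timedelta(days=365),
    ("build-log", False): timedelta(days=30),
}


def expires_at(artifact_type, deployed_to_production, built_at):
    """Return the time after which the artifact may be deleted."""
    return built_at + RETENTION[(artifact_type, deployed_to_production)]


class BlobStore:
    """Content-addressed storage: identical blobs are stored only once."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)
        return key


blobs = BlobStore()
first = blobs.put(b"same bytes")
second = blobs.put(b"same bytes")
print(first == second, len(blobs.blobs))  # prints True 1
```

Addressing blobs by their hash also gives a cheap integrity check: a blob whose content no longer matches its key has been corrupted or tampered with.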

## Credentials management

* what are the requirements and use cases here?

* deployment to K8s vs to bare metal servers?

## Interdependent changes to multiple components

* For example, change to MediaWiki core and an extension so that they
  depend on each other.

* Isn't this bad practice? Would this be better done using feature
  flags or feature detection so that code changes can be merged
  independently, and only enabled in environments where all changes
  are present?

* Otherwise, CI needs to support, either automatically or manually,
  merging all repos at once or none.
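
The "all repos at once or none" option can be sketched in Python like this. The function `merge_group` and its callback parameters are hypothetical; the sketch only shows the ordering (test the whole group together, then merge) and deliberately ignores what happens if a merge fails part-way, which a real implementation would have to handle.

```python
def merge_group(changes, test_together, merge):
    """Test all changes as one unit, then merge all of them or none."""
    if not test_together(changes):
        return []  # nothing is merged if the combined test fails
    for change in changes:
        merge(change)
    return list(changes)


group = ["core-change", "extension-change"]  # illustrative change names

merged_log = []
print(merge_group(group, test_together=lambda cs: True, merge=merged_log.append))
# prints ['core-change', 'extension-change']

blocked_log = []
print(merge_group(group, test_together=lambda cs: False, merge=blocked_log.append))
# prints []
```

The changes are tested together rather than one by one, since by definition each of them may fail without the others.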

# The (default?) pipeline

* CI will provide a default pipeline for all projects

  * divided into several stages

  * mandatory stages: commit, acceptance; other stages may be added to
    other projects as needed

  * the goal is that if commit + acceptance stages pass, the project
    has a candidate that can be deployed to production, unless the
    project is such that it needs (say) manual testing or other human
    decision for the production deployment decision

  * if commit or acceptance stage fails, there is no production
    candidate

* commit stage

  * builds all the artifacts that will be used by later stages

  * all tests run in an isolated build tree, and may not use anything
    outside the tree, including databases or other backing services

  * runs unit tests

  * other tests, possibly integration tests

  * code health checks

  * is fast (aim at less than five minutes)

* acceptance stage

  * automated acceptance tests

  * deploy artifacts from commit stage, run tests against the deployed
    artifacts

  * possibly run slow tests from the build tree as well, if they don't
    fit into the commit stage's time budget

* capacity tests

* manual (exploratory) tests

  * testers will be able to deploy any candidate to a test environment,
    with the push of a button; the test env is very like production,
    except for capacity and possibly data
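
The rule that a change becomes a production candidate only when both mandatory stages pass can be sketched in Python as below. The `run_pipeline` function and the use of plain callables as stages are assumptions for illustration, not a description of how any of the candidate CI systems model pipelines.

```python
MANDATORY_STAGES = ("commit", "acceptance")


def run_pipeline(stages):
    """Run stages in order, stopping at the first failure.

    `stages` is a list of (name, callable) pairs; each callable returns
    True on success. Returns (is_candidate, results).
    """
    results = {}
    for name, stage in stages:
        results[name] = bool(stage())
        if not results[name]:
            break  # later stages are skipped after a failure
    is_candidate = all(results.get(name, False) for name in MANDATORY_STAGES)
    return is_candidate, results


stages = [
    ("commit", lambda: True),
    ("acceptance", lambda: False),
    ("capacity", lambda: True),
]
print(run_pipeline(stages))
# prints (False, {'commit': True, 'acceptance': False})
```

Stopping at the first failure also serves the **FEEDBACK** requirement, since the developer hears about a broken commit stage without waiting for slower stages.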

# Architecture: CI in an ecosystem

* code review will be done in Gerrit or otherwise outside the CI
  pipeline

* the commit and acceptance stages are triggered as soon as a developer
  pushes changes to be reviewed; human reviews won't be requested
  until the two stages pass, as there's no point in spending human
  attention on things that are not going to be candidates for
  deployment to production

* other stages may run in parallel with code review, but if they fail
  they may nullify candidacy?

* deployments go to K8s, everything will run in containers

# Architecture: internals

# Acceptance criteria

* This chapter sketches some automated acceptance tests using a
  Gherkin/Cucumber-like pseudo code language.