What other approaches did we consider, and how do we deal with common testing problems?
If you have not read it yet, part 1 gives some good background to this post.
Obligatory: the views expressed in this blog are my own and do not constitute those of my employer (Ocient).
Note: this is not an exhaustive list of the ways to successfully implement CI.
There are two main issues with the approach described in my last post.
- You have to wait for however long CI takes before you can merge.
- Two conflicting changes can merge at the same time: each works by itself, but once merged together they fail. See diagram below.
While these conflicts do happen from time to time, they are rare and caught fast, so we have not put a lot of effort into solving this problem. However, as Ocient grows we expect to see more of them, and we might eventually move to a merge-train-like approach.
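To make the second issue concrete, here is a minimal sketch of two changes that each pass on their own but fail together. Everything in it is hypothetical: the `Repo` and `Change` types, the function names, and the `ci_passes` check are stand-ins for a real codebase and a real CI run, not anything from our system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    defined: frozenset  # names of functions that exist
    calls: frozenset    # (caller, callee) pairs


@dataclass(frozen=True)
class Change:
    add_defined: frozenset = frozenset()
    remove_defined: frozenset = frozenset()
    add_calls: frozenset = frozenset()
    remove_calls: frozenset = frozenset()


def apply(repo: Repo, change: Change) -> Repo:
    # Apply a change the way a clean textual merge would: it only touches
    # the lines the change itself knew about.
    return Repo(
        defined=(repo.defined - change.remove_defined) | change.add_defined,
        calls=(repo.calls - change.remove_calls) | change.add_calls,
    )


def ci_passes(repo: Repo) -> bool:
    # Stand-in for a CI run: every called function must still exist.
    return all(callee in repo.defined for _, callee in repo.calls)


main = Repo(
    defined=frozenset({"load", "parse"}),
    calls=frozenset({("main", "load"), ("main", "parse")}),
)

# Change A renames parse to parse_v2 and updates the one caller it knows of.
change_a = Change(
    add_defined=frozenset({"parse_v2"}),
    remove_defined=frozenset({"parse"}),
    add_calls=frozenset({("main", "parse_v2")}),
    remove_calls=frozenset({("main", "parse")}),
)

# Change B adds a new function that calls the old name.
change_b = Change(
    add_defined=frozenset({"report"}),
    add_calls=frozenset({("report", "parse")}),
)

a_alone = ci_passes(apply(main, change_a))
b_alone = ci_passes(apply(main, change_b))
both = ci_passes(apply(apply(main, change_a), change_b))
```

Here `a_alone` and `b_alone` are both `True`, while `both` is `False`: neither change conflicts textually, but the combination calls a function that no longer exists.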
I am sure there is a more official name for this, but I think of it as "merge first, ask questions later." In this model every dev can merge to main pending some reviews and an extremely slim CI run. Once merged, CI is run, and if there are problems the merge is automatically reverted (or, if the test is found to be buggy, the test is disabled instead).
While this approach has its merits (and there are a lot of good blogs about it), it causes problems for us since our long-pole test time is about 1hr (and that is when there is no queuing). This leaves a big gap between when broken code could be pushed and when it could be reverted. Another dev can come along and rebase on the broken code during that time.
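A rough way to picture that gap is below. This is only a sketch: the `Merge` type and `exposed_rebases` helper are made up for illustration, and it assumes the revert lands the moment the post-merge CI run finishes, with no queuing.

```python
from dataclasses import dataclass


@dataclass
class Merge:
    name: str
    merged_at: int      # minutes since an arbitrary start time
    breaks_main: bool   # whether the post-merge CI run will fail


CI_MINUTES = 60  # roughly the long-pole test time mentioned above


def exposed_rebases(merges, rebase_times, ci_minutes=CI_MINUTES):
    """Return the rebase times that landed while main was broken."""
    # main is broken from the moment a bad merge lands until CI finishes
    # and the automatic revert goes in.
    windows = [
        (m.merged_at, m.merged_at + ci_minutes)
        for m in merges
        if m.breaks_main
    ]
    return [
        t for t in rebase_times
        if any(start <= t < end for start, end in windows)
    ]


merges = [Merge("a", 0, False), Merge("b", 30, True)]
at_risk = exposed_rebases(merges, rebase_times=[10, 45, 100])
```

With these numbers `at_risk` is `[45]`: the rebase at minute 45 picked up the broken merge from minute 30, which would not be reverted until minute 90.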
This can be fixed by using merge trains (see below).
Another issue is that at Ocient a developer will often push up code with no intention of merging it, just to get a CI run that confirms their work is correct.
This can be fixed by allowing a developer to deploy tests to Nomad (just like CI). While this is possible, most devs find it annoying since it ties up a terminal window and a repo while the tests run (again, something that could be fixed).
Merge trains are similar to merge first, ask questions later, except they solve the problem of ever having broken code in main. Here is a pretty good explanation of the idea. Note this is a GitLab-specific implementation, and the same thing can be accomplished without using GitLab.
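The core idea can be sketched in a few lines. This is a simplified model, not GitLab's implementation: `run_merge_train` and `example_ci` are hypothetical names, changes are just strings, and the queue is processed serially, whereas real implementations typically run the pipelines in parallel to save time.

```python
def run_merge_train(main, queue, ci_passes):
    """Land each queued change only if main plus that change passes CI."""
    merged, rejected = [], []
    for change in queue:
        # Test the change on top of everything that landed before it,
        # not on top of the main it was originally branched from.
        candidate = main + [change]
        if ci_passes(candidate):
            main = candidate
            merged.append(change)
        else:
            # The change is dropped from the train; main is never broken.
            rejected.append(change)
    return main, merged, rejected


def example_ci(changes):
    # Hypothetical CI: these two changes work alone but not together.
    conflict = {"rename_parse", "add_report_using_parse"}
    return not conflict <= set(changes)


final_main, merged, rejected = run_merge_train(
    [],
    ["rename_parse", "add_report_using_parse", "fix_typo"],
    example_ci,
)
```

In this run `final_main` ends up as `["rename_parse", "fix_typo"]` and `rejected` as `["add_report_using_parse"]`. The conflicting change is caught before it reaches main, at the cost of every change waiting for a CI run against the changes ahead of it.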
Another problem that plagues CI systems and testing in general is flaky tests. We reduce test flakiness with Nomad. We take all our existing tests and run them 100 times each, and measure their flakiness by how often they fail. Due to limited hardware, we stop just short of automating this process. However, if we suspect a test is flaky, we use this tool to determine how flaky it is.
If a test is determined to be too flaky, we disable it until the owner can fix it.
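The measurement itself is simple. The sketch below runs a test function in a local loop rather than deploying it to Nomad, and the helper names and the 5% threshold are assumptions for illustration only, not the values we use.

```python
def failure_rate(test, runs=100):
    """Run a test repeatedly and return the fraction of runs that failed."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            failures += 1
    return failures / runs


def should_disable(rate, threshold=0.05):
    # Hypothetical policy: disable anything failing more than 5% of the time.
    return rate > threshold


def make_flaky_test(fail_every):
    """Build a fake test that fails on every Nth call, for demonstration."""
    calls = {"n": 0}

    def test():
        calls["n"] += 1
        assert calls["n"] % fail_every != 0, "simulated intermittent failure"

    return test


flaky_rate = failure_rate(make_flaky_test(fail_every=10))
stable_rate = failure_rate(lambda: None)
```

Here `flaky_rate` is `0.1` and `stable_rate` is `0.0`, so only the first test would be disabled under the example threshold. With 100 runs, a test that fails less than about 1% of the time can easily show zero failures, so a clean result does not prove a test is stable.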
This is something we have just started doing and I will update this post as we find the most effective ways to deal with the problem.
Ocient is hiring for all sorts of roles across development. If you are interested in working on build systems or any other aspect of a distributed database, apply and drop me an email to let me know you applied.