This is the first part of a multi-part series which is a deep dive into Ocient’s Bazel-based build system. This post covers the motivation behind transitioning to Bazel and how we manage external dependencies.
Obligatory: the views expressed in this blog are my own and do not represent those of my employer (Ocient).
During this project I read a ton of blogs about Bazel. It seemed like the cure to all ailments. Hermetic and faster builds? Sign me up! Everything anyone needed to know about Bazel was already out there. What could I possibly add?
What I quickly found is that many of these blogs used contrived examples to show "how cool Bazel is" by building a hello-world-like program. Either that, or the authors worked at organizations with strict standards on C++ compilation time that never pushed the C++ compiler to its limits (more on this later). Ocient was in race mode when significant portions of our codebase were written, which means we have some… how do I put this gently… "legacy code". We needed a flexible build system that could scale well and handle some of our very specific build requirements (more on this later).
My goal in writing this series is to describe a "real-life" switch to Bazel on a project with many external dependencies, an extremely nuanced build system, and very custom solutions for building C++ code. I hope to speak fairly technically about the work we did to make everything build, so you can use these design patterns and ideas to facilitate your own switch to Bazel, speed up your Bazel builds, or just get inspiration.
Here are some stats about the Ocient codebase.
Over 3,000,000 lines of C++ code
Just under 100,000 lines of Python code
Over 70,000 commits authored by more than 100 people
Before this project started, our build infrastructure was Make based with a distcc backend. A developer would run a command like `make all -j 100`, which meant "build everything and run 100 compilation actions at the same time (remotely, if slots remain)".
Our project was divided into 20+ libraries, where each library contained related code and unit tests. The dependencies of the main binary target ("rolehostd") are shown below. Do not worry too much about the names of the libraries; the point of this graph is to illustrate that we have a sizeable list of internal dependencies.
We had three poorly named build types as well as a variety of subtypes.
RELEASE builds are the most aptly named. They have all the optimizations (`-O2` and others) we support, and all the asserts are compiled out. To get a sense of how many optimizations we have, I have attached a list of some of our largest object files. Our top three object file sizes in this mode are 1264MB, 1177MB, and 1173MB. With `-j36` it took over 3 hours to build.
TEST_RELEASE builds are the same as RELEASE builds, but with asserts turned on and a few of our most egregious inlining optimizations turned off. These are the builds we use for most of CI, since they provide a good tradeoff between program execution speed, compile time, and correctness.
DEBUG builds have no optimizations (`-O0`; note we are investigating the use of `-Og`). In addition, these builds use shared objects for each library for faster linking. More on this later. These builds also include an address sanitizer and a memory sanitizer. This is the default build type.
Here are some examples of a few subtypes of builds as well. I am not including all of them for brevity.
NO_SHARED is only used with DEBUG builds; it signifies to build without the address sanitizer or memory sanitizer and to statically link all libraries rather than building shared objects. Great naming!
NO_DISTCC means do not use our distributed build system; instead, run locally.
DEBUG_CONTAINERS means to build with debug containers that contain asserts to check for proper usage. This found extremely serious bugs in our code, and I would highly recommend using it in your organization.
Putting this all together, a sample command might look like `make all DEBUG_CONTAINERS=y`, which means "make all with DEBUG_CONTAINERS and use a DEBUG build".
For those familiar with Make, you might be wondering why I am specifying `all` rather than a specific target (e.g. `binary-target`). Without going into too much detail, it is because the impact of the other things included in `all` is negligible. The main thing we build is called "rolehostd", but for the purposes of this post I am just going to call it our "binary target". In addition, each library also produces a gtest binary that runs all the unit tests in that library.
I'm sure people are going to read this and say "why don't you do X instead of Y?" However, with each one of those changes the scope can creep, and the criteria to call this project done grow and grow. It was important to have as minimal an impact outside of the build system as possible. Many of these hard requirements crept up on Ocient. For example, we would never test building with a different version of gcc; lo and behold, once we are no longer continuously testing something, we eventually lose compatibility. Some of those hard requirements:
- We strip debug symbols from many object files because our binary is so large it would otherwise fail to link. Sections need to be within 2GB of each other since the offsets are only 32 bits. This is worthy of a post by itself.
- We use linker scripts to move sections around in the binary, since our section addresses can conflict with the memory regions the address sanitizer reserves.
- We build with -mcmodel=large. This is also worthy of a post by itself.
- We require gcc 8.2 with a patch, and we are using features from C++20. Downgrading or upgrading gcc would be a non-trivial amount of work.
- DEBUG builds required creating a shared object for each library.
- We use gdb 10.
- Our developer containers run Ubuntu 16.04.
So hopefully by now you have some background. Now let's talk about why there were problems. At the time this project started we had about 44 employees in development roles (not counting interns); by the time the project was finished we had about 55 (an increase of 25%). We were already seeing the build system show its age. Frequently, during times of high demand, compile actions would time out on remote nodes and then have to be run locally. Since our hardware is shared, this had a hugely negative impact on our CI nodes as well as on developer containers. For example, if a developer built while our distcc build farm was saturated, the other people on the shared machine would notice a significant slowdown and a significant increase in memory usage. To avoid saturating the distcc build farm, we lowered the concurrency of builds in CI, and it was not unheard of for CI to take longer than 8 hours.
Other than slowness, there were other problems with Make/distcc.
- Distcc offered no support for debug symbol fissioning. This was the primary driver of the upgrade, because we needed a way to keep our massive amounts of debugging information.
- No artifact cache could be shared between developers. When we tried with ccache, we found bugs at scale.
- Little in the way of elasticity, which meant resources were often underused.
- Semi-frequent build errors caused by stale build artifacts sitting around. Often we would have to tell people to blow away all their build artifacts, and like magic the build would just start to work, but this wasted a lot of their time.
With all that being said it was time to find a new solution.
Besides solving the problems with Make/distcc, Bazel comes with some other goodies.
- Support for debug symbol fissioning
- Remote execution and cache (for compilation and test actions)
- Better resource utilization (will get into this in a later post)
- Build sandbox
- Excellent built-in profiling
- Potential for auto-scaling based on demand
- Linking could be done on the build farm
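To give a concrete sense of how some of these goodies get switched on, here is a minimal .bazelrc sketch. The flag names are real Bazel flags, but the endpoints and values are illustrative assumptions, not our production configuration.

```
# .bazelrc sketch -- endpoints and values are illustrative, not Ocient's real config.
build --fission=yes                                # split debug info into .dwo files
build --remote_cache=grpc://cache.example.com:9092 # shared artifact cache
build --remote_executor=grpc://exec.example.com:8980
build --jobs=200                                   # fan actions out to the build farm
```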
In our Make based system there was a bootstrapping step when building. First, a new developer would have to build our toolchain, a 2-3 hour process that they had to go through on their first day (boring!). Then, once the toolchain was built, they could use the tools in it to actually build our system. The real toolchain pain came when something in the toolchain was updated. Every single developer would have to remake their toolchain, and depending on the change, a developer might have to blow away all existing object files and archives. Many, many times developers would skip that step and hit nightmarish linker bugs because a stale object was being linked with an incompatible one. Our toolchain consisted of two things: the tools to build Ocient (think g++), and some external dependencies, some of which were linked and some of which were not (think curl and gdb). These external dependencies were versioned by git sha and stored on our internal servers. Additional external dependencies were kept outside of the toolchain, locked to a specific commit hash (you know… to keep everything simple /s). Before we made any progress on Bazel we had to port our toolchain and external dependencies to Bazel.
We noted that every developer was building the same set of tools. We could build this once and have every developer pull it down. Building the toolchain could be an automated job that runs whenever the toolchain changes. We used Make for this since it allowed us to reuse some of the existing toolchain code. We spent the time to migrate all our toolchain and external dependencies to our Bazel toolchain. The new process was to just run an rclone command, and a new developer was ready to go.
This worked well until it didn't. We found we were frequently updating our external dependencies, and each time we did, we had to deploy a new toolchain while supporting the old one until no one used it anymore. While this was great for the rest of the development organization, it was horrible for us: we saw anywhere from 10 minute to 8 hour iteration times for toolchain changes. In the worst case we would need to rebuild the toolchain and our binary only to discover that something else needed to change in the toolchain. This led us to the second approach.
This was the first point where I was disappointed with the Bazel docs and the literature on external dependencies. It appeared to me that the suggested way of dealing with external deps was to replace each project's build system with a `BUILD.bazel` and a `WORKSPACE` file. I assume this works great for organizations the size of Google, since they can afford to spend development effort on things like this. For Ocient, however, every second I spent rewriting a build system for an open source piece of software was a second not spent adding features to Ocient. This was made even more frustrating by the fact that these projects already have a working build system.
We have 47 external C++ dependencies (60+ including tools built)
Of those 7 are header-only (require no building)
In Bazel's default approach, we would need to maintain a build system for 40 external dependencies
What if we could build most of our external dependencies in Bazel, leaving only a select few things in our toolchain, while utilizing each external project's existing build system? This avoids the problem of frequent toolchain upgrades while still getting caching and hermetic builds. There is another good blog post detailing this here (scroll to the very bottom of the article). The solution uses rules_foreign_cc to wrap non-Bazel build systems in Bazel by listing the artifacts we expect them to produce. For headers output by libraries, you only need to specify a folder rather than every file in the folder (which is huge). This is the approach we are still using today, and the one I would recommend.
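For readers who have not set up rules_foreign_cc before, pulling it into a WORKSPACE looks roughly like the sketch below. The release tag is illustrative, and in practice you would pin a sha256 from the release notes.

```python
# WORKSPACE sketch -- the release tag is illustrative; pin a sha256 in practice.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "rules_foreign_cc",
    strip_prefix = "rules_foreign_cc-0.9.0",
    url = "https://github.com/bazelbuild/rules_foreign_cc/archive/refs/tags/0.9.0.tar.gz",
)

load("@rules_foreign_cc//foreign_cc:repositories.bzl", "rules_foreign_cc_dependencies")

# Registers the built-in toolchains (make, cmake, etc.) rules_foreign_cc needs.
rules_foreign_cc_dependencies()
```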
I'm not claiming we have the most complicated dependency setup, but I do think it is more complicated than the average Bazel blog post's, which is why I am going to share the code for some of our dependencies.
Let's start simple. This is a fairly popular C++ library known as TCLAP. It is very simple because it is a header-only library.
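Our actual snippet is not reproduced here, but a header-only dependency like TCLAP needs roughly this shape: an http_archive in the WORKSPACE that fetches the release tarball and attaches a small build file via build_file, where the build file just re-exports the headers. The file paths below are illustrative assumptions.

```python
# third_party/tclap.BUILD sketch -- attached to the fetched archive via build_file.
# Header-only: nothing to compile, just re-export the headers.
cc_library(
    name = "tclap",
    hdrs = glob(["include/tclap/*.h"]),
    includes = ["include"],  # lets dependents write #include <tclap/CmdLine.h>
    visibility = ["//visibility:public"],
)
```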
Let's kick it up a notch. This is another real-life example, for jsoncpp.
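The real rule is not reproduced here, but a rules_foreign_cc wrapper for a CMake-based project like jsoncpp looks roughly like this sketch. The repository name @jsoncpp_src, the cache entries, and the filegroup target are illustrative assumptions.

```python
# BUILD.bazel sketch wrapping jsoncpp's own CMake build via rules_foreign_cc.
load("@rules_foreign_cc//foreign_cc:defs.bzl", "cmake")

cmake(
    name = "jsoncpp",
    cache_entries = {
        "BUILD_SHARED_LIBS": "OFF",   # we want a static archive
        "JSONCPP_WITH_TESTS": "OFF",  # skip the dependency's own tests
    },
    lib_source = "@jsoncpp_src//:all_srcs",  # filegroup over the fetched sources
    out_static_libs = ["libjsoncpp.a"],      # the artifact we expect CMake to emit
    visibility = ["//visibility:public"],
)
```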
Another example, this time for DPDK. I skipped the changes to the WORKSPACE file since they are similar (except we have more patches for DPDK). Warning: it gets grosser!
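Again, our real rule is not reproduced here, but wrapping a make-based build like DPDK's with rules_foreign_cc follows the same "list the artifacts you expect" pattern. Everything below (the repo name, the build target, the library list) is an illustrative assumption.

```python
# BUILD.bazel sketch wrapping a make-based DPDK build via rules_foreign_cc.
load("@rules_foreign_cc//foreign_cc:defs.bzl", "make")

make(
    name = "dpdk",
    args = ["T=x86_64-native-linuxapp-gcc"],  # hypothetical DPDK build target
    lib_source = "@dpdk_src//:all_srcs",
    out_include_dir = "include",  # export the whole header directory, not each file
    out_static_libs = [
        "librte_eal.a",
        "librte_mbuf.a",
        # ...plus whichever other component archives the build produces
    ],
    visibility = ["//visibility:public"],
)
```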
That is it for part 1! If you have any feedback or see any mistakes, please drop me a line. The upcoming posts will cover the `BUILD.bazel` files we use to actually build our code, how we deal with protobuf dependencies, how we do remote execution, and any other topics people are interested in.
Ocient is hiring for all sorts of roles across development. If you are interested in working on build systems or any other aspect of a distributed database, apply and drop me an email to let me know you applied.