@Piotr Galar
We executed over 1,000,000 CI jobs for our clients in 2023! Is that a lot? I think you might already sense your favourite answer coming up shortly - IT DEPENDS! For a big corp, no, not really. For a popular project, sure, quite considerable. For a two-person team working exclusively with open-source orgs, hell yeah!
Or, at the very least, it’s a number significant enough to warrant looking back and reflecting on how our work affected developers’ day-to-day experience. Especially since it’s a story of how we made CI 4 times faster in the span of a year. Let’s dive in!
Improving CI at Scale Is Hard
In 2023, we worked closely with Protocol Labs, the company driving initiatives such as IPFS, libp2p, and Filecoin.
What’s truly unique about Protocol Labs is that it works 100% in the open on GitHub. We wholeheartedly support this approach. It’s also a very exciting challenge for us to design and maintain CI that works across dozens of organizations, thousands of repositories, and thousands of internal and external contributors.
Look at the Data
Contrary to popular belief, the first step doesn’t always have to be the hardest. What do you do when faced with uncertainty? Collect data! That’s exactly what we did.
We created a GitHub Monitoring Dashboard to let us peek at what’s going on inside the GitHub orgs we were working with. We collected GitHub Actions activity data by subscribing to GitHub event webhooks and storing the payloads as-is in an SQL database.
This, combined with a sleek Grafana dashboard, gave us insights into the actual load we were dealing with and allowed us to reason about the experience the developers were having with the existing systems.
The data undoubtedly confirmed what we had found in our qualitative research. We would frequently talk to developers to gain personal insights into their experience. That’s the only way to get a full picture of the situation and propose improvements that truly change people’s day-to-day work. In this case, we found that the core developers felt there was no point whatsoever in waiting for the CI, which meant their work often stretched over many days or even weeks.
They would work on their change, push it to GitHub, create a pull request (PR), and then... nothing! For the next 20 minutes! That was just the wait for a machine, before execution time even entered the picture, and what if, on top of all that, the workflow they were waiting for consisted of more than one sequential job? That’s totally crazy and unacceptable. It’s no wonder they would often only come back to GitHub the next day to check on it. CI should, at the very least, be predictable. The under-10-second machine pick-up times we achieved by the end of the year are much more like it.
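To put that compounding in context, here’s a minimal, hypothetical workflow sketch (the job names and commands are made up for illustration): with jobs chained via `needs`, every job in the chain pays its own queue wait before it even starts running.

```yaml
name: sequential-example
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build   # hypothetical build step
  test:
    needs: build          # waits for build to finish...
    runs-on: ubuntu-latest
    steps:                # ...then queues for a runner all over again
      - uses: actions/checkout@v4
      - run: make test    # hypothetical test step
```

With a 20-minute queue, a two-job chain like this means 40 minutes of pure waiting before a single result shows up on the PR.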
The Developer Can Only Wait for So Long
The interviews with developers enabled us to establish the main pain point. The GitHub Monitoring Dashboard helped us put the magnitude of that pain in perspective with X-ray-like precision. Finally, our expertise allowed us to issue a diagnosis and propose a solution.
Protocol Labs was consistently operating over GitHub’s hosted runner limits. When faced with such an issue, the natural first option would be to raise the limits by throwing money at the problem. However, considering the number of orgs and collaborators involved, that would never have been feasible. Instead, we proposed introducing Custom GitHub Runners into the mix.
We started running self-hosted runners on AWS using the solution developed by Philips Labs. We extended it with a custom request router to support many runner types varying in instance type, OS, and machine resources. To ensure we could use them securely in the open, we configured the runners to operate in ephemeral mode, i.e. each runner would never execute more than one job during its lifespan. Finally, we wanted our custom runner types to become drop-in replacements for the hosted runners over time.
Thanks to these choices and the fact that all it took was a single-line diff to switch to our Custom GitHub Runners infrastructure, the self-hosted runners quickly started accounting for almost a third of all jobs executed across the orgs we worked with.
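In practice, that switch looked roughly like the sketch below: the only change in a job definition is the `runs-on` line. The labels are hypothetical placeholders, since the real ones depended on the runner types we provisioned.

```yaml
on: pull_request

jobs:
  test:
    # before: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64, 4xlarge]   # illustrative custom runner labels
    steps:
      - uses: actions/checkout@v4
      - run: make test   # unchanged; only the runs-on line differs
```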
For contributors, it meant that double-digit-minute job waiting times were a thing of the past. For us, now that we had transitioned from insane to reasonable-ish, it was time to look beyond the availability problem and find other areas for improvement. After all, we do strive for excellence.
There Are Many Ways to Speed Things Up
With “existential” issues out of the way, we could focus on optimisations. There are many ways in which one can tackle making CI more friendly for developers. I’m going to give you a quick overview of some of the techniques we successfully employed throughout 2023, and the impact they had on the developer experience.
Parallelization, i.e. Two for the Price of One
When I was younger, my mom would often tell me to finish what I was doing before starting the next activity. And while it is great advice when you want to start playing with your LEGOs while all your art supplies are still scattered around the floor, it doesn’t quite apply to computers.
Computers are way better at executing tasks concurrently than humans (or at the very least better than the strictly sequential yours truly). With a virtually unlimited fleet of self-hosted runners at hand, we decided to parallelise whatever we could.
There’s one example I’m particularly proud of. In rust-libp2p, we took a strictly sequential build process and broke it up into more than 60 parallel jobs. This drove CI feedback times on PRs down from hours at the beginning of the year to under 10 minutes at the end. That’s the difference between contributors avoiding making changes at all and a healthy, collaborative environment.
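This isn’t the actual rust-libp2p workflow, but a minimal sketch of the general technique: a matrix strategy fans a single job out into many independent ones, one per crate in this made-up example, and they all queue and run in parallel.

```yaml
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let the other jobs finish even if one crate fails
      matrix:
        crate: [core, swarm, tcp-transport, quic-transport]  # hypothetical crate list
    steps:
      - uses: actions/checkout@v4
      - run: cargo test -p ${{ matrix.crate }}   # test just this crate
```

Since every matrix entry grabs its own runner, the wall-clock time approaches that of the slowest job rather than the sum of all of them.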
Communication Is Key
What good is a check on a PR if you have no idea what it represents? Worse yet, what if the check is lying to you?! That’s a very common sentiment we encounter in our line of work, and the organisations belonging to Protocol Labs were no exception.
Flaky tests are, more often than not, the main offender here. They can seriously undermine developers’ trust in the CI. Luckily, we know some ways to tackle that problem! They range from being diligent in the face of flakiness, through careful reporting on offending tests, all the way to automated quarantines (or even fixes).
In 2023, we faced this problem time and time again, but one encounter definitely stood out. In Kubo, we dealt with a huge test suite written entirely in Bash. It was fundamental to the project, but even the most experienced engineers would often be fazed by it and suggest “fixing it” via reruns.
That wasn’t good enough for us. That’s why, throughout the year, we first improved the reporting from the test suite by generating interactive HTML test reports with a little help from Ant, and finally, we migrated the project entirely from the Bash-based test framework to the Gateway Conformance Framework. This helped us push the workflow success rate past 90% in the second half of the year.
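As a rough illustration of the reporting half (not the exact Kubo setup), the pattern is to always turn raw test output into a report and publish it from the workflow, even when the job fails, so developers can see exactly which test misbehaved instead of rerunning blindly. The paths and commands below are hypothetical.

```yaml
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test            # hypothetical entrypoint writing results to test-results/
      - name: Publish test report
        if: always()              # upload even when tests fail, so flaky runs stay inspectable
        uses: actions/upload-artifact@v4
        with:
          name: test-report
          path: test-results/
```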
Do Not Leave Anyone Behind
At this point, you might be wondering how these changes even show up on high-level graphs encompassing data for dozens of orgs, thousands of repositories, and millions of jobs.
Being sensible engineers, we started our work in the places where developers spend the majority of their time: huge repositories with plenty of contributors, like Kubo, rust-libp2p, or go-libp2p. This allowed us to learn their favourite patterns, familiar workflows, and common processes while maximising our impact. Finally, equipped with invaluable knowledge collected in practice, we were ready to SCALE!
We redesigned, developed and deployed Unified GitHub Workflows across all the orgs we were working with. The idea was to centralise the management and distribution of common workflows, such as building and testing Go packages or linting TypeScript projects, in order to optimise them once and reap the benefits everywhere.
It also created a point of entry for us through which we could automate many more chores that developers shouldn’t waste their precious time on. It made tasks like upgrading the Go or Node version completely disappear from developers’ backlogs forever.
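Under the hood, this relies on GitHub’s reusable workflows: a consuming repository only needs a thin caller like the sketch below, while the actual logic lives in one centrally maintained file. The repository path, version tag, and input name here are illustrative, not the real ones.

```yaml
name: Go Test
on: [push, pull_request]

jobs:
  go-test:
    # delegate to a centrally maintained, versioned workflow
    uses: example-org/unified-github-workflows/.github/workflows/go-test.yml@v1.0
    with:
      go-version: "1.21"   # illustrative input
```

Optimising the shared workflow, or bumping something like the Go version, then happens in one place, and rolling it out to every consumer can be automated instead of hand-editing hundreds of workflow files.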
Can We Do 8x in 2024?
This is but the tip of the iceberg of what we’ve been up to. Caching, right-sizing, virtualisation, reliability, flow state upkeep... There are so many more amazing initiatives that I’d love to share with you. And so many more that we didn’t get to last year but that we’re looking forward to with passion, like better high-level lenses for looking at our data or comprehensive surveying. But if you have time to check out one thing, definitely have a look at the work we’ve done with our GitHub Monitoring Dashboard - it’s a real eye-opener.
DevEx is an amazing field of research and development, and I utterly love it. Helping developers, improving their workflows, and taking away frustrations and pain are why we do it. That’s also why, looking back at 2023, I cannot help but feel proud of our impact.
We completely changed the quality of work life for thousands of internal and external contributors working across thousands of open-source repositories spanning dozens of organisations - by making the CI 4x faster.
So thank you, 2023, and here’s to 2024 - may your CI always succeed in an instant 🍾
If you need any help making this wish come true, give us a call ☎️