@Piotr Galar
We executed over 1,000,000 CI jobs for our clients in 2023! Is that a lot? I think you might already sense your favourite answer coming up shortly - IT DEPENDS! For a big corp, no, not really. For a popular project, sure, quite considerable. For a two-person team working exclusively with open-source orgs, hell yeah!
Or, at the very least, it’s a number significant enough to warrant looking back and reflecting on how our work affected developers’ day-to-day experience. Especially since it’s a story of how we made CI 4 times faster in the span of a year. Let’s dive in!
Improving CI at Scale Is Hard
In 2023, we worked closely with Protocol Labs, the company driving initiatives such as IPFS, libp2p, and Filecoin.
What’s truly unique about Protocol Labs is that it works 100% in the open on GitHub. We wholeheartedly support this approach. It’s also a very exciting challenge for us to design and maintain CI that works across dozens of organizations, thousands of repositories, and thousands of internal and external contributors.
Look at the Data
Contrary to popular belief, the first step doesn’t always have to be the hardest. What do you do when faced with uncertainty? Collect data! That’s exactly what we did.
We created a GitHub Monitoring Dashboard to let us peek at what’s going on inside the GitHub orgs we were working with. We collected GitHub Actions activity data by subscribing to GitHub event webhooks and storing the payloads as-is in an SQL database.
This, combined with a sleek Grafana dashboard, gave us insights into the actual load we were dealing with and allowed us to reason about the experience the developers were having with the existing systems.
The data undoubtedly confirmed what we had found in our qualitative research. We would frequently talk to developers to gain personal insights into their experience. That’s the only way to get a full picture of the situation and propose improvements that truly change people’s day-to-day work. In this case, we found that the core developers felt there was no point whatsoever in waiting for the CI, which meant their work often stretched over many days or even weeks.
They would work on their change, push it to GitHub, create a pull request (PR), and then... nothing! For the next 20 minutes! That was just the wait for a machine, before execution time even entered the picture, and what if, on top of all that, the workflow they were waiting for consisted of more than one sequential job? That’s totally crazy and unacceptable. It’s no wonder they would often only come back to GitHub the next day to check on it. CI should, at the very least, be predictable. The under-10-second machine pick-up times we achieved by the end of the year are much more like it.
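To put that compounding in context, here’s a minimal, hypothetical workflow sketch (the job names and commands are made up for illustration): with jobs chained via `needs`, every job in the chain pays its own queue wait before it even starts running.

```yaml
name: sequential-example
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build   # hypothetical build step
  test:
    needs: build          # waits for build to finish...
    runs-on: ubuntu-latest
    steps:                # ...then queues for a runner all over again
      - uses: actions/checkout@v4
      - run: make test    # hypothetical test step
```

With a 20-minute queue, a two-job chain like this means 40 minutes of pure waiting before a single result shows up on the PR.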
The Developer Can Only Wait for So Long
The interviews with developers enabled us to establish the main pain point. The GitHub Monitoring Dashboard helped us put the magnitude of that pain in perspective with X-ray-like precision. Finally, our expertise allowed us to issue a diagnosis and propose a solution.
Protocol Labs was consistently operating over GitHub’s hosted runner limits. When faced with such an issue, the natural first option would be to raise the limits by throwing money at the problem. However, considering the number of orgs and collaborators involved, that would never have been feasible. Instead, we proposed introducing Custom GitHub Runners into the mix.
We started running self-hosted runners on AWS using the solution developed by Philips Labs. We extended it with a custom request router to support many runner types varying in instance type, OS, and machine resources. To ensure we could use them securely in the open, we configured the runners to operate in ephemeral mode, i.e. each runner would never execute more than one job during its lifespan. Finally, we wanted our custom runner types to become drop-in replacements for the hosted runners over time.
Thanks to these choices and the fact that all it took was a single-line diff to switch to our Custom GitHub Runners infrastructure, the self-hosted runners quickly started accounting for almost a third of all jobs executed across the orgs we worked with.
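In practice, that switch looked roughly like the sketch below: the only change in a job definition is the `runs-on` line. The labels are hypothetical placeholders, since the real ones depended on the runner types we provisioned.

```yaml
on: pull_request

jobs:
  test:
    # before: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64, 4xlarge]   # illustrative custom runner labels
    steps:
      - uses: actions/checkout@v4
      - run: make test   # unchanged; only the runs-on line differs
```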
For contributors, it meant that double-digit-minute job waiting times were a thing of the past. For us, now that we had transitioned from insane to reasonable-ish, it was time to look beyond the availability problem and find other areas for improvement. After all, we do strive for excellence.
There Are Many Ways to Speed Things Up
With “existential” issues out of the way, we could focus on optimisations. There are many ways in which one can tackle making CI more friendly for developers. I’m going to give you a quick overview of some of the techniques we successfully employed throughout 2023, and the impact they had on the developer experience.
Parallelization, i.e. Two for the Price of One
When I was younger, my mom would often tell me to finish what I was doing before starting the next activity. And while it is great advice when you want to start playing with your LEGOs while all your art supplies are still scattered around the floor, it doesn’t quite apply to computers.
Computers are way better at executing tasks concurrently than humans (or at the very least better than the strictly sequential yours truly). With a virtually unlimited fleet of self-hosted runners at hand, we decided to parallelise whatever we could.
There’s one example I’m particularly proud of. In rust-libp2p, we took a strictly sequential build process and broke it up into more than 60 parallel jobs. This drove CI feedback times on PRs down from hours at the beginning of the year to under 10 minutes at the end. That’s the difference between contributors avoiding making changes at all and a healthy, collaborative environment.
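This isn’t the actual rust-libp2p workflow, but a minimal sketch of the general technique: a matrix strategy fans a single job out into many independent ones, one per crate in this made-up example, and they all queue and run in parallel.

```yaml
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let the other jobs finish even if one crate fails
      matrix:
        crate: [core, swarm, tcp-transport, quic-transport]  # hypothetical crate list
    steps:
      - uses: actions/checkout@v4
      - run: cargo test -p ${{ matrix.crate }}   # test just this crate
```

Since every matrix entry grabs its own runner, the wall-clock time approaches that of the slowest job rather than the sum of all of them.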
Communication Is Key
What good is a check on a PR if you have no idea what it represents? Worse yet, what if the check is lying to you?! That’s a very common sentiment we encounter in our line of work, and the organisations belonging to Protocol Labs were no exception.
Flaky tests are, more often than not, the main offender here. They can seriously undermine developers’ trust in the CI. Luckily, we know some ways to tackle that problem! They range from being diligent in the face of flakiness, through careful reporting on offending tests, all the way to automated quarantines (or even fixes).
In 2023, we faced this problem time and time again, but one encounter definitely stood out. In Kubo, we dealt with a huge test suite written entirely in Bash. It was fundamental to the project, but even the most experienced engineers would often be fazed by it and suggest “fixing it” via reruns.
That wasn’t good enough for us. That’s why, throughout the year, we first improved the reporting from the test suite by generating interactive HTML test reports with a little help from Ant, and finally, we migrated the project entirely from the Bash-based test framework to the Gateway Conformance Framework. This helped us push the workflow success rate past 90% in the second half of the year.
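As a rough illustration of the reporting half (not the exact Kubo setup), the pattern is to always turn raw test output into a report and publish it from the workflow, even when the job fails, so developers can see exactly which test misbehaved instead of rerunning blindly. The paths and commands below are hypothetical.

```yaml
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test            # hypothetical entrypoint writing results to test-results/
      - name: Publish test report
        if: always()              # upload even when tests fail, so flaky runs stay inspectable
        uses: actions/upload-artifact@v4
        with:
          name: test-report
          path: test-results/
```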
Do Not Leave Anyone Behind
At this point, you might be wondering how these changes even show up on high-level graphs encompassing data for dozens of orgs, thousands of repositories, and millions of jobs.
Being sensible engineers, we started our work in the places where developers spend the majority of their time: huge repositories with plenty of contributors, like Kubo, rust-libp2p, or go-libp2p. This allowed us to learn their favourite patterns, familiar workflows, and common processes while maximising our impact. Finally, equipped with invaluable knowledge collected in practice, we were ready to SCALE!
We redesigned, developed and deployed Unified GitHub Workflows across all the orgs we were working with. The idea was to centralise the management and distribution of common workflows, such as building and testing Go packages or linting TypeScript projects, in order to optimise them once and reap the benefits everywhere.
It also created a point of entry for us through which we could automate many more chores that developers shouldn’t waste their precious time on. It made tasks like upgrading the Go or Node version completely disappear from developers’ backlogs forever.
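Under the hood, this relies on GitHub’s reusable workflows: a consuming repository only needs a thin caller like the sketch below, while the actual logic lives in one centrally maintained file. The repository path, version tag, and input name here are illustrative, not the real ones.

```yaml
name: Go Test
on: [push, pull_request]

jobs:
  go-test:
    # delegate to a centrally maintained, versioned workflow
    uses: example-org/unified-github-workflows/.github/workflows/go-test.yml@v1.0
    with:
      go-version: "1.21"   # illustrative input
```

Optimising the shared workflow, or bumping something like the Go version, then happens in one place, and rolling it out to every consumer can be automated instead of hand-editing hundreds of workflow files.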
Can We Do 8x in 2024?
This is but the tip of the iceberg of what we’ve been up to. Caching, right-sizing, virtualisation, reliability, flow state upkeep... There are so many more amazing initiatives that I’d love to share with you. And so many more that we didn’t get to last year but that we’re looking forward to with passion, like better high-level lenses for looking at our data or comprehensive surveying. But if you have time to check out one thing, definitely have a look at the work we’ve done with our GitHub Monitoring Dashboard - it’s a real eye-opener.
DevEx is an amazing field of research and development, and I utterly love it. Helping developers, improving their workflows, and taking away frustrations and pain are why we do it. That’s also why, looking back at 2023, I cannot help but feel proud of our impact.
We completely changed the quality of work life for thousands of internal and external contributors working across thousands of open-source repositories spanning dozens of organisations - by making the CI 4x faster.
So thank you, 2023, and here’s to 2024 - may your CI always succeed in an instant 🍾
If you need any help making this wish come true, give us a call ☎️