Why Use a Monorepo?

January 15, 2018

A monorepo helps reduce the cost of software development. It does this in three different ways: by being simpler to use, by providing better discoverability, and by allowing atomicity of updates. Taking each of these in turn…

Simplicity

In the ideal world, all you’d need to do is clone your software repository, do a build, make an edit, put up a pull request, and then repeat the last three steps endlessly. Your CI system would watch the repository (possibly a directory or two within it), and kick off builds as necessary. Anything more is adding overhead and cost to the process.

That overhead starts being introduced when multiple separate codebases need to be coordinated in some way. Perhaps there’s a protocol definition file that needs to be shared by more than one project. Perhaps there’s utility code that’s shared between more than one project.

In many organisations developers may not have the ability to set up a repo on demand, so there’s a time and political cost in creating one. Then there’s the ongoing cost of maintaining them, backing them up, and so on. Especially if data is being duplicated between repositories, the aggregate total space used by these repos will also be larger.

Multiple repositories are not necessarily “simple”.

One straw-man solution to the problems of coordination is to copy all required dependencies into your own repo, but then we have a huge pile of duplicated code, which opens up the possibility of parallel but incompatible changes being made at the same time.

A better solution is to build binary artefacts that are stored in some central location, and grab those when required. Bad experiences with storing binaries in the VCS make many people shy of just checking in the artefacts, so this storage solution seems attractive. But the alternatives introduce complexity. Where previously we only had to worry about maintaining the uptime of the source control system, there’s now the additional cost of maintaining this binary datastore, and ensuring its uptime too. Worse, in order to preserve historical builds, the binary datastore needs to be immutable after a write. In my experience, rather than being a directory served using nginx or similar, people turn to commercial solutions even when free alternatives are available. The cost of building and running this infrastructure raises the total cost of development.
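
To make the “immutable after a write” requirement concrete, here’s a minimal Python sketch of a content-addressed, write-once artifact store; the storage root is made up, and a real deployment would sit behind nginx or similar. The point is that the code itself is trivial: the cost lies in operating the store and guaranteeing its uptime, not in writing it.

    import hashlib
    import shutil
    from pathlib import Path

    STORE = Path("/var/artifacts")  # hypothetical storage root

    def publish(artifact: Path) -> str:
        """Store an artifact under its content hash and refuse to
        overwrite, so historical builds can always be reproduced."""
        digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
        dest = STORE / digest[:2] / digest
        if dest.exists():
            return digest  # identical content was already published
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(artifact, dest)
        dest.chmod(0o444)  # write-once: read-only from here on
        return digest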

Another area where monorepos bring simplicity is when a package or library needs to be extracted from existing code. This process is simple in a monorepo: just create the new directories, possibly after asking permission from someone, and check in. Every other user receives that change with their next update, without needing to re-run tooling to ensure that their patchwork clients are up to date. Outside of a monorepo, the process can be more painful, especially if a new repository is needed for the freshly extracted code.

Identifying every place that is impacted by such a code change is also easy in a monorepo, even if you’re not using a graph-based build tool such as Bazel or Buck but doing something like “maven in a monorepo”. The graph-based build tools typically have a mechanism to query the build graph, but if the tree is in one place and you don’t have code-insight tools, then even “grep” can get you quite far.
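
To show how far “grep” can get you, here’s a small Python sketch that walks the tree looking for any file that mentions a symbol; the root path and symbol name are invented for illustration. Graph-based tools answer the same question precisely (Bazel’s query language has rdeps for exactly this), but with all the code in one tree even the crude version finds every affected project.

    import os

    def find_dependents(root: str, needle: str) -> list[str]:
        """Walk the whole tree and list every file that mentions a
        symbol. Crude, but when all the code lives in one tree this
        finds every affected project without extra tooling."""
        hits = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if not name.endswith((".java", ".py", ".proto")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if needle in f.read():
                        hits.append(path)
        return hits

    # e.g. everywhere a (hypothetical) shared message type is used:
    # find_dependents("/path/to/monorepo", "com.example.billing.Invoice")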

There are arguments about monorepos stressing source control software, requiring different tool chains, or not being compatible with CI systems. I addressed those concerns in an earlier post, but the TL;DR is “modern DVCS systems can cope with the large repos, you don’t need to change how you build code, and your CI pipelines can be left essentially ‘as is’.”

Discoverability

One of the ways that monorepos drive down the cost of software development is by reducing duplication of effort.

It’s a truism that the best code is the code that is never written. Every line of code that’s written imposes an ongoing cost of maintenance that needs to be paid until the code is retired from production (at the very earliest!). Fortunately, a good software engineer is a lazy software engineer — if they’re aware of a library or utility that can be used, they’ll use that.

In order to function properly, a monorepo needs to be structured to ease discoverability of reusable components, as covered in the post about organising a monorepo. One of the key supporting mechanisms is to separate the tree into functional areas. However, structuring a monorepo to aid discoverability does nothing to prevent “spaghetti dependencies” from appearing. What it does do is help surface those dependencies, which would exist in any case, without fancy additional tooling.

Naturally, a monorepo isn’t the only way of solving the problem of discovering code. Good code-insight tooling can fill the same role (go Kythe!), as can central directories where people can find the code repositories that house useful shared code. Even hearsay and guesswork can suffice; after all, the Java world has coped with Maven Central for an incredibly long time.

Discovering code has other benefits. As a concrete example, it becomes possible to accurately scope the size of refactorings to APIs within an organisation: simply traverse the graph of everything impacted by the change. What used to be a finger-in-the-air guess, or would require coordination across multiple repositories, becomes a far simpler exercise to measure. To actually perform the change? Well, there’s still politics to deal with. Nothing stops that.

Being able to identify all the locations that are impacted by any change makes CI infrastructure easier to write and maintain. After all, we use CI to answer the questions “is our code safe to deploy? And if not, why is it not safe?” In a monorepo, the graph of dependencies is easier to discover, and that graph can (and should!) be used to drive minimally-sized but provably correct builds, running all necessary builds and tests and not a single thing more. Needless to say, this means that less work is done by the CI servers, meaning tighter feedback loops and faster progress. Do you need a monorepo to build this graph? Of course not. Is building that infrastructure to replicate this something you’ve time to do? Probably not.
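
As a sketch of what “minimally-sized but provably correct” means, assuming you can extract the direct dependencies of each target from your build files (the target names below are invented): the set CI must rebuild is simply the reverse-transitive closure of whatever changed.

    from collections import defaultdict

    def impacted(changed, deps):
        """Return every target that must be rebuilt and retested: the
        changed targets plus everything that transitively depends on
        them. `deps` maps a target to its direct dependencies."""
        rdeps = defaultdict(set)  # invert: dependency -> dependents
        for target, dependencies in deps.items():
            for dep in dependencies:
                rdeps[dep].add(target)
        to_build = set(changed)
        frontier = list(changed)
        while frontier:
            for dependent in rdeps[frontier.pop()]:
                if dependent not in to_build:
                    to_build.add(dependent)
                    frontier.append(dependent)
        return to_build

    # Hypothetical graph: changing the shared proto rebuilds client and
    # server, while the unrelated docs tooling is left alone.
    graph = {
        "//proto:api": set(),
        "//server:app": {"//proto:api"},
        "//client:app": {"//proto:api"},
        "//tools:docs": set(),
    }
    assert impacted({"//proto:api"}, graph) == {
        "//proto:api", "//server:app", "//client:app"}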

There is also nothing about using a monorepo that precludes putting useful metadata into the tree at appropriate points. Individual parts of the tree can include license information (particularly when importing third party dependencies), or READMEs that provide human-readable information about the purpose of a directory or package, and where to go for help. However, the need for some of this metadata (“how do I get the dependencies?”, “what’s the purpose of this package?”) can be significantly reduced by structuring the monorepo in a meaningful way.

Atomicity

Occasionally there are components that need to be shared between different parts of the system. Examples include IDL files, protobuf definitions, and other items that can be used to generate code, or must exist as a shared component between client and server.

Now, there’s reams to be written about how to actually manage updating message definitions in a world where there might be more than one version of that protocol in the wild, and having a monorepo doesn’t prevent you from needing to follow those rules and suggestions. What a monorepo allows is a definitive answer to the question of where these shared items should be. Traditionally, the answer has been one of:

- in the repository housing the client code
- in the repository housing the server code
- copied into every repository that needs them

Needless to say, the last approach is remarkably painful, since all changes to the definitions need to be tracked across all repositories. In the first two cases, you may end up with unwanted dependencies on either client- or server-side code. So the sensible thing to do is to store the shared item in a separate repository. This will lead you to the horror of juggling multiple repositories, or, if you’re lucky, taking a dependency on a pre-built binary that someone else is responsible for building.

Interesting things happen when the shared item needs to be updated. Who is responsible for propagating the changes? Without a requirement to update, teams seldom update dependencies, so there’s out-of-band communication that needs to happen to enforce updates.

Using a monorepo resolves the problem. There’s one place to store the definition, everyone can depend on it as necessary, and updates happen atomically across the entire codebase (though it may take a long time for those changes to be reflected in production). The same logic applies to making small refactorings: the problem is easy to scope, and the work can be completed by an individual working alone.

Summary

Monorepos can reduce the cost of software development. They’re not a silver bullet, and they require an organisation to practice at least a minimal level of collective code ownership. The approach worked well at Google and Facebook because those companies fostered an attitude that the codebase was a shared resource, that anyone could contribute to and improve.

For a company which prevents people from viewing everything and having a global view of the source tree, for whatever reason (commercial? social? internally competing teams?), a monorepo is a non-starter. That’s a pity, because there are considerable cost savings to be made as more and more of an organisation shares a monorepo. It’s also possible to implement a monorepo where almost everything is visible to everyone, with selected pieces made available only as pre-compiled binaries or otherwise encrypted for most individuals.

Monorepos help reduce the cost of software over the lifetime of the code by simplifying the path to efficient CI, lowering the overhead of ensuring changes are propagated to dependent projects, and by reducing the effort required to extract new packages and components. As Will Robertson pointed out, they can also help reduce the cost of developing development support tooling by providing a single-point “API” to the VCS tool and the source code itself.

Complementary practices

Monorepos solve a whole host of problems, but, just as with any technical solution, there are tradeoffs to be made. Simply cargo-culting what Google, Facebook, or other public early adopters of the pattern have done won’t necessarily lead you to success. On the flip side of the coin, sticking with “business as usual” within a monorepo may not work either.

Although complex branching strategies might work in a monorepo, the sheer number of moving pieces means that the opportunity for merge conflicts increases dramatically. One of the practices that should be most strongly considered is adopting Trunk Based Development. This also suggests that developers work on short-lived feature branches, hiding work in progress behind feature flags.
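
A feature flag need be nothing more than a conditional around the new code path. Here’s a toy Python sketch, with a made-up flag store and function names, showing how unfinished work can live on trunk without being exercised:

    import os

    def flag_enabled(name: str) -> bool:
        """Toy flag store backed by environment variables; a real system
        would read from configuration or a flag service instead."""
        return os.environ.get(f"FLAG_{name.upper()}") == "1"

    def legacy_checkout(cart: list[float]) -> float:
        return sum(cart)

    def new_checkout(cart: list[float]) -> float:
        return sum(cart)  # stand-in for the half-finished implementation

    def checkout(cart: list[float]) -> float:
        if flag_enabled("new_checkout"):
            return new_checkout(cart)  # in-progress code ships dark...
        return legacy_checkout(cart)   # ...while users stay on this path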

Software development is a social activity. Merging many small commits without describing the logical change they make up leaves the shared resource of the repo’s logs harder to understand. This leads to a model that is less common than it used to be: squashing the individual steps that lead to a logical change into a single commit, which describes that logical change. This makes the commit log a useful resource too. Code review tools such as Phabricator help make this process simpler.

Most importantly: stop and think. It is unlikely your company is Google, Facebook, Twitter, Uber, or one of the other high-profile large companies that have already adopted monorepos (but if you’re reading from one of those places, “Hi!”). A monorepo makes a lot of sense, but simply aping the big beasts and hoping for the best won’t lead to happiness. Consider the advantages to your organisation for each step of the path towards a monorepo, and take those steps with your eyes open.

Thanks

Thank you to Nathan Fisher, Josh Graham, Paul Hammant, Felipe Lima, Dan North, Will Robertson, and Chris Stevenson for the suggestions and feedback while I was writing this post.
