Wednesday, 22 November 2017

Organising a Monorepo

How should a monorepo be organised? It only takes a moment to come up with many competing models, but the main ones to consider are “by language”, “by project”, “by functional area”, and “nix style”. Of course, it’s entirely possible to blend these approaches together. As an example, my preference is “primarily language-based, secondarily by functional area”, but perhaps I should explain the options.

Language-based monorepos
These repos contain a top-level directory per language. For languages that are typically organised into parallel test and source trees (I’m looking at you, Java) there might be two top-level directories.

Within the language specific tree, code is structured in a way that is unsurprising to “native speakers” of that language. For Java, that means a package structure based on fully-qualified domain names. For many other languages, it makes sense to have a directory per project or library.

Third party dependencies can either be stored within the language-specific directories, or in a separate top-level directory, segmented in the same language specific way.

This approach works well when there aren’t too many languages in play. Organisation standards, such as those present in Google, may limit the number of languages. Once the number of languages becomes too many, it becomes hard to determine where to start looking for the code you may depend on.

Project-based monorepos
One drawback with a language-based monorepo is that it’s increasingly common to use more than one language per project. Rather than spreading code across multiple locations, it’s nice to co-locate everything needed for a particular project in the same directory, with common code being stored “elsewhere”. In this model, therefore, there are multiple top-level directories representing each project.

The advantage with this approach is that creating a sparse checkout is incredibly simple: just clone the top-level directory that contains the project, et voila! Job done! It also makes removing dead code simple --- just delete the project directory once it’s no longer needed, and everything is gone. This same advantage means that it’s easy to export a cell as an Open Source project.

The disadvantage with project-based monorepos is that the top level can quickly become bloated as more and more projects are added. Worse, there's the question of what to do when projects are mostly retired, or have been refactored to mostly slivers of their former glory.

Functional area-based monorepos
A key advantage of monorepos is “discoverability”. It’s possible to organise a monorepo to enhance this, by grouping code into functional areas. For example, there might be a directory for “crypto” related code, another for “testing”, another for “networking” and so on. Now, when someone is looking for something they just need to consider the role it fulfills, and look at the tree to identify the target to depend on.

One way to make this approach fail miserably is to make extensive use of code names. “Loki” may seem like a cool project name (it’s not), but I’ll be damned if I can tell what it actually does without asking someone. Being developers, we need snazzy code names at all times, and by all means organise teams around those, but the output of those projects should be named in as a vanilla a way as possible: the output of “loki” may be a “man in the middle ssl proxy”, so stick that in “networking/ssl/proxy”. Your source tree should be painted beige --- the least exciting colour in the world.

Another problem with the functional area-based monorepos is that considerable thought has to be put into their initial structure. Moving code around is possible (and possible atomically), but as the repo grows larger the structure tends to ossify, and considerable social pressure needs to be overcome to make those changes.

Nix-style monorepos
Nix is damn cool, and offers many capabilities that are desirable for a monorepo being run in a low-discipline (or high-individuality) engineering environment, incapable of managing to keep to only using (close to a) single version of each dependency. Specifically, a nix-based monorepo actively supports multiple versions of dependencies, with projects depending on specific versions, and making this clear in their build files.

This differs from a regular monorepo with a few alternate versions of dependencies that are particularly taxing to get onto a single version (*cough* ICU *cough*) because multiple versions of things are actively encouraged, and dependencies need to be more actively managed.

There are serious maintainability concerns when using the nix-style monorepo, especially for components that need to be shared between multiple projects. Clean up of unused cells, mechanisms for migrating projects as dependencies update, and stable and fast constraint solving all need to be in place. Without those, a nix-style monorepo will rapidly become an ungovernable mess.

The maintainability issue is enough to make this a particularly poor choice. Consider this the “anti-pattern” of monorepo organisation.

Blended monorepos
It’s unlikely that any monorepo would be purely organised along a single one of these lines; a hybrid approach is typically simpler to work with. These “blended monorepos” attempt to address the weaknesses of each approach with the strengths of another.

As an example, project-based monorepos rapidly have a cluttered top-level directory. However, by splitting by functional area, or language and then functional area, the top-level becomes less cluttered and simultaneously easier to navigate.

For projects or dependencies that are primarily in one language, but with support libraries for other languages, take a case-by-case approach. For something like MySQL, it may make sense to just shovel everything into “c/database/mysql”, since the java library (for example) isn’t particularly large. For other tools, it may make more sense to separate the trees and stitch everything together using the build tool.

Third party dependencies
There is an interesting discussion to be had about where and how to store third party code. Do you take binary dependencies, or pull in the source? Do you store the third party code in a separate third party directory, or alongside first party code? Do you store the dependencies in your repository at all, or push them to something like a Maven artifact repository.

The temptation when checking in the source is that it becomes very easy to accidentally start maintaining a fork of whichever dependency it is. After all, you find a bug, and it’s sooo easy to fix it in place and then forget (or not be allowed) to upstream the fix. The advantage of checking in the source is that you can build from source, allowing you to optimise it as along with the rest of the build. Depending on your build tool, it may be possible to only rely on those parts of the library that are actually necessary for your project.

Checking in the binary artifacts has the disadvantage that source control tools are seldom optimised for storing binaries, so any changes will cause the overall size of the repository to grow (though not a snapshot at a single point in time) The advantage is that build times can be significantly shorter (as all that needs to be done is link the dependency in)

Binary dependencies pulled from third parties can be significantly easier to update. Tools such as maven, nuget, and cocoapods can describe a graph of dependencies, and these graphs can be reified by committing them to your monorepo (giving you stable, repeatable historical builds) or left where they lie and pulled in at build time. As one of the reviewers of this post pointed out, this requires the community the binaries are being pulled from to be well managed: releases must not be overwritten (which can be verified by simple hash checks), and snapshots should be avoided.

Putting labels on these, there are in-tree dependencies and externally managed dependencies, and both come in source and binary flavours.


My thanks to Nathan Fisher, Josh Graham, Will Robertson, and Chris Stevenson for their feedback while writing this post. Some of the side conversations are worth a post all of their own!