Multi- to Monorepo Migration
This script merges multiple independent tiny repositories into a single “monorepo”. Every original repo is moved into its own subdirectory, and branches with the same name are merged together. See Example for the details.
Download the tomono script from github.com/hraban/tomono.
1. Features
- 🕙 Full history of all your prior repos is intact, no changes to checksums
- #️⃣ Signatures of old repos stay valid
- 🔁 Create the monorepo and keep pulling in changes from your minirepos later
- 🔀 Pull in entire new repos as you go, no need to prepare the whole thing at once
- 🏷 Tags are namespaced to avoid clashes, but tag signatures remain valid
- 🉑 Handles branches with weird names (slashes, etc.)
- 👥 No conflicts between files with the same name
- 📁 Every project gets its own subdirectory
2. Usage
Run the tomono script with your config on stdin. Each line holds a repository URL, a name for it, and optionally the subdirectory to put it in (which defaults to the name), in the following format:
$ cat my-repos.txt
git@github.com:mycompany/my-repo-abc.git abc
git@github.com:mycompany/my-repo-def.git def
git@github.com:mycompany/my-lib-uuu.git uuu lib/uuu
git@github.com:mycompany/my-lib-zzz.git zzz lib/zzz
https://gitee.com/shijie/zhongguo.git 中国
Concrete example:
$ cat my-repos.txt | /path/to/tomono
That should be all ✅.
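To make the config format concrete, here is a minimal sketch of how one such line can be split into its three fields, with the subdirectory falling back to the repo name when it is omitted. It mirrors the read loop shown in the Implementation section. The function name parse_config and the arrow output format are made up for this illustration and are not part of tomono.

```bash
# Hypothetical helper for illustration only; not part of tomono.
parse_config() {
    while read -r repourl reponame repopath; do
        # The third column is optional: default to the repo name.
        if [ -z "$repopath" ]; then
            repopath="$reponame"
        fi
        printf '%s -> %s\n' "$reponame" "$repopath"
    done
}

parsed="$(parse_config <<'EOF'
git@github.com:mycompany/my-repo-abc.git abc
git@github.com:mycompany/my-lib-uuu.git uuu lib/uuu
EOF
)"
echo "$parsed"
```

The first line has no third column, so abc lands in a directory named abc, while uuu is placed under lib/uuu.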
2.1. Custom name for monorepo directory
Don’t like the default directory name, core? Set a different name through an environment variable before running the script:
export MONOREPO_NAME=the-big-repo
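The override works through the shell’s default-value expansion, the same `: "${MONOREPO_NAME:=core}"` line that appears in the directory setup code in the Implementation section. A small sketch of both cases, where the variable names default_name and custom_name exist only for this illustration:

```bash
# Case 1: nothing set, so the expansion assigns the default.
unset MONOREPO_NAME
: "${MONOREPO_NAME:=core}"
default_name="$MONOREPO_NAME"

# Case 2: a value is already present, so the default is ignored.
MONOREPO_NAME=the-big-repo
: "${MONOREPO_NAME:=core}"
custom_name="$MONOREPO_NAME"

echo "default: $default_name, custom: $custom_name"
```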
2.2. Custom “master” / “main” branch name
No need to do anything. This script does not handle any master / main branch in any special way. It just merges whatever branches exist. Don’t have a “master” branch? None will be created.
Make sure your own computer has the right branch set up in its init.defaultBranch setting.
2.3. Continue existing migration
Large teams can’t afford to “stop the world” while a migration is in progress. You’ll be fixing stuff and pulling in new repositories as you go.
Here’s how to pull in an entirely new set of repositories:
/path/to/tomono --continue < my-new-repos.txt
Make sure you have your environment set up exactly the same as above. Particularly, you must be in the parent dir of the monorepo.
2.4. Tags
Tags are namespaced per remote, to avoid clashes. If your remotes foo and bar both have a tag v1.0.0, your monorepo ends up with foo/v1.0.0 and bar/v1.0.0 pointing at their relevant commits.
If you don’t like this rewriting, you can fetch all tags from a specific remote to the top-level of the monorepo:
$ git fetch --tags foo
Be prepared to deal with any conflicts.
2.4.1. Lightweight vs. Annotated Tags
N.B.: This namespacing works for all tags: lightweight, annotated, signed. However, for the latter two, there is one snag: an annotated tag contains its own tag name as part of the tag object. I have chosen not to modify the object itself, so the annotated tag object thinks it still has its old name. This is a mixed bag: it depends on your case whether that’s a feature or a bug. One major advantage of this approach is that signed tags remain valid. But you will occasionally get messages like:
$ git describe linux/v5.9-rc4
warning: tag 'linux/v5.9-rc4' is externally known as 'v5.9-rc4'
v5.9-rc4-0-gf4d51dffc6c0
If you know what you’re doing, you can force update all signed and annotated tags to their (nested) ref tag name with the following snippet:
git for-each-ref --format '%(objecttype) %(refname:lstrip=2)' | \
sed -ne 's/^tag //p' |
GIT_EDITOR=true xargs -I + -n 1 -- git tag -f -a + +^{}
N.B.: this will convert all signed tags to regular annotated tags (their signatures would fail anyway).
Source: GitHub user mwasilew2.
3. Example
Run these commands to set up a fresh directory with small git repositories that you can later merge into a monorepo:
3.1. Initial setup of fake repos
d="$(mktemp -d)"
echo "Setting up fresh multi-repos in $d"

cd "$d"

mkdir foo
(
  cd foo
  git init
  git commit -m "foo’s empty root" --allow-empty
  echo "This is foo" > i-am-foo.txt
  git add -A
  git commit -m "foo’s master"
  git tag v1.0
  git checkout -b branch-a
  echo "I am a new foo feature" > feature-a.txt
  git add -A
  git commit -m "foo’s feature branch A"
)

mkdir 中文
(
  cd 中文
  git init
  echo "你好" > 你好.txt
  git add -A
  git commit -m "中文的root"
  git tag v1.0
  git checkout -b branch-a
  echo "你好 from feature-a" > feature-a.txt
  git add -A
  git commit -m "new 中文 feature branch A"
  git branch branch-b master
  git checkout branch-b
  echo "I am an entirely new 中文 feature: B" > feature-b.txt
  git add -A
  git commit -m "中文’s feature branch B"
)
You now have two directories:
- foo (branches: master, branch-a)
- 中文 (branches: master, branch-a, branch-b)
3.2. Combine into monorepo
Assuming the tomono script is in your $PATH, you can invoke it like this, from that same directory:
tomono <<EOF
$PWD/foo foo
$PWD/中文 中文
EOF
This will create a new directory, core, where you can find a git tree which looks somewhat like this:
*   b742af2 Merge 中文/branch-a (branch-a)
|\
| * c05c53c new 中文 feature branch A (中文/branch-a)
* |   a51d138 Merge foo/branch-a
|\ \
| * | ebb490a foo’s feature branch A (foo/branch-a)
* | | a08fa18 Root commit for monorepo branch branch-a
 / /
| | *   c53bf94 Merge 中文/branch-b (branch-b)
| | |\
| | | * 5e7f4f5 中文’s feature branch B (中文/branch-b)
| | |/
| |/|
| | * 2738327 Root commit for monorepo branch branch-b
| |
| | *   9a4b33a Merge 中文/master (HEAD -> master)
| | |\
| | |/
| |/|
| * | a9841a8 中文的root (tag: 中文/v1.0, 中文/master)
|  /
| *   b75840e Merge foo/master
| |\
| |/
|/|
* | 1515265 foo’s master (tag: foo/v1.0, foo/master)
* | f71fcde foo’s empty root
 /
* 7803cf5 Root commit for monorepo branch master
3.3. Pull in new changes from a remote
It’s possible that while you’re working on setting up your fresh monorepo, new changes have been pushed to the existing single repos:
(
  cd foo
  echo New changes >> i-am-foo.txt
  git commit -va -m 'New changes to foo'
)
Because their history was imported verbatim and nothing has been rewritten, you can import those changes into the monorepo.
First, fetch the changes from the remote:
$ cd core
$ git fetch foo
Now merge your changes using subtree merge:
git checkout master
git merge -X subtree=foo/ foo/master
And the updates should be reflected in the monorepo:
$ cat foo/i-am-foo.txt
This is foo
New changes
I used the branch master in this example, but any branch works the same way.
3.4. Continue
Now imagine you want to pull in a third repository into the monorepo:
mkdir zimlib
(
  cd zimlib
  git init
  echo "This is zim" > i-am-zim.txt
  git add -A
  git commit -m "zim’s master"
  git checkout -b branch-a
  echo "I am a new zim feature" > feature-a.txt
  git add -A
  git commit -m "zim’s feature branch A"
  # And some more weird stuff, to mess with you
  git checkout master
  git checkout -d
  echo top secret > james-bond.txt
  git add -A
  git commit -m "I am unreachable"
  git tag leaking-you HEAD
  git checkout --orphan empty-branch
  git rm --cached -r .
  git clean -dfx
  git commit -m "zim’s tricky empty orphan branch" --allow-empty
)
Continue importing it:
echo "$PWD/zimlib zim lib/zim" | tomono --continue
Note that we used a different name for this subrepo, inside the lib dir.
The result is that it gets imported into the existing monorepo, alongside the existing two projects:
$ cd core
$ git checkout master
Switched to branch 'master'
$ tree
.
├── foo
│   └── i-am-foo.txt
├── lib
│   └── zim
│       └── i-am-zim.txt
└── 中文
    └── 你好.txt

4 directories, 3 files
$ git checkout branch-a
Switched to branch 'branch-a'
$ tree
.
├── foo
│   ├── feature-a.txt
│   └── i-am-foo.txt
├── lib
│   └── zim
│       ├── feature-a.txt
│       └── i-am-zim.txt
└── 中文
    ├── feature-a.txt
    └── 你好.txt

4 directories, 6 files
$ head **/feature-a.txt
==> foo/feature-a.txt <==
I am a new foo feature

==> lib/zim/feature-a.txt <==
I am a new zim feature

==> 中文/feature-a.txt <==
你好 from feature-a
4. Implementation
(This section is best viewed in HTML form; the GitHub Readme viewer misses some info.)
The outer program structure is a flat bash script which loops over every repo supplied over stdin:
<<init>>

# Note this is top-level in the script so it’s reading from the script’s stdin
while <<windows-fix>> read -r repourl reponame repopath; do
    if [[ -z "$repopath" ]]; then
        repopath="$reponame"
    fi

    <<handle-remote>>
done

<<finalize>>

# <<copyright>>
References: init, windows-fix, handle-remote, finalize, copyright
4.1. Per repository
Every repository is fetched and fully handled individually, and sequentially:
- fetch all the data related to this repository,
- immediately check out and initialise every single branch which belongs to that repository.
git remote add "$reponame" "$repourl"
git config --add "remote.$reponame.fetch" "+refs/tags/*:refs/tags/$reponame/*"
git config "remote.$reponame.tagOpt" --no-tags
git fetch --atomic "$reponame"

<<list-branches>> | while read -r branch ; do
    <<handle-branch>>
done
References: list-branches, handle-branch
Used by: top-level
The remotes are configured so that a default fetch always fetches all tags, and also puts them in their own namespace. The default refspec for tags is +refs/tags/*:refs/tags/*, which puts everything from the remote at the same level in your monorepo. That would cause clashes, so we add the reponame as an extra namespace.
The --no-tags option is the complement to --tags, which has that default refspec we don’t want. That’s why we disable it and roll our own, entirely.
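The effect of this remote configuration can be reproduced in isolation. The following is a minimal sketch, assuming git is installed; the temporary directory, the two throwaway repos, and the fixed test identity are all invented for the demonstration. It skips the --atomic flag so it also runs on older git versions.

```bash
# Fixed identity so commits work on an unconfigured machine (assumption).
export GIT_AUTHOR_NAME=Test GIT_AUTHOR_EMAIL=test@example.com
export GIT_COMMITTER_NAME=Test GIT_COMMITTER_EMAIL=test@example.com

demo="$(mktemp -d)"
cd "$demo"

# Two throwaway repos that both carry a tag named v1.0.0.
for name in foo bar; do
    git init -q "$name"
    git -C "$name" commit -q --allow-empty -m "$name root"
    git -C "$name" tag v1.0.0
done

# A fresh repo that fetches both, using the namespaced tag refspec.
git init -q mono
cd mono
for name in foo bar; do
    git remote add "$name" "$demo/$name"
    git config --add "remote.$name.fetch" "+refs/tags/*:refs/tags/$name/*"
    git config "remote.$name.tagOpt" --no-tags
    git fetch -q "$name"
done

tags="$(git tag --list)"
echo "$tags"
```

Both tags survive side by side as bar/v1.0.0 and foo/v1.0.0, and no top-level v1.0.0 is created.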
4.2. Per branch (this is where the magic happens)
In the context of a single repository, every branch is independently read into a subdirectory for that repository, and merged into the monorepo.
This is the money shot.
<<move-files-to-subdirectory>>
<<ensure-on-target-branch-in-monorepo>>
git read-tree --prefix "$repopath" "refs/remotes/$reponame/$branch"
tree="$(git write-tree)"
merge_commit="$(git commit-tree \
    "$tree" \
    -p "$branch" \
    -p "$move_commit" \
    -m "Merge $reponame/$branch")"
git reset -q "$merge_commit"
References: move-files-to-subdirectory, ensure-on-target-branch-in-monorepo
Used by: handle-remote
Source: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
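For readers unfamiliar with these plumbing commands, here is a simplified, standalone sketch of the same idea: read a remote branch into the index under a prefix, write the index out as a tree, and wrap that tree in a two-parent commit. It is not the tomono code itself. It leaves out the separate move commit, and the repo names, paths, and test identity are assumptions made for the demonstration.

```bash
# Fixed identity so commits work on an unconfigured machine (assumption).
export GIT_AUTHOR_NAME=Test GIT_AUTHOR_EMAIL=test@example.com
export GIT_COMMITTER_NAME=Test GIT_COMMITTER_EMAIL=test@example.com

demo="$(mktemp -d)"
cd "$demo"

# A throwaway minirepo with one file at its top level.
git init -q mini
echo hello > mini/hello.txt
git -C mini add -A
git -C mini commit -q -m "mini root"
branch="$(git -C mini symbolic-ref --short HEAD)"

# A fresh monorepo that fetches the minirepo as a remote.
git init -q mono
cd mono
git remote add mini "$demo/mini"
git fetch -q mini

# Start from an empty root commit, as a brand new monorepo branch would.
empty_tree="$(git hash-object -t tree /dev/null)"
root_commit="$(git commit-tree "$empty_tree" -m "Root commit")"

# Empty the index, read the remote branch in under a prefix, then
# snapshot the index as a tree and wrap it in a two-parent commit.
git read-tree "$empty_tree"
git read-tree --prefix=sub/mini/ "refs/remotes/mini/$branch"
tree="$(git write-tree)"
merge_commit="$(git commit-tree "$tree" \
    -p "$root_commit" \
    -p "refs/remotes/mini/$branch" \
    -m "Merge mini/$branch")"

listing="$(git ls-tree -r --name-only "$merge_commit")"
parents="$(git rev-list --parents -n 1 "$merge_commit")"
echo "$listing"
```

The worktree is never touched: everything happens in the index and the object database, which is why the real script only checks files out at the very end.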
4.2.1. Move files to a subdirectory
The files are moved in a separate, isolated pre-merge step: this helps keep the merge commit a “pure” merge and helps git log --follow heuristics.
git read-tree "$empty_tree"
git read-tree --prefix "$repopath" "refs/remotes/$reponame/$branch"
tree="$(git write-tree)"
move_commit="$(git commit-tree \
    "$tree" \
    -p "refs/remotes/$reponame/$branch" \
    -m "Move all files to $repopath/")"
Used by: handle-branch
4.2.2. Ensure we are on the right branch
In this snippet, we ensure that we are ready to merge fresh code from a subrepo into this branch: either we checkout an existing branch in the monorepo by this name, or we create a fresh one.
We are given the variable $branch, which is the final name of the branch we want to operate on. It is the same as the name of the branch in each individual target repo.
if ! git show-ref --verify --quiet "refs/heads/$branch"; then
    root_commit="$(git commit-tree \
        "$empty_tree" \
        -m "Root commit for monorepo branch $branch")"
    git branch -- "$branch" "$root_commit"
fi
git symbolic-ref HEAD "refs/heads/$branch"
git reset -q
Used by: handle-branch
Instead of using git checkout --orphan and trying to create a new empty commit from the index, we create the empty commit directly and point the new branch to it. Then, we read the branch, new or existing, into the index. Now we have the current index representing the branch, and HEAD pointing at the branch. This allows us to stay in the index and avoid the worktree.
Working with HEAD feels odd, and it requires using git reset to update the branch, rather than git branch -f ..., because the branch is checked out. This is still more reliable than not pointing HEAD at the branch, because HEAD is always pointing at some branch (e.g. “master”), so it is easier to just assume you’re always pointing at the “current” branch.
4.2.3. Non-goal: merging into root
GitHub user @woopla proposed in #42 the ability to merge a minirepo into the monorepo root, as if you used . as the subdirectory. We ended up not going for it, but it was interesting to investigate how to do this with git read-tree. The closest I got was:
if [[ "$repopath" == "." ]]; then
    # Experimental: is this how git read-tree works? I find it very confusing.
    git read-tree "$branch" "$reponame/$branch"
else
    git read-tree --prefix "$repopath" "$reponame/$branch"
fi
I must confess I find the git read-tree man page too daunting to fully stand by this. I mostly figured it out by trial and error. It seems to work?
If anyone could explain to me exactly what this tool is supposed to do, what those separate stages are (it talks about “stage 0” to “stage 3” in its 3 way merge), and how you would cleanly do this, just for argument’s sake, I’d love to know.
But, as it turned out, this tool already has a way to merge a repo into the root: just make it the monorepo, and use it as a target for a --continue operation. That solves that.
4.3. Set up the monorepo directory
We create a fresh directory for this script to run in, or continue on an existing one if the --continue flag is passed.
# Poor man’s arg parse :/
arg="${1-}"
: "${MONOREPO_NAME:=core}"

case "$arg" in
    "")
        if [[ -d "$MONOREPO_NAME" ]]; then
            >&2 echo "monorepo directory $MONOREPO_NAME already exists"
            exit 1
        fi
        mkdir "$MONOREPO_NAME"
        cd "$MONOREPO_NAME"
        git init
        ;;

    "--continue")
        if [[ ! -d "$MONOREPO_NAME" ]]; then
            >&2 echo "Asked to --continue, but monorepo directory $MONOREPO_NAME doesn’t exist"
            exit 1
        fi
        cd "$MONOREPO_NAME"
        if git status --porcelain | grep . ; then
            >&2 echo "Git status shows pending changes in the repo. Cannot --continue."
            exit 1
        fi
        # There isn’t anything special about --continue, really.
        ;;

    "--help" | "-h" | "help")
        cat <<EOF
Usage: tomono [--continue]

For more information, see the documentation at "https://tomono.0brg.net".
EOF
        exit 0
        ;;

    *)
        >&2 echo "Unexpected argument: $arg"
        >&2 echo
        >&2 echo "Usage: tomono [--continue]"
        exit 1
        ;;
esac
Used by: init
Most of this rigmarole is about UI, and preventing mistakes. As you can see, there is functionally no difference between continuing and starting fresh, beyond mkdir and git init. At the end of the day, every repo is read in greedily, and whether you do that on an existing monorepo, or a fresh one, doesn’t matter: every repo name you read in is in fact itself like a --continue operation.
It’s horrible and kludgy but I just want to get something working out the door, for now.
4.4. List individual branches
I want a single branch name per line on stdout, for a single specific remote:
git branch -r --no-color --list "$reponame/*" --format "%(refname:lstrip=3)"
Used by: handle-remote
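The lstrip=3 part is what turns a full ref like refs/remotes/foo/master into a bare branch name, while leaving any slashes in the branch name itself alone. A rough pure-shell imitation of that stripping, for illustration only; the lstrip function below is a hypothetical helper and is not how git implements the format atom:

```bash
# Hypothetical helper: drop the first n slash-separated components,
# imitating what %(refname:lstrip=n) does inside git.
lstrip() {
    ref="$1"
    n="$2"
    while [ "$n" -gt 0 ]; do
        ref="${ref#*/}"
        n=$((n - 1))
    done
    printf '%s\n' "$ref"
}

plain="$(lstrip refs/remotes/foo/master 3)"
nested="$(lstrip refs/remotes/foo/feature/login 3)"
echo "$plain"
echo "$nested"
```

Exactly three components (refs, remotes, and the remote name) are removed, so a branch called feature/login comes out intact.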
4.4.1. Implementations that didn’t make the cut
Solutions I abandoned, due to one shortcoming or another:
- git branch -r with grep
  The most straightforward way to list branch names:
  $ git branch -r
  bar/branch-a
  bar/branch-b
  bar/master
  foo/branch-a
  foo/master
  This could be combined with grep to filter all branches for a specific remote, and filter out the name. It’s very close, but how do you reliably remove an unknown string?
- find .git/refs/hooks
  ( cd ".git/refs/remotes/$reponame" && find . -type f -mindepth 1 | sed -e s/..// )
  Closer, but ugly, and I got reports that it missed some branches (although I was never able to repro).
- git ls-remote
  git ls-remote --heads --refs "$reponame" | sed 's_[^ ]* *refs/heads/__'
  Originally suggested in PR 39, I’ve decided not to use this because git-ls-remote actively queries the remote to list its branches, rather than inspecting the local state of whatever we just fetched. That feels like a race condition at best, and becomes very annoying if you’re dealing with password protected remotes or otherwise inaccessible repos.
4.5. Init & finalize
Initialization is what you’d expect from a shell script:
<<set-flags>>
<<prep-dir>>
empty_tree="$(git hash-object -t tree /dev/null)"
References: set-flags, prep-dir
Used by: top-level
On the other side, when done, update the working tree to whatever the current branch is to avoid any confusion:
git checkout .
Used by: top-level
4.5.1. Error flags, warnings, debug
Various sh flags allow us to control the behaviour of the shell: treat any unknown variable reference as an error, treat any non-zero exit status in a pipeline as an error (instead of only looking at the last program), and treat any error as fatal and quit. Additionally, if the DEBUGSH environment variable is set, enable “debug” mode by echoing every command before it gets executed.
set -euo pipefail ${DEBUGSH+-x}
if ((BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 4))); then
    shopt -s inherit_errexit
fi
Used by: init
The snippet also contains a monstrosity which is essentially a version guard around the inherit_errexit option, which was only introduced in Bash 4.4. Notably, Mac’s default bash doesn’t support it, so the version guard is useful.
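The `${DEBUGSH+-x}` expansion may look cryptic: it produces the word -x whenever DEBUGSH is set, even to an empty string, and nothing at all when it is unset. A short sketch of the three cases, where the when_* variable names are only for this illustration:

```bash
# Unset: the expansion yields nothing, so no extra flag is passed.
unset DEBUGSH
when_unset="${DEBUGSH+-x}"

# Set but empty: without a colon, only "set or not" matters.
DEBUGSH=
when_empty="${DEBUGSH+-x}"

# Set to any value: the flag is produced.
DEBUGSH=true
when_set="${DEBUGSH+-x}"

printf 'unset=[%s] empty=[%s] set=[%s]\n' "$when_unset" "$when_empty" "$when_set"
```

Because the expansion is left unquoted in the set command, the unset case disappears entirely instead of becoming an empty argument.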
4.5.2. Windows newline fix
On Windows the config file could contain Windows line endings (CRLF). Bash doesn’t handle those as proper field separators. Even on Windows…
We force it by adding CR as a field separator:
IFS=$'\r'"$IFS"
Used by: top-level
It can’t hurt to do this on other computers, because who has a carriage return in their repo name or path? Nobody does.
The real question is: why is this not standard in Bash for Windows? Who knows. I’d add it to my .bashrc if I were you 🤷♀️.
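To see what the CR fix changes, here is a small sketch that reads the same CRLF-terminated config line twice, once with the default IFS and once with CR added. The example.com URL and the variable names name_default and name_fixed are made up for the illustration; the carriage return is built with printf so the snippet does not depend on the `$'\r'` quoting used in the script.

```bash
# A literal carriage return, and a config line that ends in one.
cr="$(printf '\r')"
line="https://example.com/repo.git abc$cr"

# Default IFS: the carriage return sticks to the repo name.
read -r repourl reponame repopath <<EOF
$line
EOF
name_default="$reponame"

# With CR added to IFS it acts as a field separator and is dropped.
IFS="$cr$IFS" read -r repourl reponame repopath <<EOF
$line
EOF
name_fixed="$reponame"

echo "${#name_default} ${#name_fixed}"
```

The first read leaves a four-character name (abc plus the invisible CR), the second the expected three.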
5. Tests
(This section is best viewed in HTML form; the GitHub Readme viewer misses some info.)
The examples from this document can be combined into a test script:
set -euo pipefail ${DEBUGSH+-x}
if ((BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 4))); then
    shopt -s inherit_errexit
fi

# In tests always echo the command:
set -x
export DEBUGSH=true

# The tomono script is tangled right next to the test script
export PATH="$PWD:$PATH"

# Ensure testing always works even on unconfigured CI etc
export GIT_AUTHOR_NAME="Test"
export GIT_AUTHOR_EMAIL="test@test.com"
export GIT_COMMITTER_NAME="Test"
export GIT_COMMITTER_EMAIL="test@test.com"

d="$(mktemp -d)"
echo "Setting up fresh multi-repos in $d"

cd "$d"

mkdir foo
(
  cd foo
  git init
  git commit -m "foo’s empty root" --allow-empty
  echo "This is foo" > i-am-foo.txt
  git add -A
  git commit -m "foo’s master"
  git tag v1.0
  git checkout -b branch-a
  echo "I am a new foo feature" > feature-a.txt
  git add -A
  git commit -m "foo’s feature branch A"
)

mkdir 中文
(
  cd 中文
  git init
  echo "你好" > 你好.txt
  git add -A
  git commit -m "中文的root"
  git tag v1.0
  git checkout -b branch-a
  echo "你好 from feature-a" > feature-a.txt
  git add -A
  git commit -m "new 中文 feature branch A"
  git branch branch-b master
  git checkout branch-b
  echo "I am an entirely new 中文 feature: B" > feature-b.txt
  git add -A
  git commit -m "中文’s feature branch B"
)

mkdir zimlib
(
  cd zimlib
  git init
  echo "This is zim" > i-am-zim.txt
  git add -A
  git commit -m "zim’s master"
  git checkout -b branch-a
  echo "I am a new zim feature" > feature-a.txt
  git add -A
  git commit -m "zim’s feature branch A"
  # And some more weird stuff, to mess with you
  git checkout master
  git checkout -d
  echo top secret > james-bond.txt
  git add -A
  git commit -m "I am unreachable"
  git tag leaking-you HEAD
  git checkout --orphan empty-branch
  git rm --cached -r .
  git clean -dfx
  git commit -m "zim’s tricky empty orphan branch" --allow-empty
)

tomono <<EOF
$PWD/foo foo
$PWD/中文 中文
EOF

echo "$PWD/zimlib zim lib/zim" | tomono --continue

(
  cd core

  echo "Checking branch list"
  diff -u <(git branch --no-color --list --format "%(refname:lstrip=2)" | sort) <(cat <<EOF
branch-a
branch-b
empty-branch
master
EOF
)

  echo "Checking master"
  git checkout master
  diff -u <(find . -name '*.txt' | sort | xargs head) <(cat <<EOF
==> ./foo/i-am-foo.txt <==
This is foo

==> ./lib/zim/i-am-zim.txt <==
This is zim

==> ./中文/你好.txt <==
你好
EOF
)

  echo "Checking branch-a"
  git checkout branch-a
  diff -u <(find . -name '*.txt' | sort | xargs head) <(cat <<EOF
==> ./foo/feature-a.txt <==
I am a new foo feature

==> ./foo/i-am-foo.txt <==
This is foo

==> ./lib/zim/feature-a.txt <==
I am a new zim feature

==> ./lib/zim/i-am-zim.txt <==
This is zim

==> ./中文/feature-a.txt <==
你好 from feature-a

==> ./中文/你好.txt <==
你好
EOF
)
)

mkdir duplicates
(
  cd duplicates
  git init -b check-dupes
  echo a > a
  echo b > b
  git add -A
  git commit -m commit1 a
  git tag check-dupes
  git commit -m commit2 b
)

echo "$PWD/duplicates duplicates" | tomono --continue

(
  cd core
  git checkout check-dupes
  # This file must exist
  diff -u duplicates/a <(echo a)
  # This file too
  diff -u duplicates/b <(echo b)
)
All we needed to write was the code that actually evaluates the tests and fixtures.
I use that weird diff -u <(..) trick instead of a string compare like [[ "foo" == "..." ]], because the diff shows you where the problem is, instead of just failing the test without comment.
5.1. Edge case: same branch and tag name
If you have a branch and tag with the same name in a git repo, you will be familiar with this error:
warning: refname 'foo' is ambiguous.
See #53. This happens whenever you refer to the tag or branch by its bare name, without specifying whether it’s a tag or a branch. To fix this, the monorepo script must always use refs/heads/... to specify the branch name.
Example:
mkdir duplicates
(
  cd duplicates
  git init -b check-dupes
  echo a > a
  echo b > b
  git add -A
  git commit -m commit1 a
  git tag check-dupes
  git commit -m commit2 b
)
We now have a duplicates repository with a branch and tag check-dupes, pointing at different revisions. After including it in the monorepo:
echo "$PWD/duplicates duplicates" | tomono --continue
We should get:
(
  cd core
  git checkout check-dupes
  # This file must exist
  diff -u duplicates/a <(echo a)
  # This file too
  diff -u duplicates/b <(echo b)
)
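The underlying ambiguity is easy to reproduce outside tomono. The sketch below, assuming git is installed, builds a throwaway repo with a branch and a tag of the same name and compares what the bare name resolves to against the fully qualified refs. The directory, the test identity, and the variable names are assumptions for the demonstration.

```bash
# Fixed identity so commits work on an unconfigured machine (assumption).
export GIT_AUTHOR_NAME=Test GIT_AUTHOR_EMAIL=test@example.com
export GIT_COMMITTER_NAME=Test GIT_COMMITTER_EMAIL=test@example.com

demo="$(mktemp -d)"
cd "$demo"
git init -q
# Name the unborn branch check-dupes (works without "git init -b").
git symbolic-ref HEAD refs/heads/check-dupes

git commit -q --allow-empty -m commit1
git tag check-dupes                 # tag stays on commit1
git commit -q --allow-empty -m commit2   # branch moves on to commit2

tag_commit="$(git rev-parse --verify -q refs/tags/check-dupes)"
branch_commit="$(git rev-parse --verify -q refs/heads/check-dupes)"
# The bare name triggers the "ambiguous" warning on stderr.
bare_commit="$(git rev-parse check-dupes 2>/dev/null)"

echo "tag:    $tag_commit"
echo "branch: $branch_commit"
echo "bare:   $bare_commit"
```

The bare name resolves to the tag rather than the branch, which is why spelling out refs/heads/... is the safe way to address a branch.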
6. Copyright and license
This is a cleanroom reimplementation of the tomono.sh script, originally written with copyright assigned to Ravelin Ltd., a UK fraud detection company. There were some questions around licensing, and it was unclear how to go forward with maintenance of this project given its dispersed copyright, so I went ahead and rewrote the entire thing for a fresh start.
The license and copyright attribution of this entire document can now be set:
Copyright © 2020, 2022–2024 Hraban Luyat

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
Used by: top-level
I did not look at the original implementation at all while developing this.