Commit c44c585

Merge pull request git#391 from newren/master
rn-54: add an introduction to git-filter-repo
2 parents 39bc34a + a0f39e6 commit c44c585

1 file changed

+247
-0
lines changed

rev_news/drafts/edition-54.md

Lines changed: 247 additions & 0 deletions
@@ -78,6 +78,253 @@ This edition covers what happened during the month of July 2019.
7878
### Support
7979
-->
8080

81+
## An Introduction to git-filter-repo
82+
83+
There is a new tool available for surgery on git repositories:
84+
[git-filter-repo](https://github.com/newren/git-filter-repo). It
85+
claims to have [many new unique
86+
features](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool),
87+
[good
88+
performance](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/),
89+
and an ability to scale -- from making simple history rewrites
90+
trivial, to facilitating the creation of entirely new tools which
91+
leverage existing capabilities to handle more complex cases.
92+
93+
You can read more about [common use cases and base capabilities of
94+
filter-repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L17-L55),
95+
but in this article, I'd like to focus on two things: providing a simple
96+
example to give a brief flavor of git-filter-repo usage, and answering a
97+
few likely questions about its purpose and rationale (including a short
98+
comparison to other tools). I will provide several links along the way for
99+
curious folks to learn more.
100+
101+
### A simple example
102+
103+
Let's start with a simple example that has come up a lot for me:
104+
extracting a piece of an existing repository and preparing it to be
105+
merged into some larger monorepository. So, we want to:
106+
107+
* extract the history of a single directory, src/. This means that only
108+
paths under src/ remain in the repo, and any commits that only touched
109+
paths outside this directory will be removed.
110+
* rename all files to have a new leading directory, my-module/ (e.g. so that
111+
src/foo.c becomes my-module/src/foo.c)
112+
* rename any tags in the extracted repository to have a 'my-module-'
113+
prefix (to avoid any conflicts when we later merge this repo into
114+
something else)
115+
116+
Doing this with filter-repo is as simple as the following command:
117+
```shell
118+
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
119+
```
120+
(the single quotes are unnecessary, but make it clearer to a human that we
121+
are replacing the empty string as a prefix with `my-module-`)
122+
123+
By contrast, filter-branch comes with a pile of caveats even once you
124+
figure out the necessary (OS-dependent) invocation(s):
125+
126+
```shell
127+
git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&my-module/-" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
128+
git clone file://$(pwd) newcopy
129+
cd newcopy
130+
git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
131+
git gc --prune=now
132+
```
133+
134+
BFG is not capable of this type of rewrite, and this type of rewrite is
135+
difficult to perform safely using fast-export and fast-import directly.
136+
137+
You can find a lot more examples in [filter-repo's
138+
manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L434).
139+
(If you are curious about the "pile of caveats" mentioned above or the
140+
reasons for the extra steps for filter-branch, you can [read more
141+
details about this
142+
example](https://github.com/newren/git-filter-repo#example-usage-comparing-to-filter-branch)).
143+
144+
### Why a new tool instead of contributing to other tools?
145+
146+
There are two well known tools in the repository rewriting space:
147+
148+
* [git-filter-branch](https://git-scm.com/docs/git-filter-branch)
149+
* [BFG Repo Cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
150+
151+
and two lesser-known tools:
152+
153+
* [reposurgeon](http://www.catb.org/~esr/reposurgeon/reposurgeon.html)
154+
* [git-fast-export](https://git-scm.com/docs/git-fast-export) and
155+
[git-fast-import](https://git-scm.com/docs/git-fast-import)
156+
157+
(While fast-export and fast-import themselves are well known, they are
158+
usually thought of as export-to-another-VCS or import-from-another-VCS
159+
tools, though they also work for git->git transitions.)
160+
161+
I will briefly discuss each.
162+
163+
#### filter-branch and BFG
164+
165+
It's natural to ask why, if these well-known tools lacked features I
166+
wanted, they could not have been extended instead of creating a new tool.
167+
In short, they were philosophically the wrong starting point for extension
168+
and they also had the wrong architecture or design to support such an
169+
effort.
170+
171+
From the philosophical angle:
172+
173+
* BFG: easy to use flags for some common cases, but not extensible
174+
* filter-branch: relatively versatile capability via user-specified
175+
shell commands, but rapidly becomes very difficult to use beyond
176+
trivial cases, especially as usability defaults increasingly
177+
conflict and cause problems.
178+
179+
I wanted something that made the easy cases simple like BFG, but which
180+
would scale up to more difficult cases and have versatility beyond that
181+
which filter-branch provides.
182+
183+
From the technical architecture/design angle:
184+
185+
* BFG: works on packfiles and packed-refs, directly rewriting tree and
186+
blob objects; Roberto proved you can get a lot done with this design
187+
with his work on the BFG (as many people who have used his tool can
188+
attest), but this design does not permit things like differentiating
189+
paths in different directories with the same basename, nor could it be
190+
used to allow renaming of paths (except within the same directory).
191+
Further, this design even sadly runs into a
192+
[lot](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L32-L39)
193+
[of](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L29-L31)
194+
[roadblocks](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L23-L26)
195+
[and](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L64-L66)
196+
[limitations](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L27-L28)
197+
even within its intended use case of removing big or sensitive content.
198+
199+
* filter-branch: performance really shouldn't matter for a one-shot
200+
usage tool, but filter-branch can turn a few-hour rewrite
201+
(allowing an overnight downtime) into an intractable three month
202+
wait. Further, its design architecture leaks through every level
203+
of the interface, making it nearly impossible to change anything
204+
about the slow design without having backward compatibility
205+
issues. These issues are well known, but what is less well known
206+
is that even ignoring performance, [the usability choices in
207+
filter-branch rapidly become increasingly conflicting and
208+
problematic](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/filter-lamely#L9-L61)
209+
for users with larger repos and more involved rewrites,
210+
difficulties that again cannot be ameliorated without breaking
211+
backward compatibility.
212+
213+
#### reposurgeon
214+
215+
Some brief impressions about reposurgeon:
216+
217+
* Appears to be
218+
[almost](http://www.catb.org/~esr/reposurgeon/features.html)
219+
[exclusively](http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html)
220+
focused on transitioning between different version control systems
221+
(cvs, svn, hg, bzr, git, etc.), and in particular handling the myriad
222+
edge and corner cases that arise in transitioning from CVS or SVN to a
223+
DVCS.
224+
* Provides very thorough reference-style documentation; if you read all
225+
reposurgeon documentation, you will likely feel as though you can take
226+
an existing example and modify it in many ways.
227+
* [Absolutely no full-fledged
228+
examples](https://public-inbox.org/git/CAA01Csq0eX2L5cKpjjySs+4e0Sm+vp=10C_SAkE6CLpCHBWZ8g@mail.gmail.com/)
229+
[or user-guide style
230+
documentation](https://public-inbox.org/git/CAA01Csp+RpCXO4noewBOMY6qJiBy=Gshv3rUh83ZY2RJ5Th3Ww@mail.gmail.com/)
231+
are provided for getting started.
232+
* Appears to not have any facilities for quick (in terms of time spent by
233+
the user) conversions similar to filter-branch, BFG, or filter-repo.
234+
Users who want such capabilities are likely to be frustrated by
235+
reposurgeon and give up.
236+
* Strikes me as "GDB for history rewriting"; it has lots of facilities
237+
for manually inspecting and editing, but is not intended for the
238+
first-time or casual history spelunker. Only those who view history
239+
spelunking as a frequent hobby or job are likely to dive in. And it's
240+
not quite clear whether it is only useful to those transitioning from
241+
CVS/SVN or whether the facilities would also be useful to others.
242+
* Built on top of fast-export and fast-import, which I contend is the
243+
right architecture for a history filtering tool (see below).
244+
245+
I have read the reposurgeon documentation multiple times over the years,
246+
and am almost at a point where I feel like I know how to get started with
247+
it. I haven't had a need to convert a CVS or SVN repo in over a decade; if
248+
I had such a need, perhaps I'd persevere and learn more about it. I
249+
suspect it has some good ideas I could apply to filter-repo. But I haven't
250+
managed to get started with reposurgeon, so clearly my impressions of it
251+
should be taken with a grain of salt.
252+
253+
#### fast-export and fast-import
254+
255+
Finally, fast-export and fast-import can be used with a little editing of
256+
the fast-export output to handle a number of history rewriting cases. I
257+
have done this many times, but it has some
258+
[drawbacks](https://public-inbox.org/git/CABPp-BGL-3_nhZSpt0Bz0EVY-6-mcbgZMmx4YcXEfA_ZrTqFUw@mail.gmail.com/):
259+
260+
* Dangerous for programmatic edits: It's tempting to use sed or perl
261+
one-liners to e.g. try to modify filenames, but you risk accidentally
262+
also munging unrelated data such as commit messages, file contents, and
263+
branch and tag names.
264+
* Easy to miss corner cases: for example, fast-export only quotes
265+
filenames when necessary; as such, your attempt to rename a directory
266+
might leave files with spaces or UTF-8 characters in their original
267+
location.
268+
* Difficult to directly provide higher level facilities: for example,
269+
rewriting (possibly abbreviated) commit hashes in commit messages to
270+
refer to the new commit hashes, or stripping of non-merge commits which
271+
become empty or merge commits which become degenerate and empty.
272+
* Misses a lot of pieces needed to round things out into a usable
273+
tool.
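
The first two drawbacks can be illustrated with a small sketch. The stream below is a hand-written, abbreviated stand-in for `git fast-export` output (the file names, identity, and contents are hypothetical), and renaming `src/` to `lib/` is only an example. A global `sed` touches the blob contents and the commit message as well as the paths, while a more careful, anchored `sed` skips the path that fast-export had to quote:

```shell
#!/bin/sh
# Hand-written stand-in for fast-export output (hypothetical data).
stream=$(cat <<'EOF'
blob
mark :1
data 21
see src/ for details
commit refs/heads/master
mark :2
committer A U Thor <author@example.com> 1234567890 +0000
data 22
Move parser into src/
M 100644 :1 src/README
M 100644 :1 "src/release notes.txt"

EOF
)

# Naive global edit: rewrites every occurrence of src/, wherever it appears.
naive=$(printf '%s\n' "$stream" | sed 's|src/|lib/|g')

# Anchored edit: only filechange lines, but it assumes an unquoted path.
anchored=$(printf '%s\n' "$stream" | sed 's|^\(M [0-9]* :[0-9]* \)src/|\1lib/|')

# Lines altered by the naive edit (only the two "M" lines were intended).
over=$(printf '%s\n' "$naive" | grep -c 'lib/')

# Filechange lines the anchored edit left behind because the path was quoted.
missed=$(printf '%s\n' "$anchored" | grep -c '^M .*"src/')

echo "naive edit altered $over lines; anchored edit missed $missed quoted path(s)"
```

Here the replacement happens to have the same length as the original, so the `data` byte counts stay valid. A replacement of a different length would also leave those counts wrong for any blob or message it touched, and fast-import would then misread the stream.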
274+
275+
However, fast-export and fast-import are the right architecture for
276+
building a repository filtering tool on top of; they are fast, provide
277+
access to almost all aspects of a repository in a very machine-parseable
278+
format, and will continue to gain features and capabilities over time
279+
(e.g. when replace refs were added, fast-export and fast-import immediately
280+
gained support). To create a full repository surgery tool, you "just" need
281+
to [combine fast-export and fast-import together with a whole lot of
282+
parsing and
283+
glue](https://github.com/newren/git-filter-repo#how-filter-repo-works),
284+
which, in a nutshell, is what filter-repo is.
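
As a rough sketch of that architecture (not filter-repo's actual implementation), the pipeline below exports a throwaway repository, passes the stream through a filter stage, and imports it into a fresh repository. The repository contents and the `filter_stream` function are hypothetical; the function is an identity placeholder standing in for the parsing and rewriting a real tool would do.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Build a tiny throwaway source repository (hypothetical contents).
git init -q source
mkdir source/src
echo 'int main(void) { return 0; }' > source/src/foo.c
git -C source add src/foo.c
git -C source -c user.name='A U Thor' -c user.email='author@example.com' \
    commit -q -m 'Add foo'
git -C source tag v1.0

# Placeholder filter stage: passes the stream through unchanged.
# A real tool would parse the stream here and emit modified commands.
filter_stream() {
    cat
}

# export -> filter -> import into a brand-new repository
git init -q target
git -C source fast-export --all | filter_stream | git -C target fast-import --quiet

src_count=$(git -C source rev-list --all --count)
dst_count=$(git -C target rev-list --all --count)
echo "source commits: $src_count, target commits: $dst_count"
```

With an identity filter, the target should end up with the same commits and tags as the source. The hard part lies entirely inside the filter stage, which is where the drawbacks listed earlier come from.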
285+
286+
#### Upstream improvements
287+
288+
But to circle back to the question of improving existing tools, during the
289+
development of filter-repo and its predecessor, lots of [improvements to
290+
both fast-export and
291+
fast-import](https://github.com/newren/git-filter-repo/tree/develop#upstream-improvements)
292+
were submitted and included in git.git.
293+
294+
(Also, [filter-repo started in early 2009 as
295+
git_fast_filter.py](https://public-inbox.org/git/[email protected]/)
296+
and therefore technically predates both BFG and reposurgeon.)
297+
298+
### Why not a builtin command?
299+
300+
One could ask why this new command is not written in C like most of git.
301+
While that would have several advantages, a C builtin would not meet the necessary
302+
design requirements. See the ["VERSATILITY" section of the
303+
manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L306-L326)
304+
or see the "Versatility" section under the [Design Rationale of the
305+
README](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool).
306+
307+
Technically, we could perhaps provide a mechanism for people to write
308+
and compile plugins that a builtin command could load, but having users
309+
write filtering functions in C sounds suboptimal, and requiring gcc for
310+
filter-repo sounds more onerous than using python.
311+
312+
### Where to from here?
313+
314+
This was just a quick intro to filter-repo, and I've provided a lot of
315+
links above if you want to learn more. Here are a few more that might be of
316+
interest:
317+
318+
* [Ramifications of repository
319+
rewrites](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L340-L350);
320+
including
321+
[some](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L376-L410)
322+
[tips](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L426-L431)
323+
(not specific to filter-repo)
324+
* [Finding big objects/directories/extensions (and renames) in your
325+
repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L356-L361)
326+
(can be used together with tools other than filter-repo too)
327+
* [Creating new history rewriting tools](https://github.com/newren/git-filter-repo/tree/master/contrib/filter-repo-demos)
81328

82329
## Developer Spotlight: Jean-Noël Avila
83330
