@@ -78,6 +78,253 @@ This edition covers what happened during the month of July 2019.
### Support
-->
+ ## An Introduction to git-filter-repo
+
+ There is a new tool available for surgery on git repositories:
+ [git-filter-repo](https://github.com/newren/git-filter-repo). It
+ claims to have [many new unique
+ features](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool),
+ [good
+ performance](https://public-inbox.org/git/CABPp-BGOz8nks0+Tdw5GyGqxeYR-3FF6FT5JcgVqZDYVRQ6qog@mail.gmail.com/),
+ and an ability to scale -- from making simple history rewrites
+ trivial, to facilitating the creation of entirely new tools which
+ leverage existing capabilities to handle more complex cases.
+
+ You can read more about [common use cases and base capabilities of
+ filter-repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L17-L55),
+ but in this article, I'd like to focus on two things: providing a simple
+ example to give a very brief flavor of git-filter-repo usage, and answering a
+ few likely questions about its purpose and rationale (including a short
+ comparison to other tools). I will provide several links along the way for
+ curious folks to learn more.
+
+ ### A simple example
+
+ Let's start with a simple example that has come up a lot for me:
+ extracting a piece of an existing repository and preparing it to be
+ merged into some larger monorepository. So, we want to:
+
+ * extract the history of a single directory, src/. This means that only
+   paths under src/ remain in the repo, and any commits that only touched
+   paths outside this directory will be removed.
+ * rename all files to have a new leading directory, my-module/ (e.g. so that
+   src/foo.c becomes my-module/src/foo.c)
+ * rename any tags in the extracted repository to have a 'my-module-'
+   prefix (to avoid any conflicts when we later merge this repo into
+   something else)
+
+ Doing this with filter-repo is as simple as the following command:
+ ```shell
+ git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
+ ```
+ (the single quotes are unnecessary, but make it clearer to a human that we
+ are replacing the empty string as a prefix with `my-module-`)
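
One rough way to picture what the three options do to names is the plain-shell sketch below. It is only an illustration of the mapping described in the list above: the path and tag lists are made up, nothing is read from a repository, and it does not model history rewriting or the pruning of commits that become empty.

```shell
# Hypothetical inputs for illustration; not read from a real repository.
paths='src/foo.c
src/lib/bar.c
docs/README.md
Makefile'
tags='v1.0
v1.1'

# --path src/ keeps only paths under src/;
# --to-subdirectory-filter my-module then prefixes each surviving path.
new_paths=$(printf '%s\n' "$paths" | grep '^src/' | sed 's|^|my-module/|')

# --tag-rename '':'my-module-' swaps the empty prefix for my-module-,
# which amounts to prepending my-module- to every tag name.
new_tags=$(printf '%s\n' "$tags" | sed 's|^|my-module-|')

printf '%s\n' "$new_paths" "$new_tags"
```

Here `docs/README.md` and `Makefile` drop out, the two `src/` paths gain the `my-module/` prefix, and both tags gain the `my-module-` prefix.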
+
+ By contrast, filter-branch comes with a pile of caveats even once you
+ figure out the necessary (OS-dependent) invocation(s):
+
+ ```shell
+ git filter-branch --index-filter 'git ls-files | grep -v ^src/ | xargs git rm -q --cached; git ls-files -s | sed "s-$(printf \\t)-&my-module/-" | git update-index --index-info; git ls-files | grep -v ^my-module/ | xargs git rm -q --cached' --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
+ git clone file://$(pwd) newcopy
+ cd newcopy
+ git for-each-ref --format="delete %(refname)" refs/tags/ | grep -v refs/tags/my-module- | git update-ref --stdin
+ git gc --prune=now
+ ```
+
+ BFG is not capable of this type of rewrite, and this type of rewrite is
+ difficult to perform safely using fast-export and fast-import directly.
+
+ You can find a lot more examples in [filter-repo's
+ manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L434).
+ (If you are curious about the "pile of caveats" mentioned above or the
+ reasons for the extra steps for filter-branch, you can [read more
+ details about this
+ example](https://github.com/newren/git-filter-repo#example-usage-comparing-to-filter-branch).)
+
+ ### Why a new tool instead of contributing to other tools?
+
+ There are two well-known tools in the repository rewriting space:
+
+ * [git-filter-branch](https://git-scm.com/docs/git-filter-branch)
+ * [BFG Repo Cleaner](https://rtyley.github.io/bfg-repo-cleaner/)
+
+ and two lesser-known tools:
+
+ * [reposurgeon](http://www.catb.org/~esr/reposurgeon/reposurgeon.html)
+ * [git-fast-export](https://git-scm.com/docs/git-fast-export) and
+   [git-fast-import](https://git-scm.com/docs/git-fast-import)
+
+ (While fast-export and fast-import themselves are well known, they are
+ usually thought of as export-to-another-VCS or import-from-another-VCS
+ tools, though they also work for git->git transitions.)
+
+ I will briefly discuss each.
+
+ #### filter-branch and BFG
+
+ It's natural to ask why, if these well-known tools lacked features I
+ wanted, they could not have been extended instead of creating a new tool.
+ In short, they were philosophically the wrong starting point for extension
+ and they also had the wrong architecture or design to support such an
+ effort.
+
+ From the philosophical angle:
+
+ * BFG: easy-to-use flags for some common cases, but not extensible
+ * filter-branch: relatively versatile capability via user-specified
+   shell commands, but rapidly becomes very difficult to use beyond
+   trivial cases, especially as usability defaults increasingly
+   conflict and cause problems.
+
+ I wanted something that made the easy cases simple like BFG, but which
+ would scale up to more difficult cases and have versatility beyond that
+ which filter-branch provides.
+
+ From the technical architecture/design angle:
+
+ * BFG: works on packfiles and packed-refs, directly rewriting tree and
+   blob objects; Roberto proved you can get a lot done with this design
+   with his work on the BFG (as many people who have used his tool can
+   attest), but this design does not permit things like differentiating
+   paths in different directories with the same basename, nor could it be
+   used to allow renaming of paths (except within the same directory).
+   Further, this design even sadly runs into a
+   [lot](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L32-L39)
+   [of](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L29-L31)
+   [roadblocks](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L23-L26)
+   [and](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L64-L66)
+   [limitations](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/bfg-ish#L27-L28)
+   even within its intended use case of removing big or sensitive content.
+
+ * filter-branch: performance really shouldn't matter for a one-shot
+   usage tool, but filter-branch can turn a few-hour rewrite
+   (allowing an overnight downtime) into an intractable three-month
+   wait. Further, its design architecture leaks through every level
+   of the interface, making it nearly impossible to change anything
+   about the slow design without having backward compatibility
+   issues. These issues are well known, but what is less well known
+   is that even ignoring performance, [the usability choices in
+   filter-branch rapidly become increasingly conflicting and
+   problematic](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/contrib/filter-repo-demos/filter-lamely#L9-L61)
+   for users with larger repos and more involved rewrites,
+   difficulties that again cannot be ameliorated without breaking
+   backward compatibility.
+
+ #### reposurgeon
+
+ Some brief impressions about reposurgeon:
+
+ * Appears to be
+   [almost](http://www.catb.org/~esr/reposurgeon/features.html)
+   [exclusively](http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html)
+   focused on transitioning between different version control systems
+   (cvs, svn, hg, bzr, git, etc.), and in particular handling the myriad
+   edge and corner cases that arise in transitioning from CVS or SVN to a
+   DVCS.
+ * Provides very thorough reference-style documentation; if you read all
+   reposurgeon documentation, you will likely feel as though you can take
+   an existing example and modify it in many ways.
+ * [Absolutely no full-fledged
+   examples](https://public-inbox.org/git/CAA01Csq0eX2L5cKpjjySs+4e0Sm+vp=10C_SAkE6CLpCHBWZ8g@mail.gmail.com/)
+   [or user-guide style
+   documentation](https://public-inbox.org/git/CAA01Csp+RpCXO4noewBOMY6qJiBy=Gshv3rUh83ZY2RJ5Th3Ww@mail.gmail.com/)
+   are provided for getting started.
+ * Appears to not have any facilities for quick (in terms of time spent by
+   the user) conversions similar to filter-branch, BFG, or filter-repo.
+   Users who want such capabilities are likely to be frustrated by
+   reposurgeon and give up.
+ * Strikes me as "GDB for history rewriting"; it has lots of facilities
+   for manually inspecting and editing, but is not intended for the
+   first-time or casual history spelunker. Only those who view history
+   spelunking as a frequent hobby or job are likely to dive in. And it's
+   not quite clear whether it is only useful to those transitioning from
+   CVS/SVN or whether the facilities would also be useful to others.
+ * Built on top of fast-export and fast-import, which I contend is the
+   right architecture for a history filtering tool (see below).
+
+ I have read the reposurgeon documentation multiple times over the years,
+ and am almost at a point where I feel like I know how to get started with
+ it. I haven't had a need to convert a CVS or SVN repo in over a decade; if
+ I had such a need, perhaps I'd persevere and learn more about it. I
+ suspect it has some good ideas I could apply to filter-repo. But I haven't
+ managed to get started with reposurgeon, so clearly my impressions of it
+ should be taken with a grain of salt.
+
+ #### fast-export and fast-import
+
+ Finally, fast-export and fast-import can be used with a little editing of
+ the fast-export output to handle a number of history rewriting cases. I
+ have done this many times, but it has some
+ [drawbacks](https://public-inbox.org/git/CABPp-BGL-3_nhZSpt0Bz0EVY-6-mcbgZMmx4YcXEfA_ZrTqFUw@mail.gmail.com/):
+
+ * Dangerous for programmatic edits: It's tempting to use sed or perl
+   one-liners to e.g. try to modify filenames, but you risk accidentally
+   also munging unrelated data such as commit messages, file contents, and
+   branch and tag names.
+ * Easy to miss corner cases: for example, fast-export only quotes
+   filenames when necessary; as such, your attempt to rename a directory
+   might leave files with spaces or UTF-8 characters in their original
+   location.
+ * Difficult to directly provide higher level facilities: for example,
+   rewriting (possibly abbreviated) commit hashes in commit messages to
+   refer to the new commit hashes, or stripping of non-merge commits which
+   become empty or merge commits which become degenerate and empty.
+ * Misses a lot of pieces needed to round things out into a usable
+   tool.
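
The first two drawbacks can be reproduced without a repository. The sketch below runs two sed one-liners over a short hand-written fragment in the style of fast-export output; the commit, marks, and file names are invented for illustration, and the fragment omits most of what a real stream contains.

```shell
# Hand-written fragment in the style of `git fast-export` output;
# the commit, marks, and file names are made up for illustration.
stream='commit refs/heads/master
mark :3
committer A U Thor <author@example.com> 1563000000 +0000
data 31
Stop building docs under src/.
M 100644 :1 src/foo.c
M 100644 :2 "src/my notes.txt"'

# Attempt 1: blind substitution. Both paths are renamed, but so is the
# commit message, which only happened to mention src/.
naive=$(printf '%s\n' "$stream" | sed 's|src/|lib/|')

# Attempt 2: anchor on file-change lines. The message now survives, but
# the quoted path (it contains a space) no longer matches and stays in src/.
anchored=$(printf '%s\n' "$stream" | sed 's|^\(M [0-7]* :[0-9]* \)src/|\1lib/|')

printf '%s\n' '--- naive ---' "$naive" '--- anchored ---' "$anchored"
```

Neither attempt is right: the first edits data it should not touch, and the second silently leaves one file behind.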
+
+ However, fast-export and fast-import are the right architecture for
+ building a repository filtering tool on top of; they are fast, provide
+ access to almost all aspects of a repository in a very machine-parseable
+ format, and will continue to gain features and capabilities over time
+ (e.g. when replace refs were added, fast-export and fast-import immediately
+ gained support). To create a full repository surgery tool, you "just" need
+ to [combine fast-export and fast-import together with a whole lot of
+ parsing and
+ glue](https://github.com/newren/git-filter-repo#how-filter-repo-works),
+ which, in a nutshell, is what filter-repo is.
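
To give a feel for what "parsing" buys over a one-liner, here is a toy stream-aware filter. It is a sketch under heavy assumptions and not how filter-repo is implemented: `rename_src_to_lib` is a hypothetical helper, the input is a hand-written fragment rather than real fast-export output, and it only understands `data` headers and `M` (file-change) lines. It ignores deletes, renames, copies, binary blobs, and escape sequences inside quoted paths.

```shell
# Hand-written fragment in the style of `git fast-export` output (made-up data).
stream='commit refs/heads/master
mark :3
committer A U Thor <author@example.com> 1563000000 +0000
data 31
Stop building docs under src/.
M 100644 :1 src/foo.c
M 100644 :2 "src/my notes.txt"'

# rename_src_to_lib: hypothetical helper that moves src/ to lib/ on
# file-change lines only, copying `data` payloads through untouched.
rename_src_to_lib() {
  LC_ALL=C awk '
    # Inside a data payload: count bytes (line plus newline) and copy verbatim.
    remaining > 0 { remaining -= length($0) + 1; print; next }
    # A data header announces how many payload bytes follow.
    /^data [0-9]+$/ { remaining = $2 + 0; print; next }
    # File-change line: M <mode> <dataref> <path>, where path may be quoted.
    /^M [0-7]+ [^ ]+ / {
      head = $1 " " $2 " " $3 " "
      path = substr($0, length(head) + 1)
      if (path ~ /^"src\//)     path = "\"lib/" substr(path, 6)
      else if (path ~ /^src\//) path = "lib/" substr(path, 5)
      print head path
      next
    }
    { print }
  '
}

filtered=$(printf '%s\n' "$stream" | rename_src_to_lib)
printf '%s\n' "$filtered"
```

Because the filter tracks the byte count from the `data` header, the commit message passes through unchanged, and because it looks at the path field rather than the whole line, both the quoted and the unquoted path are renamed. Handling the rest of the stream grammar safely is where the "whole lot of parsing and glue" comes in.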
+
+ #### Upstream improvements
+
+ But to circle back to the question of improving existing tools, during the
+ development of filter-repo and its predecessor, lots of [improvements to
+ both fast-export and
+ fast-import](https://github.com/newren/git-filter-repo/tree/develop#upstream-improvements)
+ were submitted and included in git.git.
+
+ (Also, [filter-repo started in early 2009 as
+ git_fast_filter.py](https://public-inbox.org/git/[email protected]/)
+ and therefore technically predates both BFG and reposurgeon.)
+
+ ### Why not a builtin command?
+
+ One could ask why this new command is not written in C like most of git.
+ While that would have several advantages, it doesn't meet the necessary
+ design requirements. See the ["VERSATILITY" section of the
+ manpage](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L306-L326)
+ or see the "Versatility" section under the [Design Rationale of the
+ README](https://github.com/newren/git-filter-repo#design-rationale-behind-filter-repo-why-create-a-new-tool).
+
+ Technically, we could perhaps provide a mechanism for people to write
+ and compile plugins that a builtin command could load, but having users
+ write filtering functions in C sounds suboptimal, and requiring gcc for
+ filter-repo sounds more onerous than using Python.
+
+ ### Where to from here?
+
+ This was just a quick intro to filter-repo, and I've provided a lot of
+ links above if you want to learn more. Just a few more that might be of
+ interest:
+
+ * [Ramifications of repository
+   rewrites](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L340-L350),
+   including
+   [some](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L376-L410)
+   [tips](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L426-L431)
+   (not specific to filter-repo)
+ * [Finding big objects/directories/extensions (and renames) in your
+   repo](https://github.com/newren/git-filter-repo/blob/ae43a0ef6d2c7af8f38c5bba38ca0b22942463cf/Documentation/git-filter-repo.txt#L356-L361)
+   (can be used together with tools other than filter-repo too)
+ * [Creating new history rewriting tools](https://github.com/newren/git-filter-repo/tree/master/contrib/filter-repo-demos)
## Developer Spotlight: Jean-Noël Avila