-
Notifications
You must be signed in to change notification settings - Fork 534
revlist is slow for non-local storage layers #909
Comments
@smola @erizocosmico @mcuadros or anyone else, any thoughts on this issue? I'd like to improve the performance here, but I'm hesitant to put a lot of effort into something that won't be accepted upstream, or which might be introducing bugs due to my minimal lack of git protocol knowledge. Thanks! |
@strib I think we would favor adopting the same or similar algorithm to what git and jgit have. We initially took this shortcut of doing two full object traversals, but both git and jgit seem to be quite smarter. You can search their code base for the terms An alternative/complementary approach is implementing support to generate pack bitmap indexes and use them when calculating reachable objects. Here is the format spec:
|
@smola thanks for the response and the suggestions. The bitmap index thing is interesting, though it might be a challenge for Keybase specifically since there's no real git "server" for us, just a bunch of distributed clients coordinating with one shared storage layer. Certainly it would be possible to use locks in a smart way to elect a single client to generate the map (which is basically what we do already for updating a reference), but it might be a bit hacky to drop that into the go-git code. When I (or someone else on our team) have time to work on this, we'll look closely at what the existing implementations do and try to adapt them to go-git. Thanks! |
@smola I did some research into what git and jgit do. git:For each reference commit on the remote, it marks that commit and its 5 immediate parents as "uninteresting": Then, for each of those uninteresting commits, it marks the entire tree and all blobs "uninteresting" (skipping those already marked of course): So overall this is similar to go-git except that it limits it to only 5 commits, rather than the full commit history. jgit:I'm less sure about my read here, but I think it's similar to the above, except that instead of 5 commits, it uses 500: Experiment:I tried a quick implementation that limited the commit depth to 5 in go-git, and ran it against a known bad test case with our distributed storage layer. The results were somewhat underwhelming -- a 10-30% reduction in time to "push" one new commit, but the overall time spent was still way larger than it we want. So this change alone is probably not good enough for us. But I can clean it up and submit it as a PR here if you think it would be useful. Other thoughts:In our case, it would be way faster to traverse the commit trees of the "remote refs" via the "remote" repo itself (which for our purposes, as explained above, is really the local git repo). So I think I'm going to explore some hacky way to allow the remote repo to get the ignore list that way instead. For example, we can make the revlist to take an optional second storage layer to use to find the ignore list, and if Thanks! |
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a secondary storage layer can be provided, to be used to finding the full set of objects to be ignored while finding the reachable objects. This is useful when the main `s` storage layer is slow and/or remote, while the ignore list is available somewhere local. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
This factors out some URL-parsing code from the transport layer so it can be used by config as well. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
I just put up #1066 for review, which resulted in a huge performance win for fetches/pull in Keybase git. I hope you'll consider merging it! |
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a secondary storage layer can be provided, to be used to finding the full set of objects to be ignored while finding the reachable objects. This is useful when the main `s` storage layer is slow and/or remote, while the ignore list is available somewhere local. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
This factors out some URL-parsing code from the transport layer so it can be used by config as well. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a secondary storage layer can be provided, to be used to finding the full set of objects to be ignored while finding the reachable objects. This is useful when the main `s` storage layer is slow and/or remote, while the ignore list is available somewhere local. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
This factors out some URL-parsing code from the transport layer so it can be used by config as well. Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
Issue: src-d#909 Signed-off-by: Jeremy Stribling <[email protected]>
Over in keybase/client#12730, we have a user of Keybase Git who noticed it can be very slow (and CPU-intensive) to fetch a single new commit from certain repos. Under the cover this is powered by a go-git
Remote.Push()
call, which uses the Keybase storage layer as theRemote.Storer
instance, and uses the local git checkout as the remote side of the connection. (So the "fetch" becomes a push from the Keybase repo to the local checkout.)The push is implemented by first getting the advertised reference heads from the local git checkout, and passing those commits as the ignore list to
revlist.Objects()
, along with the commit hashes being pushed out of Keybase repo. The performance problem seems to happen here, which walks the entire commit history and the entire directory tree for each of the commits, in order to build up a more complete ignore list for the real revlist computation a few lines below.This works acceptably fast when the storage layer is a local git repo, but when it's a distributed storage layer like KBFS, this can be very slow. And it seems unnecessary -- most of the ignored object IDs likely won't be used, since the tree-walking at the real computation will bail out at the first
seen
object it sees while walking the tree, which for most single-commit fetches will be most of the entries of the root directory.I think there is a better algorithm for this. I started playing around with one here, but it's pretty stupid and while it does definitely improves things for Keybase, it's not by as much as I think it should be able to.
I haven't worked out a complete solution yet, as I am hoping for dev feedback first. But I think the outline could be something like this:
revlist.Objects
, for each object inobjs
, find the most-recent parent commit ID inignore
. Call the set of these commitsparents
.parents
, compute agit diff-tree
-like algorithm against the corresponding commit inobjs
. (I'm not sure if this algorithm exists yet ingo-git
anywhere.) This would traverse and compare the directory entries between the two commits, and find the exact places they differ (traversing only as deeply as needed in each tree to figure out the differences).ignore
list, along with all theparents
.revlist
using this newignore
list,reachableObjects
should return early as soon as it reaches a commit in the ignore list, and the pending list is empty.I could be wrong, but I think this is a more efficient algorithm in all cases. Of course, it requires a bit more code complexity.
What do the devs think? cc: @smola @erizocosmico @mcuadros
The text was updated successfully, but these errors were encountered: