Skip to content
This repository was archived by the owner on Sep 11, 2020. It is now read-only.

Create merge-base feature #1097

Merged
merged 1 commit into from
Jun 3, 2019
Merged

Conversation

dpordomingo
Copy link
Contributor

@dpordomingo dpordomingo commented Mar 25, 2019

suggested by #1078

blocked by src-d/go-git-fixtures#13
blocks src-d/lookout#530

This PR adds some features from git merge-base command.

note:
you can see more details about this PR, alternatives and performance concerns in the comments below

Iit was implemented a new walker filterCommitIter (like bfsCommitIterator one), and three functions MergeBase, IsAncestor and Independents.

  • plumbing/object.filterCommitIter

    it's a CommitIter that walks the commit history (in Breadth-first order) of the passed commit. It only returns the commits that validate the passed isValid CommitFilter, and only traverse the commits that do not validate isLimit CommitFilter.

  • git.MergeBase(first, second *object.Commit) ([]*object.Commit, error)

    it mimics the behavior of git merge-base a b, returning the best common ancestor of the two passed commits; the best common ancestors cannot be reached from other common ancestors.

  • git.IsAncestor(first, second *object.Commit) (bool, error)

    it returns true if the first commit is an ancestor of the second one, like git merge --is-ancestor does

  • git.Independents(commits []*object.Commit) ([]*object.Commit, error)

    it returns a subset of the passed commits, that are not reachable from any other, as git merge-base --independent commit... does

@dpordomingo
Copy link
Contributor Author

alternatives

(A) plumbing/object.filterCommitIter could be replaced by the already existent plumbing/object.bfsCommitIterator:

  • pros: less new code,
  • cons:
    • the functionality of isLimit and isValid would be lost, and it should be implemented again and again for those who need it (like in git.MergeBase, but also in any other cases where it could be needed)
    • considering the performance analysis done below, imo the proposed improvement could not be done (because current plumbing/object.bfsCommitIterator does not let you remove items from the iter queue once they have been added, as isLimit functionality could do).

(B) since plumbing/object.bfsCommitIterator is a special subcase of plumbing/object.filterCommitIter, it could be (safely) deleted, and plumbing/object.NewCommitIterBSF could be rewritten using the new iter, as done here dpordomingo@ab3f18f

  • pros: less code,
  • cons: idk if there could be some tricky use case of current plumbing/object.bfsCommitIterator not being documented nor tested in current tests; example: the async usage of "seenExternal", as it was added as a test-case in dpordomingo@ab3f18f, and it would be also supported with that new implementation.

I'd vote for (B), but it could be done in a subsequent PR, or if required during the revision.

@dpordomingo
Copy link
Contributor Author

Explaining recursive behavior

given git.Independents([]*object.Commit) ([]*object.Commit, error) relies on a recursive function, @jfontan exposed his concerns about its performance at src-d/lookout#530 (comment).

Here is the explanation about how it works 👇

The history can potentially be walked as many times as candidates provided for Independents.
But every time a candidate is found as an ancestor of any other, it is removed from the candidates' list, so it won't be considered anymore.
The order in which you request the candidates no longer affects how many times the history is traversed because they're previously sorted by Committer.When desc, assuming that older commits most probably won't be descendant of newer ones.

In the worst scenario, the history of each candidate will be fully traversed as many times as orphan branches are included in the candidates to be analyzed.

given this history:

          o----o----B~2---B^---B
         /         /
--A~4---o---A~2---o---A

---o---C^---C

then (assuming that topological order matches the historical order):

  • (A) Independents(B, B^); 1step; walks the history only once, because starting by B, it reaches in one step B^; once B^ is reached, it is removed from the candidates list, and since there only one candidate left (B), it is returned with no more steps.
  • (B)Independents(B^, B); 1step, because ordering the candidates by Committer.When desc, will convert them as being (A);
  • (C)Independents(B, B^, B~2); 2steps; walks only once (as in the previous example), but it requires two steps to reach B^ and then B~2
  • (D)Independents(B, A~4); 7steps; walks the history only once (as in the previous example), but it will require more steps to reach A~4
  • (E) Independents(B, A); 1full+2steps; during first walk from B it won't be reached A, and during the second walk from A, it will be aborted once A~2 is reached, because, beyond that commit, all the history is already known.
  • (F) Independents(B, B^, B~2, A, A~2, A~4, C, C^); 2full+2steps; during the first history walk from B it will be removed B^, B~2, A~2, A~4 from the candidates list, so the second iteration will consider only B, A, C, C^; then from A the walk will be stopped once reached A~2 as in (E); the third iteration, starting from C will find (and remove) C^ from the list, so the result will be only A, B and C.

assuming that topological order DOES NOT match the historical order for B and B^:

  • (G), 1full + 1step; walks the history twice, because first iteration walking from B^ does not reach B after a full walk from B^; then in the second iteration walking from B it is the same case as (A), so it is returned A with only one extra step

@dpordomingo
Copy link
Contributor Author

dpordomingo commented Mar 25, 2019

Performance "benchmarks"

As proposed by @jfontan src-d/lookout#530 (comment), I did some tests:

Running over kubernetes/kubernetes and torvalds/linux

The time needed to calculate the common ancestor with go-git stays as expected and explained above, and in edge-cases, it is almost the same than the time required to do a log between the involved commits:

  • when one of the passed commits is an ancestor of the other, the time needed depends on its distance;
  • when the passed commits are Independents, it is needed one full history scan from the newer, and the process takes as much as go-git needs to iterate over all its ancestors;

On kubernetes/kubernetes and torvalds/linux, doing a full traverse of its histories from master:

repo master ancestors list all list only one
kubernetes/kubernetes 76018 commits ~5s ~0.03s
torvalds/linux 812644 commits ~60s ~5s

git log master tooked ~10s over torvalds/linux, and almost ~1s over kubernetes/kubernetes.

Testing real PRs from kubernetes/kubernetes:

I tried go-git/_examples/merge_base command over the last 245 merged PR from kubernetes/kubernetes (there is no valid PRs in linux repo), to calculate the merge-base between their base and head:

  • 140 PRs required a full traverse, taking ~5s,
  • 105 PRs were analyzed almost instantaneously in ~.3s, without a full history traverse.

Conclusions and possible improvements:

When the passed commits are Independents it is needed to fully traverse the history at least once.

There is a possible improvement to avoid it:

  • Stop walking the history of the ancestors of the common ancestors already found.

If you want, it could be done in this or in a subsequent PR.

@dpordomingo
Copy link
Contributor Author

Travis is failing https://travis-ci.org/src-d/go-git/jobs/511125046#L668 because the new test cases depend on new fixtures, to be merged by src-d/go-git-fixtures#13

@dpordomingo
Copy link
Contributor Author

AppVeyor failed for the same reason; now that src-d/go-git-fixtures#13 was merged it should pass, but I can not restart the build.

@mcuadros
Copy link
Contributor

mcuadros commented Apr 2, 2019

I have some doubts about the API, since go-git, is getting being some times is hard to find if X functionality is implemented or not. How this PR is done, the functions are spare, in a big package, some my I want to consider 2 different options:

  • Move the functions to be methods of Commit, fits perfect for MergeBase, IsAncestor, not for Independents
  • Move the functions to a new package, called merge or something similar

/cc @jfontan @smola

@dpordomingo
Copy link
Contributor Author

dpordomingo commented Apr 3, 2019

I agree with your @mcuadros and both proposals would fit me, so just let me know.
About keeping Independents in a different package than the two others... I don't like it; I'd move all at once
Would it be ok if moved to plumbing/mergebase or similar?

@smola
Copy link
Collaborator

smola commented Apr 3, 2019

merge or mergebase package names look too specific to hold all MergeBase, IsAncestor and Independents, since the later two are more general.

@mcuadros
Copy link
Contributor

mcuadros commented Apr 7, 2019

@smola what about the other option?

@dpordomingo
Copy link
Contributor Author

@smola I removed the recursive call at 5e0de4e

@jfontan
Copy link
Contributor

jfontan commented Apr 9, 2019

I agree with @mcuadros that MergeBase and IsAncestor belongs to Commit. Independents belongs to plumbing. Maybe plumbing/object.

@yelirekim
Copy link
Contributor

Is this work abandoned, or likely to move forward? I'm trying to decide whether to fork and merge this, or if it's likely to end up in the upstream sometime soon.

Is all that's left here to decide on where the package lines are drawn?

@dpordomingo
Copy link
Contributor Author

Not dead on my side ;)
I'll resume the work now that my vacations finished. If the maintainers agree :)

@dpordomingo
Copy link
Contributor Author

Thanks for your suggestions!

Done as you recommended during your code review:
All the new functionality was moved ef3c257 into plumbing/object:

  • MergeBase and IsAncestor will belong to plumbing/object.Commit
  • Independents will be an independent function in plumbing/object

Also, I refactored IsAncestor to be more performant 52ede6c

should I address any other issue?

I should squash some commits in order to have a cleaner history, so let me know before you merge.

@mcuadros
Copy link
Contributor

mcuadros commented Jun 3, 2019

Yes please clean the history

Signed-off-by: David Pordomingo <[email protected]>
@dpordomingo
Copy link
Contributor Author

@mcuadros squashed in one commit f62852b

@mcuadros mcuadros merged commit 37b8072 into src-d:master Jun 3, 2019
@ces131
Copy link

ces131 commented Jun 3, 2019

Is there going to be a new release soon containing this merge?? (If not, there definitely should be!)

@dpordomingo
Copy link
Contributor Author

I'll also rebase #1096 on top of current go-git, now that this has been merged.
Imo it could be useful if it would be released at the same time that this change.

@dpordomingo dpordomingo deleted the merge-base-core branch June 3, 2019 18:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants