
GitPython doesn't have an efficient way to extract information from all commits #378


Closed

varenc opened this issue Jan 26, 2016 · 2 comments

varenc commented Jan 26, 2016

I'm using GitPython to extract git commit log information from a large repo. Think 30,000+ commits.

If I do something like this...

```python
from git import Repo

def commit_authors():
    repo = Repo('...')
    commits = list(repo.iter_commits('master'))
    for commit in commits:
        yield commit.author.name
```

GitPython will end up calling out to git once to get the list of commits, and then again for every single commit. This makes it take a very long time. It would be more efficient if there were a way to instruct GitPython to extract extended commit information for the whole list of commits up front (perhaps in `git.rev_list`) instead of having it look up extended information on a commit-by-commit basis. I tried extracting extra information by passing `format='full'` to `iter_commits`, but that caused an error in the parsing code (`assert len(hexsha) == 40, "Invalid line: %s" % hexsha`).

Is there any way to do this efficiently? I totally understand if this is the sort of thing GitPython isn't really optimized for. For now I'm just falling back to calling `git log` directly.
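For reference, the `git log` fallback mentioned above can be sketched roughly like this. This is not part of GitPython's API; the function names are made up for illustration, and the `%x1f`/`%x1e` separators are a choice made here so that fields and records split unambiguously:

```python
import subprocess

# %x1f / %x1e are ASCII unit/record separators, chosen here so the
# output splits cleanly even if commit subjects contain odd characters.
FORMAT = "%H%x1f%an%x1f%s%x1e"

def parse_log(output):
    """Parse `git log --pretty=format:` output produced with FORMAT."""
    commits = []
    for record in output.split("\x1e"):
        record = record.strip("\n")
        if not record:
            continue
        sha, author, subject = record.split("\x1f")
        commits.append({"sha": sha, "author": author, "subject": subject})
    return commits

def all_commit_info(repo_path="."):
    """Extract hash, author, and subject of every commit in one git call."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{FORMAT}"],
        capture_output=True, text=True, check=True)
    return parse_log(out.stdout)
```

A single invocation like this stays fast even on a 30,000+ commit history, since the cost is one process spawn rather than one per commit.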

@varenc varenc changed the title GitPython doesn't have an efficient way to extract all information GitPython doesn't have an efficient way to extract information from all commits Jan 26, 2016
Byron (Member) commented Jan 30, 2016

For questions, I recommend using Stack Overflow along with the GitPython tag, as you will get answers much faster and from a bigger audience.

That said, GitPython streams all of its data and is indeed built for efficiency (as far as Python allows), and I don't believe there is a better way to do what you are doing with GitPython.

However, what you can try is switching the backend implementation, which might change the performance characteristics. As I don't know which version of GitPython you are using, and the backend defaults have changed between versions, you might want to try both backends explicitly:

```python
import git

repo = git.Repo('.', odbt=git.GitCmdObjectDB)
# or
repo = git.Repo('.', odbt=git.GitDB)
```

Byron closed this as completed Jan 30, 2016
varenc (Author) commented Feb 3, 2016

Thanks for your reply. I created this issue because I explored the alternatives and saw that this was a limitation of GitPython. The choice of backend has no effect. GitPython is certainly built for efficiency in many ways, but it does not seem built for efficiently extracting extended commit information from a very large repo.

If there's interest, I could create a few scripts that clearly demonstrate this and compare GitPython's performance with manually running a `git log` command and parsing the output, but this may well be out of scope.
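The shape of such a comparison could look like the following. This is a sketch, not GitPython's actual internals: the per-commit loop here just mimics the one-lookup-per-object behaviour described above by spawning one `git` process per commit, and all function names are invented for illustration:

```python
import subprocess
import time

def log_batch(repo):
    """All commit authors in a single git invocation."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%an"],
        capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def log_per_commit(repo):
    """One git invocation per commit, mimicking per-object lookups."""
    shas = subprocess.run(
        ["git", "-C", repo, "rev-list", "HEAD"],
        capture_output=True, text=True, check=True).stdout.split()
    return [
        subprocess.run(
            ["git", "-C", repo, "show", "-s", "--pretty=format:%an", sha],
            capture_output=True, text=True, check=True).stdout
        for sha in shas
    ]

def compare(repo):
    """Return (batch_seconds, per_commit_seconds) for the same extraction."""
    t0 = time.perf_counter()
    batch = log_batch(repo)
    t1 = time.perf_counter()
    per = log_per_commit(repo)
    t2 = time.perf_counter()
    assert batch == per  # both approaches must yield the same data
    return t1 - t0, t2 - t1
```

Pointed at a repository with tens of thousands of commits, the per-commit variant's process-spawn overhead dominates, which is the same gap being described between one big `git log` and per-commit lookups.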
