
GitPython doesn't have an efficient way to extract information from all commits #378


Closed

varenc opened this issue Jan 26, 2016 · 2 comments

varenc commented Jan 26, 2016

I'm using GitPython to extract git commit log information from a large repo. Think 30,000+ commits.

If I do something like this...

```python
from git import Repo

def commit_authors():
    repo = Repo('...')
    commits = list(repo.iter_commits('master'))
    for commit in commits:
        yield commit.author.name
```

GitPython will end up calling out to git once to get the list of commits, and then again for every single commit. This makes it take a very long time. It would be more efficient if there were a way to instruct GitPython to extract extended commit information for the whole list of commits up front (perhaps in `git.rev_list`) instead of having it look up extended information on a commit-by-commit basis. I tried extracting extra information by passing `format='full'` to `iter_commits`, but that caused an error in the parsing code (`assert len(hexsha) == 40, "Invalid line: %s" % hexsha`).

Is there any way to do this efficiently? I totally understand if this is the sort of thing GitPython isn't really optimized for. For now I'm just falling back to calling `git log` directly.
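For reference, the `git log` fallback mentioned above can be sketched roughly like this. This is not part of GitPython's API; the function names are made up for illustration, and the `%x1f`/`%x1e` separators are a choice made here so that fields and records split unambiguously:

```python
import subprocess

# %x1f / %x1e are ASCII unit/record separators, chosen here so the
# output splits cleanly even if commit subjects contain odd characters.
FORMAT = "%H%x1f%an%x1f%s%x1e"

def parse_log(output):
    """Parse `git log --pretty=format:` output produced with FORMAT."""
    commits = []
    for record in output.split("\x1e"):
        record = record.strip("\n")
        if not record:
            continue
        sha, author, subject = record.split("\x1f")
        commits.append({"sha": sha, "author": author, "subject": subject})
    return commits

def all_commit_info(repo_path="."):
    """Extract hash, author, and subject of every commit in one git call."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{FORMAT}"],
        capture_output=True, text=True, check=True)
    return parse_log(out.stdout)
```

A single invocation like this stays fast even on a 30,000+ commit history, since the cost is one process spawn rather than one per commit.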

@varenc varenc changed the title GitPython doesn't have an efficient way to extract all information GitPython doesn't have an efficient way to extract information from all commits Jan 26, 2016
Byron (Member) commented Jan 30, 2016

For questions, I recommend using Stack Overflow along with the GitPython tag, as you will get answers much faster and from a bigger audience.

That said, GitPython streams all of its data and is indeed built for efficiency (as far as Python allows), and I don't believe there is a better way to do what you are doing with GitPython.

However, what you can try is switching the backend implementation, which might change the performance characteristics. As I don't know which version of GitPython you are using, and the backend defaults have changed between versions, you might want to try both backends explicitly:

```python
import git

repo = git.Repo('.', odbt=git.GitCmdObjectDB)
# or
repo = git.Repo('.', odbt=git.GitDB)
```

Byron closed this as completed Jan 30, 2016
varenc (Author) commented Feb 3, 2016

Thanks for your reply. I created this issue because I explored the alternatives and saw that this was a limitation of GitPython. The choice of backend has no effect. GitPython is certainly built for efficiency in many ways, but it does not seem built for efficiently extracting extended commit information from a very large repo.

If there's interest, I could create a few scripts that clearly demonstrate this and compare GitPython's performance with manually running a `git log` command and parsing the output, but this may well be out of scope.
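The shape of such a comparison could look like the following. This is a sketch, not GitPython's actual internals: the per-commit loop here just mimics the one-lookup-per-object behaviour described above by spawning one `git` process per commit, and all function names are invented for illustration:

```python
import subprocess
import time

def log_batch(repo):
    """All commit authors in a single git invocation."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=format:%an"],
        capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def log_per_commit(repo):
    """One git invocation per commit, mimicking per-object lookups."""
    shas = subprocess.run(
        ["git", "-C", repo, "rev-list", "HEAD"],
        capture_output=True, text=True, check=True).stdout.split()
    return [
        subprocess.run(
            ["git", "-C", repo, "show", "-s", "--pretty=format:%an", sha],
            capture_output=True, text=True, check=True).stdout
        for sha in shas
    ]

def compare(repo):
    """Return (batch_seconds, per_commit_seconds) for the same extraction."""
    t0 = time.perf_counter()
    batch = log_batch(repo)
    t1 = time.perf_counter()
    per = log_per_commit(repo)
    t2 = time.perf_counter()
    assert batch == per  # both approaches must yield the same data
    return t1 - t0, t2 - t1
```

Pointed at a repository with tens of thousands of commits, the per-commit variant's process-spawn overhead dominates, which is the same gap being described between one big `git log` and per-commit lookups.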
