I'm using GitPython to extract git commit log information from a large repo. Think 30,000+ commits.
If I do something like this...
import git

def author_names():
    repo = git.Repo('...')
    commits = list(repo.iter_commits('master'))
    for commit in commits:
        yield commit.author.name
GitPython will end up calling out to git once to get the list of commits, and then again for every single commit to read its details. This makes it very slow. It would be more efficient if there were a way to instruct GitPython to extract extended commit information for the whole list up front (perhaps in git.rev_list) instead of having it look up extended information on a commit-by-commit basis. I tried extracting the extra information by passing format=full to iter_commits, but that caused an error in the parsing code (assert len(hexsha) == 40, "Invalid line: %s" % hexsha).
Is there any way to do this efficiently? I totally understand if this is the sort of thing GitPython isn't really optimized for. For now, I'm just falling back to calling "git log" directly.
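For reference, here is a rough sketch of that fallback, driving git log through GitPython's own command wrapper so only a single process is spawned for the whole history (the %x1f field separator and the branch name are my own choices):

import git

repo = git.Repo('...')
# One git process for all 30,000+ commits: ask `git log` for just the
# fields we need, one commit per line, delimited by the %x1f byte.
for line in repo.git.log('master', '--format=%H%x1f%an').splitlines():
    sha, author = line.split('\x1f')
    print(author)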
varenc changed the title from "GitPython doesn't have an efficient way to extract all information" to "GitPython doesn't have an efficient way to extract information from all commits" on Jan 26, 2016.
For questions, I recommend using Stack Overflow along with the GitPython tag, as you will get answers much faster and from a bigger audience.
That said, GitPython streams all of its data and is indeed built for efficiency (as far as Python allows), and I don't believe there is a better way to do what you are doing with GitPython.
However, you can try switching the backend implementation, which might change the performance characteristics. Since I don't know which version of GitPython you are using, and the backend defaults have changed over time, you might want to try both backends explicitly:
import git

repo = git.Repo('.', odbt=git.GitCmdObjectDB)
# or: repo = git.Repo('.', odbt=git.GitDB)
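A quick way to compare the two is to time a full pass over the history under each backend; a minimal sketch (with '.' standing in for your repository path):

import time
import git

for odbt in (git.GitCmdObjectDB, git.GitDB):
    repo = git.Repo('.', odbt=odbt)
    start = time.perf_counter()
    # Accessing author forces each commit object to be read through
    # the chosen backend, not just the rev-list output.
    names = [c.author.name for c in repo.iter_commits('master')]
    print(f'{odbt.__name__}: {len(names)} commits in {time.perf_counter() - start:.2f}s')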
Thanks for your reply. I created an issue because I explored alternatives and saw that this was a limitation of GitPython. The choice of backend has no effect. GitPython is certainly built for efficiency in many ways, but it does not seem built for efficiently extracting extended commit information from a very large repo.
If there's interest, I could put together a few scripts that clearly demonstrate this and compare GitPython's performance with manually running a git log command and parsing the output, though that may well be out of scope.
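As a starting point, the git log side of that comparison could be as simple as the following (the format string is an assumption, and cwd should point at the repository):

import subprocess
import time

start = time.perf_counter()
out = subprocess.run(
    ['git', 'log', 'master', '--format=%an'],
    cwd='...',  # repository path
    capture_output=True, text=True, check=True,
).stdout
authors = out.splitlines()
print(f'{len(authors)} commits in {time.perf_counter() - start:.2f}s')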