Some files are never closed #60
Comments
Thanks for reporting this. I suffered from similar problems in Windows tests (where open handles prevented the test cases from cleaning themselves up properly), and this definitely should be looked into. Until then, you might use the `GitCmdObjectDB` backend for your repository, which uses a different implementation.
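For anyone wanting to try that workaround, a minimal sketch might look like this (the repository path is a placeholder, and whether it avoids the leak depends on the GitPython version in use):

```python
# Minimal sketch: open the repository with the git-command based object
# database instead of the default (mmap-based) one. Path is a placeholder.
import git

repo = git.Repo("/path/to/repo", odbt=git.GitCmdObjectDB)
print(repo.head.commit.hexsha)
```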
I can confirm the issue on Fedora 17: there are *.pack and *.idx files left open, both as plain files and memory-mapped.
I think the lsof checks will have to go into the test suite too, in order to protect against regressions. Especially with something as vital (and limited) as system resources, I shouldn't leave anything to chance. Currently I have no ETA on this, but I am planning to redirect development time to git-python once again.
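A rough sketch of what such a regression check could look like (purely an illustration: the class name is made up, and counting entries in /proc/self/fd is Linux-only):

```python
import os
import unittest


def open_fd_count():
    # Linux-only: each entry in /proc/self/fd is one open file descriptor.
    return len(os.listdir("/proc/self/fd"))


class FdLeakCase(unittest.TestCase):
    """Base class that fails any test which leaves file descriptors open."""

    def setUp(self):
        self._fds_before = open_fd_count()

    def tearDown(self):
        self.assertEqual(open_fd_count(), self._fds_before,
                         "test leaked file descriptors")
```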
So, learning my way through the code, I ended up in smmap, in the function where the file gets opened. In there, I can see:

```python
if isinstance(path_or_fd, int):
    fd = path_or_fd
else:
    fd = os.open(path_or_fd, os.O_RDONLY|getattr(os, 'O_BINARY', 0)|flags)
#END handle fd
```

In the case of this bug report, a path is passed in, so the file gets opened here as file descriptor 3.

Then later on, in the same function, I can see:

```python
finally:
    if isinstance(path_or_fd, basestring):
        os.close(fd)
    #END only close it if we opened it
```

And here, file descriptor 3 is closed. So everything is good, right?

No, because in between, there is this:

```python
self._mf = mmap(fd, actual_size, **kwargs)
```

This maps the content of the file pointed to by file descriptor 3 into memory... and it also opens the file again, as file descriptor 4! It is this one which never gets closed.

Now let's consider the following:

```python
import mmap
import os

# Assumes a non-empty file named "hello" exists in the current directory.
raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                             os.getcwd()))
f = open("hello", "r+b")
raw_input("Check open files again (opened 'hello' as fd %s)" % f.fileno())
mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
raw_input("Check open files again (mmaped fd %s)" % f.fileno())
f.close()
raw_input("Check open files again (closed 'hello')")
mf.close()
raw_input("Check open files again (closed the mmap)")
```

This reproduces exactly the same behaviour: the file is opened as fd 3, mmap-ing its content opens the same file again as fd 4, closing the file closes fd 3, and finally closing the map also closes fd 4.

So what seems to be missing from smmap is a call that closes the memory map once it is no longer needed.
Thank you for making the effort of pointing out the cause of the issue so clearly. In any case, the memory map needs to remain open until the last user of the respective memory region fades away, and apparently this doesn't happen, maybe because someone keeps it open by keeping a reference to it; that at least would be my guess. AFAIK, when the mmap instance goes out of scope, it will close automatically, and maybe I am relying on this. Also, smmap apparently has a known issue with reference counting in some Python versions, which may also lead to the issue we see here.
I'd certainly appreciate it. :) Especially since it's not even possible to manually keep track of those files (by monkey-patching Python's `open`, for example), as they get opened at a lower level.
I tried spending some time yesterday figuring out where this should happen, but that's just too much for me.
Well, I tried destroying the repo object, but that didn't help. If a reference is kept on it, I really wonder where. :-/
So, I can confirm the issue on RHEL 6 (Python 2.6.6) and Fedora 16 (Python 2.7.2). And mscherer above confirmed it on Fedora 17, which has Python 2.7.3. So I'm not sure this is related to a specific Python version, or else we are extremely unlucky. :D
When I implemented the reference counting, I decided to try using Python's internal one, with sys.getrefcount(). However, I realized that I had to put some magic numbers into the code to account for references to my object that I didn't really have on my radar. In that moment, I might have opened the gates for plenty of bugs. Or I have a cyclic relation somewhere, even though I was very conscious about strong references and about who keeps pointers to whom. Anyway, it's buggy, and a real problem, as GitPython uses this engine heavily under the hood, and by default.
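To illustrate why those magic numbers are fragile (a toy example, not GitPython code): `sys.getrefcount()` always includes at least one temporary reference from the call itself, and every enclosing frame or container adds more, so the "expected" count shifts with context:

```python
import sys

x = object()
# getrefcount() counts its own argument as a temporary reference,
# so a freshly bound object already reports 2, not 1.
print(sys.getrefcount(x))


def observe(obj):
    # The parameter adds another reference while the function runs.
    return sys.getrefcount(obj)


print(observe(x))            # typically 3
holder = [x]                 # a container adds yet another reference
print(sys.getrefcount(x))    # typically 3
```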
Any news on this issue?
No, sorry. I also cannot currently assign any time to the project.
Any news? I'm getting the same error on Python 2.6.6 on Debian.
Wow, exactly one year ago I said I had no time, and sadly this still hasn't changed. However, next year I could possibly get to it; there are many things I want to improve and do with git-python.
That sounds great. Here's a simplified version of the code that caused the error:

After about 500 iterations it crashes with the "too many open files" error.
Is there any way of getting the latest commit for a directory or a file on a certain branch or tag besides using iter_commits? I'd rather use objects instead of parsing raw git output and making my own dictionary with the values.
@virgiliu My personal workaround is to run the GitPython stuff (at least the parts which trigger this bug) in different processes. :-/
@bochecha could you show me an example of that? |
@virgiliu: sure. I just hope @Byron doesn't mind that we're using his bug tracker to discuss like this. :)

So, in one of my projects I'm using the following solution. I haven't pushed it yet (mostly because I'm not entirely satisfied with it) so I can't just give you a link to the code. You can assume that the code below is MIT-licensed.

First, I have the following decorator:

```python
class subprocessed(object):
    """Decorator to run a function in a subprocess

    We can't use GitPython directly because it leaks memory and file
    descriptors:
    https://github.com/gitpython-developers/GitPython/issues/60

    Until it is fixed, we have to use multiple processes :(
    """
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        # Remember the instance the decorated method is bound to
        self.instance = instance
        return self

    def __call__(self, *args):
        from multiprocessing import Process, Queue

        q = Queue()

        def wrapper(queue, func, *args):
            queue.put(func(*args))

        p = Process(target=wrapper,
                    args=(q, self.func, self.instance) + args)
        p.start()
        p.join()

        return q.get()
```

Then, you just need to decorate a method with it, and it gets run as a subprocess! In my case, I have a class like this:

```python
import os
import shutil

import git


class Foobar(object):
    def __init__(self):
        self.git_rooturl = "git://git.example.com"
        self.workdir = "/path/to/where/the/repo/will/be/cloned"

    def do_stuff_on_repo(self, reponame):
        curdir = os.getcwd()

        # Clone the module
        repo = git.Repo.clone_from("%s/%s" % (self.git_rooturl, reponame),
                                   os.path.join(self.workdir, reponame))
        os.chdir(repo.working_tree_dir)

        # Do your stuff here
        result = ...

        os.chdir(curdir)
        shutil.rmtree(repo.working_tree_dir)
        del repo

        return result
```

Unfortunately, `do_stuff_on_repo` is exactly the kind of method that hits this bug. As a result, I'm just defining it in this way:

```python
@subprocessed
def do_stuff_on_repo(self, reponame):
    ...
```

All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaked memory and fds get cleared when the process exits.

The only tricky part which is not handled above is what to do in case of an error: if the decorated function raises an exception in the subprocess, nothing ever gets put on the queue, so `q.get()` in the parent will block forever. How to handle that properly is left as an exercise to the reader. (because that's the part I'm not very happy with :P)
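For what it's worth, here is one possible shape for that error handling (a standalone sketch rather than a patch to the decorator above; it assumes a fork start method so the nested function can be handed to the child, and that the raised exception is picklable):

```python
from multiprocessing import Process, Queue


def run_in_subprocess(func, *args):
    """Run func(*args) in a child process, returning its result or
    re-raising its exception in the parent."""
    def wrapper(queue):
        try:
            queue.put((True, func(*args)))
        except Exception as exc:
            # The exception instance must be picklable to cross the queue.
            queue.put((False, exc))

    q = Queue()
    p = Process(target=wrapper, args=(q,))
    p.start()
    ok, value = q.get()   # read before join() so a large result can't deadlock
    p.join()
    if ok:
        return value
    raise value
```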
@bochecha Great stuff. 👍
Bumping this. I just spent a ton of time tracking this down. It's still there :-)
+1
This is definitely making gitpython unusable for projects like grokmirror that must be able to operate on thousands of repositories. As this bug is now 2+ years old, it doesn't look like it'll be fixed at this point. :(
Ahhh, I am having the same issue. Would be nice to see a fix. :)
Is anyone working on this bug?
The issue presented by @virgiliu seems to be fixed now (with the latest version of smmap 0.9.0), but by artificially constraining the file limit to 64, I can still see that pipes are leaked.
For example, it's possible to run all tests with the open-file limit constrained in that way.
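In case it's useful, a sketch of how such a limit can be set from Python before a test run (roughly what `ulimit -n 64` does in the shell; Unix-only, and assuming the hard limit is at least 64):

```python
# Lower the soft limit on open file descriptors for this process and its
# children, so fd leaks surface quickly instead of after thousands of repos.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))
```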
It turned out I forgot to add a tear-down superclass call, which was responsible for most of the test leakage.
Fantastic, thanks!
I am still seeing this problem in version 1.0.1. Here is a little test script:

```python
import git

repo = git.Repo('./')
for k in repo.iter_commits(max_count=25):
    print k
```

At the same time, look at the number of open files with `lsof`.
I think it will work if you use the latest development version. If this works for you, I'd be glad if you could let me know here as well.
I have some reports of the same issue in earwig/git-repo-updater#9, I think. Will ask people if GitPython HEAD fixes it. |
using GitCmdObjectDB when initiating the repo (suggestion of the library creator; gitpython-developers/GitPython#60 (comment); in the future it may be better to use a newer version of the library)
I'm working on a script using GitPython to loop over lots of Git repos, pull remote changes, etc...
At some point, I noticed that calling `repo.is_dirty()` was opening some files which never got closed until the process exits, which in turn causes the tool to crash with "too many open files" after iterating over enough repos. I tried digging deeper down the GitPython stack, but couldn't find the actual cause.
In case that's helpful, below is the same as the previous snippet, but using the lowest-level gitdb objects I could find to open the files directly:
But that's just where the files are opened (well, in fact they are opened even lower down, but I didn't manage to go deeper), not where they are supposed to be closed.
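For illustration, a loop like the following exhibits the behaviour described above (a sketch rather than the reporter's original script; the repository layout under /tmp/repos and the /proc-based fd count are assumptions, the latter being Linux-only):

```python
import os

import git

# Hypothetical layout: one clone per sub-directory under /tmp/repos.
base = "/tmp/repos"
for name in sorted(os.listdir(base)):
    repo = git.Repo(os.path.join(base, name))
    repo.is_dirty()
    del repo
    # The open-fd count keeps growing while the leak is present.
    print("%s: %d open fds" % (name, len(os.listdir("/proc/self/fd"))))
```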