bpo-17560: Too small type for struct.pack/unpack in multiprocessing.Connection #10305
Conversation
Side note: currently on Windows we use DWORD for the multiprocessing.Process output size, which, as far as I understand, is the equivalent of !Q
Hi, and thanks for trying to solve this. You're on the right track here, but we should maintain compatibility with older versions (one should be able to set up a Listener with one Python version and a Client with another, though that's probably uncommon). One way to do that is outlined in my message below (simply put: use a special value of the 32-bit size field to introduce a larger 64-bit size field):
https://bugs.python.org/issue17560#msg185345
Also, please change the issue number reference to the non-closed issue :)
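For context, here is a minimal sketch of the scheme that message describes, not the actual patch in this PR; the helper names are made up for illustration. A payload that fits in the legacy signed 32-bit header is sent exactly as before, while a larger payload is announced by the sentinel value -1 followed by an unsigned 64-bit length.

```python
import struct

_MAX_INT32 = 0x7fffffff  # largest length the legacy "!i" header can carry

def pack_size_header(n):
    """Build the length header for a payload of n bytes (illustrative helper)."""
    if n > _MAX_INT32:
        # Sentinel -1 tells a new receiver that a 64-bit length follows.
        return struct.pack("!i", -1) + struct.pack("!Q", n)
    # Small payloads keep the old single 32-bit header, so old receivers
    # continue to work unchanged.
    return struct.pack("!i", n)

def unpack_size_header(read):
    """Read the payload length via a read(k) callable returning exactly k bytes."""
    size, = struct.unpack("!i", read(4))
    if size == -1:
        size, = struct.unpack("!Q", read(8))
    return size
```

An old receiver only breaks when a new sender actually has to send more than 2 GiB, which is why this keeps the common case compatible across Python versions.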
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
On Windows it's
thanks for the note and description
In one terminal, run the listener side; in the other, run the client side.
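For illustration, a cross-version check along those lines could look like the sketch below; the address, authkey, and payload size here are assumptions, not the exact snippets from the review.

```python
# listener.py -- run this with one Python version in the first terminal.
from multiprocessing.connection import Listener

with Listener(('localhost', 6000), authkey=b'secret') as listener:
    with listener.accept() as conn:
        data = conn.recv_bytes()          # exercises the size header on receive
        print('received', len(data), 'bytes')
```

```python
# client.py -- run this with another Python version in the second terminal.
from multiprocessing.connection import Client

with Client(('localhost', 6000), authkey=b'secret') as conn:
    conn.send_bytes(b'x' * (2 * 1024 ** 3 + 1))   # payload just over 2 GiB
```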
thanks, I will work on that
OK, so I gave the -1 approach a try, and it seems it will not support the different-versions case either.
OK, I guess you meant that the -1 approach will keep backward compatibility for cases where users don't hit the 2 GB size threshold.
Thanks for making the requested changes! @pitrou: please review the changes made to this pull request.
Thank you @ahcub! I will merge this now.
thanks!
I was stuck with a Python version < 3.8, since a tool I have to use is not compatible with higher versions, so I tried to make the changes to connection.py as instructed above. However, I found the overhead is ridiculously big. I'm working with a pandas DataFrame. The table is so big that I have to read it in chunks. For certain reasons I did not use pd.read_csv with chunksize; instead, I used a custom generator based on the built-in read. I then pass every chunk of the DataFrame to a function that filters rows based on the values of certain columns, using pool.apply_async() wrapped in a lambda and the built-in map(). The reason I did not use map_async() is that I need to pass multiple parameters to the function.

I then added timestamps to the logging and found that the time between yielding a chunk DataFrame and the start of the function execution is nearly 1 minute. This is totally unacceptable. I have 10 million rows in the table in total, and my generator yields only 5000 lines at a time. I tried the same script on a table with only 100 thousand rows, yielding 5000 lines each time, and there the time between yielding a chunk and the start of the function execution is about 10 ms. Why does the total size of the file I'm reading with the generator matter so much here? Why does passing a 5000-line table (only 15 columns) to a function take that much time?
Hi @yangyxt, I would love to help, but I would probably need to see the code to advise anything meaningful.
Notice that read_bam_qname_groups() is a self-defined generator that reads big tables in chunks, yielding 5000 lines at a time, and func_return_unit_df is a placeholder for the function used to process each 5000-line chunk DataFrame. Since I need to pass *args here, I have to use the built-in map(), a lambda, and pool.apply_async() instead of just pool.map_async(). This is how I process things after getting the "results" iterator. So the workflow is: read a big file in chunks, pass each chunk to func_return_unit_df, then write the processed result to an output file in append mode. Doing it this way keeps the script from eating too much memory. Furthermore, I put a logging command in the self-defined read_bam_qname_groups generator, and another at the beginning of func_return_unit_df. Here is the key part: I tested the code on a table with 10 million rows.
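A hedged sketch of that workflow as described; read_bam_qname_groups, func_return_unit_df, the file names, the column names, and the extra arguments are stand-ins, not a reconstruction of the real code.

```python
import multiprocessing as mp
import pandas as pd

def read_bam_qname_groups(path, chunk_size=5000):
    """Stand-in generator: yield the table in chunks of chunk_size rows."""
    for chunk in pd.read_csv(path, sep='\t', chunksize=chunk_size):
        yield chunk

def func_return_unit_df(chunk, min_quality, keep_columns):
    """Stand-in worker: filter rows of one chunk based on column values."""
    return chunk.loc[chunk['quality'] >= min_quality, keep_columns]

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        # apply_async wrapped in a lambda + map(), as described, so extra
        # positional arguments can be passed alongside each chunk.
        async_results = list(map(
            lambda chunk: pool.apply_async(func_return_unit_df,
                                           (chunk, 30, ['qname', 'flag', 'pos'])),
            read_bam_qname_groups('big_table.tsv')))
        # Append each filtered chunk to the output file as it completes.
        for res in async_results:
            res.get().to_csv('filtered.tsv', sep='\t', mode='a',
                             header=False, index=False)
```

Each chunk still has to be pickled and shipped to a worker process, so the per-chunk transfer cost is what the chunk-size discussion below is about.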
The first thing I would recommend is to increase the chunk size to something like 100k lines per chunk, since you will most likely waste more time on the data transfer than on the parsing of a chunk. But even with 5k rows per chunk, I'm not sure what is going on there that makes it run for a minute. Another thing: if your files are only 10 million rows long and 15 columns wide, you can simply read the file as is and skip the chunking altogether; on my machine that finishes in 11 seconds, which doesn't seem long. I can share the code I used for my tests, in case it is helpful for you.
Not sure if I answered your question or not, but I hope this is helpful.
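As a rough illustration of that suggestion (the file name and separator are assumptions, and this is not the test code mentioned above), one can compare a single full read against much larger chunks:

```python
import time
import pandas as pd

start = time.monotonic()

# Option 1: read the whole table at once (feasible for ~10M rows x 15 columns).
df = pd.read_csv('big_table.tsv', sep='\t')
print('full read:', time.monotonic() - start, 'seconds,', len(df), 'rows')

# Option 2: if chunking is still needed, use much larger chunks (e.g. 100k rows)
# so the per-chunk transfer overhead is amortized over more work.
start = time.monotonic()
n_rows = 0
for chunk in pd.read_csv('big_table.tsv', sep='\t', chunksize=100_000):
    n_rows += len(chunk)
print('chunked read:', time.monotonic() - start, 'seconds,', n_rows, 'rows')
```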
Dear ahcub, I also had no clue about this issue. I found an article here https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle but I don't have enough background knowledge to understand it. Does it remind you of anything? Another thing that confuses me a lot: if a 5000-row DataFrame is small, why does passing 5000 rows through the pool object trigger this error? https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647 BTW, I ran this Python script via the PBS system on an HPC facility, with the pool's process count set to the number of vacant CPUs minus 1. Thanks again for your time!
This is probably because the Python version on the system is not updated. But to answer how 5k rows can be more than 2 GB, I would probably need to see an example of the data. For 100 million rows, or any other amount, I would recommend making chunks of about 100-500 MB (uncompressed). If you want, we can go through the code and the data on a call and try to fix the issue you are having together.
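One hedged way to check whether a chunk is really approaching the 2 GB limit behind that struct.error; the DataFrame here is a made-up stand-in for a real chunk.

```python
import pickle
import pandas as pd

# Stand-in chunk; in practice this would be one chunk yielded by the generator.
chunk = pd.DataFrame({'col%d' % i: range(5000) for i in range(15)})

payload = pickle.dumps(chunk, protocol=pickle.HIGHEST_PROTOCOL)
print('pickled chunk size:', len(payload), 'bytes')

# multiprocessing's legacy wire format stores the length in a signed 32-bit
# "!i" field, so anything over 2**31 - 1 bytes triggers that struct.error
# on Python versions without the 64-bit fallback.
print('exceeds the 32-bit limit:', len(payload) > 2**31 - 1)
```

If the pickled size is nowhere near 2**31 - 1 bytes, the 5000-row chunk itself is unlikely to be what trips the limit.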
Dear ahcub,
https://bugs.python.org/issue17560