
bpo-17560: Too small type for struct.pack/unpack in multiprocessing.Connection #10305


Merged

20 commits merged into python:master on Nov 6, 2018

Conversation

@ahcub (Contributor) commented Nov 3, 2018

@ahcub (Contributor, Author) commented Nov 3, 2018

Side note: currently on Windows we use DWORD for the multiprocessing.Process output size, which is, as far as I understand, equivalent to !Q.

@pitrou (Member) left a comment


Hi, and thanks for trying to solve this. You're on the right track here, but we should maintain compatibility with older versions (one should be able to set up a Listener with one Python version and a Client with another, though that's probably uncommon). One way to do that is outlined in the message linked below (simply put: use a special value of the 32-bit size field to introduce a larger 64-bit size field):
https://bugs.python.org/issue17560#msg185345
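
A minimal sketch of that scheme (illustrative only, not the final patch; pack_header, unpack_header and the -1 sentinel are just example names/choices) could look like:

import struct

_SENTINEL = -1  # special value of the 32-bit size field

def pack_header(n):
    """Build the length header for an n-byte message."""
    if n <= 0x7fffffff:
        # payload fits in 32 bits: keep the old header untouched
        return struct.pack("!i", n)
    # new path: the sentinel tells a new-style reader that a 64-bit size follows;
    # an old reader never sees it, because an old sender cannot produce >2 GiB messages
    return struct.pack("!i", _SENTINEL) + struct.pack("!Q", n)

def unpack_header(read):
    """read(k) must return exactly k bytes; returns the payload size."""
    n, = struct.unpack("!i", read(4))
    if n == _SENTINEL:
        n, = struct.unpack("!Q", read(8))
    return n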

Also, please change the issue number reference to the non-closed issue :)

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@pitrou (Member) commented Nov 3, 2018

Side note: currently on Windows we use DWORD for the multiprocessing.Process output size, which is, as far as I understand, equivalent to !Q.

On Windows it's PipeConnection that's used by default. Unfortunately many Windows APIs (such as ReadFile) are not 64-bit clean internally. We would have to introduce message chunking. But that can be done in a later PR if there's a Windows developer who's interested in that.
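
To illustrate, the chunking could look roughly like this (a platform-agnostic toy sketch, not the PipeConnection code; iter_chunks and _CHUNK_LIMIT are made-up names):

_CHUNK_LIMIT = 0x7fffffff  # largest length a 32-bit signed size can describe

def iter_chunks(buf, limit=_CHUNK_LIMIT):
    """Yield memoryview slices of buf, each at most `limit` bytes."""
    view = memoryview(buf)
    for offset in range(0, len(view), limit):
        yield view[offset:offset + limit]

# a sender would issue one WriteFile-sized call per slice of iter_chunks(payload);
# the receiver reads and reassembles the pieces in order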

@ahcub (Contributor, Author) commented Nov 3, 2018

Thanks for the note and the explanation.
I understand what you mean, but I'm not sure how this use case is possible. Can you show an example of how to set up a multiprocessing Client and Listener on different Python versions?

@pitrou (Member) commented Nov 3, 2018

In one terminal:

Python 3.7.1 (default, Oct 23 2018, 19:19:42) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing.connection import Listener    
>>> listener = Listener(address=('127.0.0.1', 3456))
>>> conn = listener.accept()  # will wait for inbound connection
>>> conn.send(b'foo')

In the other:

Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing.connection import Client
>>> conn = Client(address=('127.0.0.1', 3456))
>>> conn.recv()   # will wait for incoming message
b'foo'

@ahcub (Contributor, Author) commented Nov 3, 2018

Thanks, I will work on that.

@serhiy-storchaka changed the title from "bpo-35152 too small type for struct.pack/unpack in multiprocessing.Connection" to "bpo-17560: Too small type for struct.pack/unpack in multiprocessing.Connection" on Nov 5, 2018
@ahcub (Contributor, Author) commented Nov 6, 2018

OK, so I gave -1 a try, and it seems it will not support the different-versions case either.
The result is basically the same as connecting a version with the !Q change to an old one: the connection hangs.
I've been thinking about it, and I'm not sure there is a solution that supports connections between different versions at all.

@ahcub (Contributor, Author) commented Nov 6, 2018

OK, I guess you meant that the -1 approach will preserve backward compatibility for cases where users don't hit the threshold of messages over 2 GB.
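
To make that concrete, here is a tiny self-contained check of that compatibility argument (old_header and new_header are illustrative names, not the actual patch):

import struct

def old_header(n):
    return struct.pack("!i", n)

def new_header(n):
    if n <= 0x7fffffff:
        return struct.pack("!i", n)
    return struct.pack("!i", -1) + struct.pack("!Q", n)

# byte-identical for anything an old peer could have sent, so mixed
# versions keep working as long as nobody sends a message over 2 GiB
assert old_header(1024) == new_header(1024)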

@bedevere-bot

Thanks for making the requested changes!

@pitrou: please review the changes made to this pull request.

@pitrou (Member) commented Nov 6, 2018

Thank you @ahcub! I will merge this now.

@pitrou merged commit bccacd1 into python:master on Nov 6, 2018
@ahcub (Contributor, Author) commented Nov 6, 2018

thanks!

@yangyxt commented May 24, 2020

I'm stuck on a Python version < 3.8 because a tool I have to use is not compatible with newer versions, so I tried making the changes to connection.py as described above. However, I found the overhead is ridiculously large.

I'm working with pandas DataFrames. The table is so big that I have to read it in chunks. I did not use pd.read_csv with chunksize for some reason; instead, I used a custom generator based on the built-in read. I then pass every chunk of the DataFrame to a function that filters rows based on the values of certain columns, using pool.apply_async() wrapped in a lambda and the built-in map(). The reason I did not use map_async() is that I need to pass multiple parameters to the function.

I then added timestamps to the logging and found that the time between yielding a chunk DataFrame and the function starting to execute is nearly 1 minute. This is totally unacceptable. I have 10 million rows in the table in total, and each time my generator yields only 5000 lines. I tried the same script on a table with only 100 thousand rows, also yielding 5000 lines each time, and there the delay was around 10 ms. Why does the total size of the file I'm reading with the generator matter that much in this case? And why does passing a 5000-line table (only 15 columns) to a function take that long?

@ahcub (Contributor, Author) commented May 24, 2020

Hi @yangyxt, I would love to help, but I would probably need to see the code to advise anything meaningful.

@yangyxt commented May 24, 2020

Hi @yangyxt, I would love to help, but I would probably need to see the code to advise anything meaningful.

Thanks! Glad to paste the code here. This is how I set up the mp.Pool:

pool = mp.Pool(processes=(int(estimate_cpus())))
logger.warn("\n\nThe number of usable CPUs is: " + str(int(estimate_cpus())))
# Read the large bam using self-defined iterator based on pysam, yielding chunk dfs
iterator = read_bam_qname_groups(bam, chunksize=chunksize, headers=headers, sep='\t')
output = map(lambda chunk: pool.apply_async(func_return_unit_df, args=(chunk, *func_args)), iterator)
results = map(lambda r: r.get(), output)

Note that read_bam_qname_groups() is a self-defined generator that reads big tables in chunks, yielding 5000 lines at a time.

func_return_unit_df is a placeholder for the functions used to process each 5000-line chunk DataFrame. Since I need to pass *args here, I have to use the built-in map() with a lambda and pool.apply_async() instead of just using pool.map_async().

This is how I process things after getting the "results" iterator:

chunk_df = next(results)
logger.info("From {}, the returned result df's shape is ".format(func_return_unit_df.__name__) + str(chunk_df.shape) + "\n")
chunk_df.to_csv(output_path, sep='\t', index=False, header=False, mode='a', encoding='utf-8')

So the workflow is: read a big file in chunks, pass each chunk to func_return_unit_df, then append the processed result to an output file. Doing this keeps the script from eating too much memory.

Furthermore, I added a logging call in the self-defined read_bam_qname_groups:

logger.info("We check the next the row and its qname is different from returned table's last row, the yielding table shape is: " + str(chunk_df.shape) + str(chunk_df.head()))
yield chunk_df

So before yielding chunk_df, I log a message with a timestamp.

Then I put a logging call at the beginning of func_return_unit_df:

def fetch_multi_with_boolarray(chunk, uqname):
    logger.info("The input chunk shape is: " + str(chunk.shape))

Once the function starts executing, this logs a message with a timestamp.

Here is the key part: I tested the code on a table with 10 million rows and noticed that the interval between the two timestamps is nearly 1 minute. I also tested the code on a table with 100 thousand rows, and there the interval between the two timestamps is around 20 ms.
I don't know why this happens, and it is really important to me. Please let me know your thoughts. Thanks!

@ahcub (Contributor, Author) commented May 24, 2020

The first thing I would recommend is increasing the chunk size to something like 100k lines per chunk, since you will most likely spend more time on data transfer than on parsing each chunk.

But even with 5k rows per chunk, I'm not sure what is going on there that makes it run for a minute.
I ran a few tests on my local machine with 10 million rows and 15 columns, and with multiprocessing it finishes in 6 seconds if I chunk the data by 100k rows and 9 seconds if I chunk by 5k rows.

Another thing: if your files are only 10 million rows long and 15 columns wide, you can simply read the file as-is and skip chunking altogether. On my machine that finishes in 11 seconds, which doesn't seem long.

Here is the code I used for the tests; maybe it will be helpful for you:

from datetime import datetime
from io import BytesIO
from multiprocessing import Pool

import pandas as pd


def process(data):
    df = pd.read_csv(data)
    return (df.shape, df.sum().sum())

chunk_size = 10**5
if __name__ == '__main__':
    with open('file.csv', 'rb') as file:
        content_lines = file.read().splitlines(keepends=True)
        print(len(content_lines))
        with Pool(4) as pool:
            start = datetime.now()
            tasks = []
            for i in range(0, len(content_lines), chunk_size):
                print('starting task', i)
                # each worker gets a BytesIO wrapping its slice of lines
                tasks.append(pool.apply_async(process, (BytesIO(b''.join(content_lines[i:i+chunk_size])),)))
            for task in tasks:
                print(task.get())
            print(datetime.now() - start)

Not sure whether I answered your question, but I hope this is helpful.
Let me know if you have other questions.

@yangyxt commented May 25, 2020

Dear ahcub,
Thank you so much for spending that time running tests for my issue!
To clarify, I not only need this script to process tables with 10 million rows, but also tables with 100 million rows, so I have to read the tables in chunks.

I also have no clue about this issue. I found an article here, https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle, but I don't have enough background knowledge to understand it. Does it remind you of anything?

Another thing that confuses me: if a 5000-row dataframe is small, why does passing 5000 rows through the pool object trigger this error? https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647

BTW, I run this Python script through the PBS system on an HPC facility. The pool object's process count is the number of vacant CPUs minus 1.

Thanks again for your time!

@ahcub (Contributor, Author) commented May 25, 2020

Another thing that confuses me: if a 5000-row dataframe is small, why does passing 5000 rows through the pool object trigger this error? https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647

This is probably because the Python version on the system is not up to date.

But to answer how 5k rows can end up bigger than 2 GB, I would probably need to see an example of the data (a quick way to measure the pickled payload size is sketched below).
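
For what it's worth, here is a quick way to check how big the pickled payload actually is before it goes through the pool (the DataFrame below is made-up example data, not yours):

import pickle

import pandas as pd

def pickled_size(obj):
    # roughly the number of bytes multiprocessing has to ship to a worker
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

# hypothetical 5000-row, 15-column frame of plain integers; this pickles to
# well under a megabyte, so a multi-gigabyte payload usually means very wide
# rows or large Python objects stored inside the cells
df = pd.DataFrame({"col%d" % i: range(5000) for i in range(15)})
print(pickled_size(df))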

For 100 million rows, or any other amount, I would recommend making chunks of 100-500 MB (uncompressed).
And if you are interested in processing huge amounts of data on a regular basis, though, I would recommend looking for a proper big-data solution on a cluster.

If you want, we can go through the code and the data on a call and try to fix the issue you are having together.
You can find my email on my GitHub profile page.

@yangyxt commented May 25, 2020

Dear ahcub,
Thank you for being willing to work on this issue with me. I'll send you an email and show you the code and the data. Thanks again!
