RuntimeError: Could not build empty array #392

Closed
sirfz opened this issue Apr 30, 2025 · 16 comments

@sirfz

sirfz commented Apr 30, 2025

Describe the bug
I'm encountering a runtime error when trying to create an array of shape (17707749, 768):

RuntimeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 blosc2.empty(shape=(17707749, 768), dtype=np.float32)

File .venv/lib/python3.11/site-packages/blosc2/ndarray.py:2978, in empty(shape, dtype, **kwargs)
2976 blocks = kwargs.pop("blocks", None)
2977 chunks, blocks = compute_chunks_blocks(shape, chunks, blocks, dtype, **kwargs)
-> 2978 return blosc2_ext.empty(shape, chunks, blocks, dtype, **kwargs)

File blosc2_ext.pyx:2706, in blosc2.blosc2_ext.empty()

File blosc2_ext.pyx:2233, in blosc2.blosc2_ext._check_rc()

RuntimeError: Could not build empty array

To Reproduce

import numpy as np
import blosc2

blosc2.empty(shape=(17707749, 768), dtype=np.float32)

Expected behavior
The array should be created without error.

Desktop (please complete the following information):

  • OS: Ubuntu 24.04 (x86_64)
  • Version 3.3.1
@FrancescAlted
Member

FrancescAlted commented Apr 30, 2025

Interesting. Your code works fine for me on Linux, even for an array of more than 4 petabytes (created in about 10 s):

Blosc2 version: 3.3.1
a.info:
 type    : NDArray
shape   : (17707749000, 76800)
chunks  : (75, 76800)
blocks  : (1, 38400)
dtype   : float32
cratio  : 720000.00
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=4,
        : nthreads=16, blocksize=153600, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=16)

	Command being timed: "python python-blosc2/prova.py"
	User time (seconds): 6.83
	System time (seconds): 3.41
	Percent of CPU this job got: 109%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.38
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 12970412
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 3242421
	Voluntary context switches: 192
	Involuntary context switches: 325
	Swaps: 0
	File system inputs: 8
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

As can be seen, only 12 GB of system memory is used.
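For reference, the raw (uncompressed) size of that array can be checked with a quick back-of-the-envelope calculation:

```python
# Raw size of a (17707749000, 76800) float32 array, before compression.
shape = (17_707_749_000, 76_800)
itemsize = 4  # bytes per float32 element
nbytes = shape[0] * shape[1] * itemsize
print(f"{nbytes / 2**50:.1f} PiB")  # ≈ 4.8 PiB of logical data
```

Only the chunk/block metadata and a small working set actually touch RAM, which is why the resident set stays around 12 GB for the sparse case.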

My code:

import numpy as np
import blosc2

print("Blosc2 version:", blosc2.__version__)
a = blosc2.empty(shape=(17707749 * 1_000, 768 * 100), dtype=np.float32)
print("a.info:\n", a.info)

@FrancescAlted
Member

FrancescAlted commented Apr 30, 2025

FWIW, the main bottleneck here is allocating memory for sparse storage. If you use contiguous storage instead, you will be able to create an array of up to 38 petabytes in less than a second (consuming just 58 MB of RAM):

import numpy as np
import blosc2

print("Blosc2 version:", blosc2.__version__)
a = blosc2.empty(shape=(17707749 * 1_000, 768 * 800), dtype=np.float32, contiguous=True)
# Storing to disk is contiguous by default (this should take around 240 bytes on-disk)
# a = blosc2.empty(shape=(17707749 * 1_000, 768 * 800), dtype=np.float32, urlpath="a.b2nd", mode="w")
print("a.info:\n", a.info)
Blosc2 version: 3.3.1
a.info:
 type    : NDArray
shape   : (17707749000, 614400)
chunks  : (10, 614400)
blocks  : (1, 61440)
dtype   : float32
cratio  : 0.00
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=4,
        : nthreads=16, blocksize=245760, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=16)

	Command being timed: "python python-blosc2/prova.py"
	User time (seconds): 0.98
	System time (seconds): 0.01
	Percent of CPU this job got: 757%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.13
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 59020
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 14457
	Voluntary context switches: 136
	Involuntary context switches: 59
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

@sirfz
Author

sirfz commented Apr 30, 2025

The shapes you tried work for me too, but shape (17707749, 768) fails for some reason:

In [1]: import numpy as np
   ...: import blosc2
   ...:
   ...: print("Blosc2 version:", blosc2.__version__)
   ...: a = blosc2.empty(shape=(17707749 * 1_000, 768 * 100), dtype=np.float32)
   ...: print("a.info:\n", a.info)
Blosc2 version: 3.3.1
a.info:
 type    : NDArray
shape   : (17707749000, 76800)
chunks  : (5975, 76800)
blocks  : (1, 25600)
dtype   : float32
cratio  : 57360000.00
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=4,
        : nthreads=56, blocksize=102400, splitmode=<SplitMode.AUTO_SPLIT: 3>,
        : filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
        : <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
        : 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=56)


In [2]: a = blosc2.empty(shape=(17707749 * 1_000, 768), dtype=np.float32)

In [3]: a = blosc2.empty(shape=(17707749, 768), dtype=np.float32)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 1
----> 1 a = blosc2.empty(shape=(17707749, 768), dtype=np.float32)

File .venv/lib/python3.12/site-packages/blosc2/ndarray.py:2978, in empty(shape, dtype, **kwargs)
   2976 blocks = kwargs.pop("blocks", None)
   2977 chunks, blocks = compute_chunks_blocks(shape, chunks, blocks, dtype, **kwargs)
-> 2978 return blosc2_ext.empty(shape, chunks, blocks, dtype, **kwargs)

File blosc2_ext.pyx:2706, in blosc2.blosc2_ext.empty()

File blosc2_ext.pyx:2233, in blosc2.blosc2_ext._check_rc()

RuntimeError: Could not build empty array

In [4]: a = blosc2.empty(shape=(17707749 * 1_000, 768 * 800), dtype=np.float32, contiguous=True)

In [5]:

@FrancescAlted
Member

OK, I cannot reproduce this on a couple of Linux boxes I have. I guess it is related to the chunk/block partition that was computed for your box, which depends on the CPU cache sizes. Can you post the output of this?

import blosc2
import pprint

print("Blosc2 version:", blosc2.__version__)
print("Blosc2 cpu_info:")
pprint.pprint(blosc2.cpu_info)

@sirfz
Author

sirfz commented Apr 30, 2025

Blosc2 version: 3.3.1
Blosc2 cpu_info:
{'arch': 'X86_64',
 'arch_string_raw': 'x86_64',
 'bits': 64,
 'brand_raw': 'AMD EPYC 7R32',
 'count': 192,
 'cpuinfo_version': [9, 0, 0],
 'cpuinfo_version_string': '9.0.0',
 'family': 23,
 'flags': ['3dnowprefetch',
           'abm',
           'adx',
           'aes',
           'aperfmperf',
           'apic',
           'arat',
           'avx',
           'avx2',
           'bmi1',
           'bmi2',
           'clflush',
           'clflushopt',
           'clwb',
           'clzero',
           'cmov',
           'cmp_legacy',
           'constant_tsc',
           'cpuid',
           'cr8_legacy',
           'cx16',
           'cx8',
           'de',
           'extd_apicid',
           'f16c',
           'fma',
           'fpu',
           'fsgsbase',
           'fxsr',
           'fxsr_opt',
           'ht',
           'hypervisor',
           'ibpb',
           'ibrs',
           'lahf_lm',
           'lm',
           'mca',
           'mce',
           'misalignsse',
           'mmx',
           'mmxext',
           'monitor',
           'movbe',
           'msr',
           'mtrr',
           'nonstop_tsc',
           'nopl',
           'npt',
           'nrip_save',
           'nx',
           'pae',
           'pat',
           'pclmulqdq',
           'pdpe1gb',
           'perfctr_core',
           'pge',
           'pni',
           'popcnt',
           'pse',
           'pse36',
           'rdpid',
           'rdpru',
           'rdrand',
           'rdseed',
           'rdtscp',
           'rep_good',
           'sep',
           'sha_ni',
           'smap',
           'smep',
           'ssbd',
           'sse',
           'sse2',
           'sse4_1',
           'sse4_2',
           'sse4a',
           'ssse3',
           'stibp',
           'syscall',
           'topoext',
           'tsc',
           'tsc_known_freq',
           'vme',
           'vmmcall',
           'wbnoinvd',
           'xgetbv1',
           'xsave',
           'xsavec',
           'xsaveerptr',
           'xsaveopt'],
 'hz_actual': [2800000000, 0],
 'hz_actual_friendly': '2.8000 GHz',
 'hz_advertised': [2800000000, 0],
 'hz_advertised_friendly': '2.8000 GHz',
 'l1_data_cache_size': 32768,
 'l1_instruction_cache_size': 3145728,
 'l2_cache_size': 524288,
 'l3_cache_size': 115964116992,
 'model': 49,
 'python_version': '3.12.9.final.0 (64 bit)',
 'vendor_id_raw': 'AuthenticAMD'}

@FrancescAlted
Member

What was happening is that the cache-size discovery machinery was not working correctly. That, combined with a glitch in the cap for the chunksize, was causing the error.

I have fixed the chunksize cap in main; can you do a quick check that it works on your machine?

For a more accurate fix, can you tell us the output of:

lscpu --json

and

cat /sys/devices/system/cpu/cpu0/cache/index3/size

on your machine?

@FrancescAlted
Member

BTW, a new python-blosc2 version, 3.3.2, with this fix included, has been released.

@sirfz
Author

sirfz commented May 1, 2025

3.3.2 confirmed working now, thank you!

For what it's worth, here are the outputs you asked for:

lscpu --json:

{
   "lscpu": [
      {
         "field": "Architecture:",
         "data": "x86_64",
         "children": [
            {
               "field": "CPU op-mode(s):",
               "data": "32-bit, 64-bit"
            },{
               "field": "Address sizes:",
               "data": "48 bits physical, 48 bits virtual"
            },{
               "field": "Byte Order:",
               "data": "Little Endian"
            }
         ]
      },{
         "field": "CPU(s):",
         "data": "192",
         "children": [
            {
               "field": "On-line CPU(s) list:",
               "data": "0-191"
            }
         ]
      },{
         "field": "Vendor ID:",
         "data": "AuthenticAMD",
         "children": [
            {
               "field": "Model name:",
               "data": "AMD EPYC 7R32",
               "children": [
                  {
                     "field": "CPU family:",
                     "data": "23"
                  },{
                     "field": "Model:",
                     "data": "49"
                  },{
                     "field": "Thread(s) per core:",
                     "data": "2"
                  },{
                     "field": "Core(s) per socket:",
                     "data": "48"
                  },{
                     "field": "Socket(s):",
                     "data": "2"
                  },{
                     "field": "Stepping:",
                     "data": "0"
                  },{
                     "field": "BogoMIPS:",
                     "data": "5600.00"
                  },{
                     "field": "Flags:",
                     "data": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid"
                  }
               ]
            }
         ]
      },{
         "field": "Virtualization features:",
         "data": null,
         "children": [
            {
               "field": "Hypervisor vendor:",
               "data": "KVM"
            },{
               "field": "Virtualization type:",
               "data": "full"
            }
         ]
      },{
         "field": "Caches (sum of all):",
         "data": null,
         "children": [
            {
               "field": "L1d:",
               "data": "3 MiB (96 instances)"
            },{
               "field": "L1i:",
               "data": "3 MiB (96 instances)"
            },{
               "field": "L2:",
               "data": "48 MiB (96 instances)"
            },{
               "field": "L3:",
               "data": "384 MiB (24 instances)"
            }
         ]
      },{
         "field": "NUMA:",
         "data": null,
         "children": [
            {
               "field": "NUMA node(s):",
               "data": "2"
            },{
               "field": "NUMA node0 CPU(s):",
               "data": "0-47,96-143"
            },{
               "field": "NUMA node1 CPU(s):",
               "data": "48-95,144-191"
            }
         ]
      },{
         "field": "Vulnerabilities:",
         "data": null,
         "children": [
            {
               "field": "Gather data sampling:",
               "data": "Not affected"
            },{
               "field": "Itlb multihit:",
               "data": "Not affected"
            },{
               "field": "L1tf:",
               "data": "Not affected"
            },{
               "field": "Mds:",
               "data": "Not affected"
            },{
               "field": "Meltdown:",
               "data": "Not affected"
            },{
               "field": "Mmio stale data:",
               "data": "Not affected"
            },{
               "field": "Reg file data sampling:",
               "data": "Not affected"
            },{
               "field": "Retbleed:",
               "data": "Mitigation; untrained return thunk; SMT enabled with STIBP protection"
            },{
               "field": "Spec rstack overflow:",
               "data": "Vulnerable: Safe RET, no microcode"
            },{
               "field": "Spec store bypass:",
               "data": "Mitigation; Speculative Store Bypass disabled via prctl"
            },{
               "field": "Spectre v1:",
               "data": "Mitigation; usercopy/swapgs barriers and __user pointer sanitization"
            },{
               "field": "Spectre v2:",
               "data": "Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected"
            },{
               "field": "Srbds:",
               "data": "Not affected"
            },{
               "field": "Tsx async abort:",
               "data": "Not affected"
            }
         ]
      }
   ]
}

and

$ cat /sys/devices/system/cpu/cpu0/cache/index3/size
16384K

@FrancescAlted
Member

Thanks for the output. With this, I have come up with a more refined way of guessing cache sizes in 11584f1. Can you try the code in main (just install it with pip install git+https://github.com/Blosc/python-blosc2.git@main) and tell me how performance is affected?
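For context, the sysfs value above ("16384K") is the size of a single L3 instance (16 MiB per CCX on Zen2), whereas cpu_info reported an l3_cache_size of 115964116992 bytes (108 GiB), which is implausible and is likely what threw off the discovery machinery. A minimal sketch of parsing the sysfs string (parse_cache_size is a hypothetical helper for illustration, not blosc2's actual code):

```python
def parse_cache_size(text: str) -> int:
    """Parse a sysfs cache size string such as '16384K' into bytes."""
    text = text.strip()
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1].upper() if text else ""
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)  # plain byte count, no unit suffix

print(parse_cache_size("16384K"))  # 16777216 bytes, i.e. 16 MiB per L3 slice
```

A per-instance value like this is a much safer basis for capping chunk sizes than a (possibly bogus) aggregate figure.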

@sirfz
Author

sirfz commented May 2, 2025

Well, I don't have any code to test right now (I hit this error while working on a problem and wanted to test with blosc2, but I've shifted to something else at the moment). If you have any code snippet you'd like me to test, I'd be happy to run it.

@FrancescAlted
Member

Yes, that would be great. Can you please run the script in https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/compute_dists2.py with the 'large' param like this?

/usr/bin/time -v python bench/ndarray/compute_dists2.py large

and send the output back? It should take less than 5 min on your machine. Also, the script creates a plot file (blosc2_vs_numexpr_subplots.png) in the current working directory; please attach it to this ticket too. Thanks!

@sirfz
Copy link
Author

sirfz commented May 2, 2025

Benchmarking constant distribution...
Blosc2 - constant - Size 3000x3000: 3.55 GB/s - cratio: 6280.5x
Blosc2 - constant - Size 6000x6000: 5.71 GB/s - cratio: 6306.9x
Blosc2 - constant - Size 9000x9000: 6.11 GB/s - cratio: 5681.0x
Blosc2 - constant - Size 12000x12000: 6.79 GB/s - cratio: 5051.2x
Blosc2 - constant - Size 15000x15000: 7.27 GB/s - cratio: 6314.4x
Blosc2 - constant - Size 18000x18000: 11.44 GB/s - cratio: 3788.8x
Blosc2 - constant - Size 21000x21000: 12.23 GB/s - cratio: 4420.3x
Blosc2 - constant - Size 24000x24000: 14.78 GB/s - cratio: 5051.6x
Blosc2 - constant - Size 27000x27000: 15.31 GB/s - cratio: 5683.1x
Blosc2 - constant - Size 30000x30000: 18.58 GB/s - cratio: 6314.4x
Numexpr - constant - Size 3000x3000: 9.17 GB/s
Numexpr - constant - Size 6000x6000: 23.48 GB/s
Numexpr - constant - Size 9000x9000: 64.67 GB/s
Numexpr - constant - Size 12000x12000: 55.28 GB/s
Numexpr - constant - Size 15000x15000: 59.04 GB/s
Numexpr - constant - Size 18000x18000: 58.97 GB/s
Numexpr - constant - Size 21000x21000: 8.45 GB/s
Numexpr - constant - Size 24000x24000: 7.43 GB/s
Numexpr - constant - Size 27000x27000: 14.82 GB/s
Numexpr - constant - Size 30000x30000: 9.13 GB/s

Benchmarking arange distribution...
Blosc2 - arange - Size 3000x3000: 1.26 GB/s - cratio: 5968.2x
Blosc2 - arange - Size 6000x6000: 5.39 GB/s - cratio: 5992.0x
Blosc2 - arange - Size 9000x9000: 6.61 GB/s - cratio: 5397.1x
Blosc2 - arange - Size 12000x12000: 6.92 GB/s - cratio: 4798.7x
Blosc2 - arange - Size 15000x15000: 6.60 GB/s - cratio: 5852.4x
Blosc2 - arange - Size 18000x18000: 9.29 GB/s - cratio: 3599.4x
Blosc2 - arange - Size 21000x21000: 11.07 GB/s - cratio: 4097.0x
Blosc2 - arange - Size 24000x24000: 14.21 GB/s - cratio: 4799.0x
Blosc2 - arange - Size 27000x27000: 14.77 GB/s - cratio: 5267.4x
Blosc2 - arange - Size 30000x30000: 16.80 GB/s - cratio: 5852.4x
Numexpr - arange - Size 3000x3000: 16.73 GB/s
Numexpr - arange - Size 6000x6000: 60.64 GB/s
Numexpr - arange - Size 9000x9000: 75.35 GB/s
Numexpr - arange - Size 12000x12000: 67.92 GB/s
Numexpr - arange - Size 15000x15000: 65.64 GB/s
Numexpr - arange - Size 18000x18000: 65.53 GB/s
Numexpr - arange - Size 21000x21000: 67.80 GB/s
Numexpr - arange - Size 24000x24000: 67.48 GB/s
Numexpr - arange - Size 27000x27000: 12.18 GB/s
Numexpr - arange - Size 30000x30000: 13.03 GB/s

Benchmarking linspace distribution...
Blosc2 - linspace - Size 3000x3000: 1.47 GB/s - cratio: 241.0x
Blosc2 - linspace - Size 6000x6000: 5.67 GB/s - cratio: 320.6x
Blosc2 - linspace - Size 9000x9000: 6.35 GB/s - cratio: 417.6x
Blosc2 - linspace - Size 12000x12000: 7.00 GB/s - cratio: 426.0x
Blosc2 - linspace - Size 15000x15000: 6.39 GB/s - cratio: 479.3x
Blosc2 - linspace - Size 18000x18000: 9.64 GB/s - cratio: 453.5x
Blosc2 - linspace - Size 21000x21000: 11.19 GB/s - cratio: 527.6x
Blosc2 - linspace - Size 24000x24000: 14.62 GB/s - cratio: 492.0x
Blosc2 - linspace - Size 27000x27000: 14.62 GB/s - cratio: 561.9x
Blosc2 - linspace - Size 30000x30000: 16.06 GB/s - cratio: 503.2x
Numexpr - linspace - Size 3000x3000: 19.98 GB/s
Numexpr - linspace - Size 6000x6000: 69.65 GB/s
Numexpr - linspace - Size 9000x9000: 71.35 GB/s
Numexpr - linspace - Size 12000x12000: 61.82 GB/s
Numexpr - linspace - Size 15000x15000: 67.47 GB/s
Numexpr - linspace - Size 18000x18000: 65.62 GB/s
Numexpr - linspace - Size 21000x21000: 69.22 GB/s
Numexpr - linspace - Size 24000x24000: 67.93 GB/s
Numexpr - linspace - Size 27000x27000: 67.95 GB/s
Numexpr - linspace - Size 30000x30000: 66.34 GB/s
        Command being timed: "uv run python bench/ndarray/compute_dists2.py large"
        User time (seconds): 492.40
        System time (seconds): 1785.30
        Percent of CPU this job got: 172%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 22:00.29
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 32688108
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 30026709
        Voluntary context switches: 239220
        Involuntary context switches: 10941
        Swaps: 0
        File system inputs: 0
        File system outputs: 18808
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

[Image attachment: benchmark plot (blosc2_vs_numexpr_subplots.png)]

@FrancescAlted
Member

FrancescAlted commented May 2, 2025

Hmm, interesting. Your CPU seems to run Blosc2 quite inefficiently. The reasons are unknown to me, as we generally don't have access to multi-socket CPUs. One possibility is that Zen2 is not very good at compressing/decompressing, but as Zen2 is not that old, I tend to think there is quite a lot of room for improvement in multi-socket scenarios.

For what it's worth, here are the benchmarks for our AMD box (9800X3D, Zen5), with 64 GB of RAM:

Benchmarking constant distribution...
Blosc2 - constant - Size 3000x3000: 18.09 GB/s - cratio: 6177.1x
Blosc2 - constant - Size 6000x6000: 46.99 GB/s - cratio: 6185.6x
Blosc2 - constant - Size 9000x9000: 31.74 GB/s - cratio: 5184.0x
Blosc2 - constant - Size 12000x12000: 133.59 GB/s - cratio: 4968.9x
Blosc2 - constant - Size 15000x15000: 153.75 GB/s - cratio: 6177.1x
Blosc2 - constant - Size 18000x18000: 160.01 GB/s - cratio: 3739.1x
Blosc2 - constant - Size 21000x21000: 159.11 GB/s - cratio: 4351.3x
Blosc2 - constant - Size 24000x24000: 155.22 GB/s - cratio: 4968.9x
Blosc2 - constant - Size 27000x27000: 130.23 GB/s - cratio: 5579.8x
Blosc2 - constant - Size 30000x30000: 155.67 GB/s - cratio: 6185.6x
Numexpr - constant - Size 3000x3000: 17.78 GB/s
Numexpr - constant - Size 6000x6000: 39.03 GB/s
Numexpr - constant - Size 9000x9000: 41.77 GB/s
Numexpr - constant - Size 12000x12000: 39.58 GB/s
Numexpr - constant - Size 15000x15000: 39.72 GB/s
Numexpr - constant - Size 18000x18000: 40.46 GB/s
Numexpr - constant - Size 21000x21000: 40.80 GB/s
Numexpr - constant - Size 24000x24000: 41.12 GB/s
Numexpr - constant - Size 27000x27000: 41.45 GB/s
Numexpr - constant - Size 30000x30000: 41.58 GB/s

Benchmarking arange distribution...
Blosc2 - arange - Size 3000x3000: 11.71 GB/s - cratio: 5874.7x
Blosc2 - arange - Size 6000x6000: 51.74 GB/s - cratio: 5882.4x
Blosc2 - arange - Size 9000x9000: 34.21 GB/s - cratio: 4638.0x
Blosc2 - arange - Size 12000x12000: 97.05 GB/s - cratio: 4724.4x
Blosc2 - arange - Size 15000x15000: 110.62 GB/s - cratio: 5734.3x
Blosc2 - arange - Size 18000x18000: 115.77 GB/s - cratio: 3554.5x
Blosc2 - arange - Size 21000x21000: 126.23 GB/s - cratio: 4037.5x
Blosc2 - arange - Size 24000x24000: 121.86 GB/s - cratio: 4724.4x
Blosc2 - arange - Size 27000x27000: 114.91 GB/s - cratio: 5178.5x
Blosc2 - arange - Size 30000x30000: 115.04 GB/s - cratio: 5741.6x
Numexpr - arange - Size 3000x3000: 22.45 GB/s
Numexpr - arange - Size 6000x6000: 39.28 GB/s
Numexpr - arange - Size 9000x9000: 41.44 GB/s
Numexpr - arange - Size 12000x12000: 39.68 GB/s
Numexpr - arange - Size 15000x15000: 39.63 GB/s
Numexpr - arange - Size 18000x18000: 40.32 GB/s
Numexpr - arange - Size 21000x21000: 40.78 GB/s
Numexpr - arange - Size 24000x24000: 41.11 GB/s
Numexpr - arange - Size 27000x27000: 41.55 GB/s
Numexpr - arange - Size 30000x30000: 41.66 GB/s

Benchmarking linspace distribution...
Blosc2 - linspace - Size 3000x3000: 11.43 GB/s - cratio: 240.8x
Blosc2 - linspace - Size 6000x6000: 62.70 GB/s - cratio: 320.3x
Blosc2 - linspace - Size 9000x9000: 26.31 GB/s - cratio: 408.6x
Blosc2 - linspace - Size 12000x12000: 87.68 GB/s - cratio: 425.4x
Blosc2 - linspace - Size 15000x15000: 94.90 GB/s - cratio: 478.5x
Blosc2 - linspace - Size 18000x18000: 93.14 GB/s - cratio: 449.6x
Blosc2 - linspace - Size 21000x21000: 101.17 GB/s - cratio: 529.0x
Blosc2 - linspace - Size 24000x24000: 104.50 GB/s - cratio: 491.2x
Blosc2 - linspace - Size 27000x27000: 103.89 GB/s - cratio: 560.8x
Blosc2 - linspace - Size 30000x30000: 92.90 GB/s - cratio: 503.0x
Numexpr - linspace - Size 3000x3000: 26.99 GB/s
Numexpr - linspace - Size 6000x6000: 39.38 GB/s
Numexpr - linspace - Size 9000x9000: 41.30 GB/s
Numexpr - linspace - Size 12000x12000: 39.71 GB/s
Numexpr - linspace - Size 15000x15000: 39.56 GB/s
Numexpr - linspace - Size 18000x18000: 40.53 GB/s
Numexpr - linspace - Size 21000x21000: 40.98 GB/s
Numexpr - linspace - Size 24000x24000: 41.16 GB/s
Numexpr - linspace - Size 27000x27000: 41.51 GB/s
Numexpr - linspace - Size 30000x30000: 41.65 GB/s
	Command being timed: "python bench/ndarray/compute_dists2.py large"
	User time (seconds): 156.13
	System time (seconds): 42.27
	Percent of CPU this job got: 244%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21.05
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 27955872
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 510382
	Voluntary context switches: 879587
	Involuntary context switches: 21990
	Swaps: 0
	File system inputs: 0
	File system outputs: 824
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

[Image attachment: benchmark plot (blosc2_vs_numexpr_subplots.png)]

@sirfz
Author

sirfz commented May 2, 2025

I tried running it with numactl --cpunode and --membind, which doesn't seem to have made a difference for Blosc2 but drastically improved numexpr:

Benchmarking constant distribution...
Blosc2 - constant - Size 3000x3000: 3.76 GB/s - cratio: 6280.5x
Blosc2 - constant - Size 6000x6000: 5.38 GB/s - cratio: 6306.9x
Blosc2 - constant - Size 9000x9000: 6.11 GB/s - cratio: 5681.0x
Blosc2 - constant - Size 12000x12000: 6.59 GB/s - cratio: 5051.2x
Blosc2 - constant - Size 15000x15000: 6.96 GB/s - cratio: 6314.4x
Blosc2 - constant - Size 18000x18000: 10.65 GB/s - cratio: 3788.8x
Blosc2 - constant - Size 21000x21000: 11.20 GB/s - cratio: 4420.3x
Blosc2 - constant - Size 24000x24000: 13.82 GB/s - cratio: 5051.6x
Blosc2 - constant - Size 27000x27000: 14.31 GB/s - cratio: 5683.1x
Blosc2 - constant - Size 30000x30000: 16.42 GB/s - cratio: 6314.4x
Numexpr - constant - Size 3000x3000: 33.52 GB/s
Numexpr - constant - Size 6000x6000: 29.12 GB/s
Numexpr - constant - Size 9000x9000: 43.61 GB/s
Numexpr - constant - Size 12000x12000: 51.13 GB/s
Numexpr - constant - Size 15000x15000: 59.47 GB/s
Numexpr - constant - Size 18000x18000: 63.32 GB/s
Numexpr - constant - Size 21000x21000: 63.91 GB/s
Numexpr - constant - Size 24000x24000: 64.73 GB/s
Numexpr - constant - Size 27000x27000: 65.23 GB/s
Numexpr - constant - Size 30000x30000: 65.71 GB/s

Benchmarking arange distribution...
Blosc2 - arange - Size 3000x3000: 2.27 GB/s - cratio: 5968.2x
Blosc2 - arange - Size 6000x6000: 5.50 GB/s - cratio: 5992.0x
Blosc2 - arange - Size 9000x9000: 6.22 GB/s - cratio: 5397.1x
Blosc2 - arange - Size 12000x12000: 6.62 GB/s - cratio: 4798.7x
Blosc2 - arange - Size 15000x15000: 6.92 GB/s - cratio: 5852.4x
Blosc2 - arange - Size 18000x18000: 10.72 GB/s - cratio: 3599.4x
Blosc2 - arange - Size 21000x21000: 11.23 GB/s - cratio: 4097.0x
Blosc2 - arange - Size 24000x24000: 13.90 GB/s - cratio: 4799.0x
Blosc2 - arange - Size 27000x27000: 14.32 GB/s - cratio: 5267.4x
Blosc2 - arange - Size 30000x30000: 15.79 GB/s - cratio: 5852.4x
Numexpr - arange - Size 3000x3000: 21.42 GB/s
Numexpr - arange - Size 6000x6000: 42.73 GB/s
Numexpr - arange - Size 9000x9000: 45.86 GB/s
Numexpr - arange - Size 12000x12000: 52.16 GB/s
Numexpr - arange - Size 15000x15000: 59.74 GB/s
Numexpr - arange - Size 18000x18000: 62.47 GB/s
Numexpr - arange - Size 21000x21000: 64.25 GB/s
Numexpr - arange - Size 24000x24000: 65.40 GB/s
Numexpr - arange - Size 27000x27000: 66.09 GB/s
Numexpr - arange - Size 30000x30000: 38.46 GB/s

Benchmarking linspace distribution...
Blosc2 - linspace - Size 3000x3000: 2.16 GB/s - cratio: 241.0x
Blosc2 - linspace - Size 6000x6000: 5.46 GB/s - cratio: 320.6x
Blosc2 - linspace - Size 9000x9000: 6.07 GB/s - cratio: 417.6x
Blosc2 - linspace - Size 12000x12000: 6.55 GB/s - cratio: 426.0x
Blosc2 - linspace - Size 15000x15000: 7.02 GB/s - cratio: 479.3x
Blosc2 - linspace - Size 18000x18000: 10.78 GB/s - cratio: 453.5x
Blosc2 - linspace - Size 21000x21000: 11.42 GB/s - cratio: 527.6x
Blosc2 - linspace - Size 24000x24000: 14.01 GB/s - cratio: 492.0x
Blosc2 - linspace - Size 27000x27000: 14.77 GB/s - cratio: 561.9x
Blosc2 - linspace - Size 30000x30000: 16.22 GB/s - cratio: 503.2x
Numexpr - linspace - Size 3000x3000: 18.15 GB/s
Numexpr - linspace - Size 6000x6000: 47.70 GB/s
Numexpr - linspace - Size 9000x9000: 48.58 GB/s
Numexpr - linspace - Size 12000x12000: 54.00 GB/s
Numexpr - linspace - Size 15000x15000: 62.39 GB/s
Numexpr - linspace - Size 18000x18000: 64.92 GB/s
Numexpr - linspace - Size 21000x21000: 65.72 GB/s
Numexpr - linspace - Size 24000x24000: 65.26 GB/s
Numexpr - linspace - Size 27000x27000: 67.60 GB/s
Numexpr - linspace - Size 30000x30000: 67.76 GB/s
        Command being timed: "numactl --cpunode=0 --membind=0 ./.venv/bin/python bench/ndarray/compute_dists2.py large"
        User time (seconds): 591.42
        System time (seconds): 348.34
        Percent of CPU this job got: 339%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:36.42
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 30987264
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 26513895
        Voluntary context switches: 179645
        Involuntary context switches: 9696
        Swaps: 0
        File system inputs: 0
        File system outputs: 1152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

[Image attachment: benchmark plot (blosc2_vs_numexpr_subplots.png)]

@FrancescAlted
Member

OK. Is this with the main branch, or stock 3.3.2? It would be nice to see a comparison between the two, as they make quite different L3 guesses.

@sirfz
Author

sirfz commented May 5, 2025

Those were with main; here are the results for 3.3.2:

Benchmarking constant distribution...
Blosc2 - constant - Size 3000x3000: 3.73 GB/s - cratio: 6280.5x
Blosc2 - constant - Size 6000x6000: 5.45 GB/s - cratio: 6306.9x
Blosc2 - constant - Size 9000x9000: 6.24 GB/s - cratio: 5681.0x
Blosc2 - constant - Size 12000x12000: 6.74 GB/s - cratio: 5051.2x
Blosc2 - constant - Size 15000x15000: 6.99 GB/s - cratio: 6314.4x
Blosc2 - constant - Size 18000x18000: 10.95 GB/s - cratio: 3788.8x
Blosc2 - constant - Size 21000x21000: 11.39 GB/s - cratio: 4420.3x
Blosc2 - constant - Size 24000x24000: 14.26 GB/s - cratio: 5051.6x
Blosc2 - constant - Size 27000x27000: 14.40 GB/s - cratio: 5683.1x
Blosc2 - constant - Size 30000x30000: 16.52 GB/s - cratio: 6314.4x
Numexpr - constant - Size 3000x3000: 28.70 GB/s
Numexpr - constant - Size 6000x6000: 46.40 GB/s
Numexpr - constant - Size 9000x9000: 48.29 GB/s
Numexpr - constant - Size 12000x12000: 51.97 GB/s
Numexpr - constant - Size 15000x15000: 60.65 GB/s
Numexpr - constant - Size 18000x18000: 64.06 GB/s
Numexpr - constant - Size 21000x21000: 64.17 GB/s
Numexpr - constant - Size 24000x24000: 64.99 GB/s
Numexpr - constant - Size 27000x27000: 65.89 GB/s
Numexpr - constant - Size 30000x30000: 66.33 GB/s

Benchmarking arange distribution...
Blosc2 - arange - Size 3000x3000: 2.08 GB/s - cratio: 5968.2x
Blosc2 - arange - Size 6000x6000: 4.96 GB/s - cratio: 5992.0x
Blosc2 - arange - Size 9000x9000: 6.28 GB/s - cratio: 5397.1x
Blosc2 - arange - Size 12000x12000: 6.59 GB/s - cratio: 4798.7x
Blosc2 - arange - Size 15000x15000: 6.75 GB/s - cratio: 5852.4x
Blosc2 - arange - Size 18000x18000: 10.96 GB/s - cratio: 3599.4x
Blosc2 - arange - Size 21000x21000: 11.10 GB/s - cratio: 4097.0x
Blosc2 - arange - Size 24000x24000: 14.03 GB/s - cratio: 4799.0x
Blosc2 - arange - Size 27000x27000: 14.32 GB/s - cratio: 5267.4x
Blosc2 - arange - Size 30000x30000: 16.52 GB/s - cratio: 5852.4x
Numexpr - arange - Size 3000x3000: 22.03 GB/s
Numexpr - arange - Size 6000x6000: 52.31 GB/s
Numexpr - arange - Size 9000x9000: 52.69 GB/s
Numexpr - arange - Size 12000x12000: 55.18 GB/s
Numexpr - arange - Size 15000x15000: 63.60 GB/s
Numexpr - arange - Size 18000x18000: 64.22 GB/s
Numexpr - arange - Size 21000x21000: 65.45 GB/s
Numexpr - arange - Size 24000x24000: 66.18 GB/s
Numexpr - arange - Size 27000x27000: 67.07 GB/s
Numexpr - arange - Size 30000x30000: 67.31 GB/s

Benchmarking linspace distribution...
Blosc2 - linspace - Size 3000x3000: 2.15 GB/s - cratio: 241.0x
Blosc2 - linspace - Size 6000x6000: 5.43 GB/s - cratio: 320.6x
Blosc2 - linspace - Size 9000x9000: 6.14 GB/s - cratio: 417.6x
Blosc2 - linspace - Size 12000x12000: 6.69 GB/s - cratio: 426.0x
Blosc2 - linspace - Size 15000x15000: 6.92 GB/s - cratio: 479.3x
Blosc2 - linspace - Size 18000x18000: 10.82 GB/s - cratio: 453.5x
Blosc2 - linspace - Size 21000x21000: 10.85 GB/s - cratio: 527.6x
Blosc2 - linspace - Size 24000x24000: 13.81 GB/s - cratio: 492.0x
Blosc2 - linspace - Size 27000x27000: 14.34 GB/s - cratio: 561.9x
Blosc2 - linspace - Size 30000x30000: 16.33 GB/s - cratio: 503.2x
Numexpr - linspace - Size 3000x3000: 18.74 GB/s
Numexpr - linspace - Size 6000x6000: 45.80 GB/s
Numexpr - linspace - Size 9000x9000: 47.38 GB/s
Numexpr - linspace - Size 12000x12000: 52.24 GB/s
Numexpr - linspace - Size 15000x15000: 60.96 GB/s
Numexpr - linspace - Size 18000x18000: 63.57 GB/s
Numexpr - linspace - Size 21000x21000: 64.86 GB/s
Numexpr - linspace - Size 24000x24000: 65.57 GB/s
Numexpr - linspace - Size 27000x27000: 65.34 GB/s
Numexpr - linspace - Size 30000x30000: 66.30 GB/s
        Command being timed: "numactl --cpunode=0 --membind=0 ./.venv/bin/python bench/ndarray/compute_dists2.py large"
        User time (seconds): 585.55
        System time (seconds): 327.29
        Percent of CPU this job got: 347%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:22.49
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 30937596
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 9
        Minor (reclaiming a frame) page faults: 26464104
        Voluntary context switches: 167039
        Involuntary context switches: 9807
        Swaps: 0
        File system inputs: 2248
        File system outputs: 1744
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

[attached: benchmark plot for the 3.3.2 run]
