Preventing Deadlocks When Reading Metadata Concurrently via asyncio.gather
#3207
+63
−19
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
As described in #3196, I encountered issues opening Zarr v3 arrays stored over SFTP using
fsspec
. Specifically, python would freeze opening zarr arrays.Root Cause
The issue stems from the use of
asyncio.gather
inzarr.core.array.get_array_metadata
, which attempts to read multiple metadata files (e.g.,.zarray
,.zattrs
,zarr.json
) concurrently. This works well for truly asynchronous filesystems, but breaks when using systems likeSFTPFileSystem
, which does not seem to be concurrency-safe in async contexts (potentially relying on blocking I/O internally or managing connection states using global locks) leading to deadlocks or indefinite hangs whenasyncio.gather
is used to perform multiple reads simultaneously.Solution
To address this, I’ve implemented a fallback to sequential reads for filesystems that are not concurrency-safe. The logic is as follows: For non asynchronous file systems, the user sets
store.fs.asynchronous=False
. The helper functionis_concurrency_safe(store_path: StorePath) -> bool
, checks thisgetattr(fs, "asynchronous", True)
. If Trueasyncio.gather
is used, else we fall back to sequentialawait
. This Preserves the performance benefit of concurrent reads for safe filesystems (e.g., local disk, S3, GCS), while preventing deadlocks and improved robustness when using backends like SFTP.These changes may not address all scenarios not asynchronous file systems could cause issues, as there are several other instances of
asyncio.gather
inzarr.core.array
andzarr.core.group
. However, I opted to focus on this specific problem first, as enabling the opening of arrays and groups is likely the highest priority, and I wanted to discuss this approach before making too many changes.I look forward to hearing your thoughts and seeing this issue resolved!
TODO:
docs/user-guide/*.rst
changes/