Description
I've looked at the Zarr API and didn't see a way to list all the file assets (files, data objects, etc.) associated with some Zarr-encoded data. The goal of such an API call would be to produce a concise and complete list of strings, where each string represents the location (e.g., a URL or filesystem path) of some file-like object on some storage system. "Concise" means that every listed file is needed (at least theoretically) by at least one Zarr program; the list should not include any extraneous files. "Complete" means that the Zarr library will not access any data files other than those listed.
There are two use-cases that (I think) would benefit from such an API call.
- copying all the data from one storage system to another.
In order to improve availability and reduce latency (assuming the data is read multiple times), a user might wish to make an additional replica of some Zarr-encoded data. The copy could be to some geographically distinct location (e.g., from Europe to the USA), but could also be across storage technologies (e.g., making some dataset temporarily available on space-constrained NVMe storage).
- validating that storage has all necessary files.
In some scenarios, it may be helpful to catch corruption problems ahead of data use. One simple test is to check whether all the expected files exist on the storage and that the client is authorised to open these files for reading.
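To make the request more concrete, here is a rough sketch of what such a call might compute for one simple case. It assumes a single Zarr v2 array stored as a plain directory, with chunk keys formed by joining the chunk indices with the `dimension_separator` from `.zarray` (defaulting to `.`). The function names `expected_keys` and `missing_keys` are hypothetical and are not part of the Zarr API. It ignores groups and optional files such as `.zattrs`. Note also that a Zarr v2 store may legitimately omit chunks (they are then treated as uninitialised), so a "missing" key is a hint rather than proof of corruption.

```python
import itertools
import json
import math
from pathlib import Path


def expected_keys(zarray_meta):
    """List the store keys one Zarr v2 array may use, given its .zarray metadata.

    Hypothetical helper: an illustration, not an existing Zarr function.
    """
    shape = zarray_meta["shape"]
    chunks = zarray_meta["chunks"]
    sep = zarray_meta.get("dimension_separator", ".")
    # Number of chunks along each dimension; a partial edge chunk still
    # occupies a whole chunk file.
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    keys = [".zarray"]
    for idx in itertools.product(*(range(n) for n in grid)):
        # A zero-dimensional array has a single chunk stored under key "0".
        keys.append(sep.join(map(str, idx)) if idx else "0")
    return keys


def missing_keys(array_dir):
    """Return the expected keys that are not present as files under array_dir."""
    root = Path(array_dir)
    meta = json.loads((root / ".zarray").read_text())
    return [key for key in expected_keys(meta) if not (root / key).is_file()]
```

For example, an array with shape `[4, 5]` and chunks `[2, 3]` would yield `.zarray`, `0.0`, `0.1`, `1.0` and `1.1`. A real implementation would presumably live inside the library, where the store layout, groups, and format version are already known.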
From searching various forums, it seems (from related discussion) that use-cases like this are currently handled by the client simply listing the contents of the underlying storage (listing the contents of some AWS bucket, a directory listing of some POSIX storage, etc.). One potential problem is that nothing guarantees that all the files in the storage belong to the Zarr-encoded data, or that all the files belonging to the Zarr data are present on the storage. For example, when copying data, any extraneous or irrelevant files would be copied too; when validating data, it's unclear which files should be present.
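The gap in the listing-based approach can be illustrated with a small sketch. It assumes a POSIX-style directory and that the caller already has an authoritative list of expected keys, which is exactly what the requested API call would provide. The function name `compare_listing` is hypothetical, not something Zarr offers.

```python
import os


def compare_listing(root, expected):
    """Compare a directory listing against an expected list of relative keys.

    Returns (extraneous, missing): files found on storage but not expected,
    and expected keys that were not found. Hypothetical helper for
    illustration only.
    """
    present = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Normalise to forward slashes so keys compare the same everywhere.
            present.add(rel.replace(os.sep, "/"))
    expected = set(expected)
    return sorted(present - expected), sorted(expected - present)
```

Without the `expected` argument, a plain listing can only report what is present: a copy would pick up everything under `root`, and a validation has nothing to compare against.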
Is there already a way of producing a list of file assets of some Zarr data?
If not, would adding support for this make sense?