Skip to content

Fix critical np.timedelta64 encoding bugs #10469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 2, 2025

Conversation

spencerkclark
Copy link
Member

@spencerkclark spencerkclark commented Jun 30, 2025

This PR fixes the critical np.timedelta64 encoding bugs introduced in #10101. We now always encode np.timedelta64 values with a dtype attribute corresponding to the in-memory dtype, and use the same encoding path that we did previously, which by default selects the coarsest units that support integer serialization. This enables storing "timedelta64[ns]" values in netCDF3 format, which was not supported by the "literal" encoding approach implemented in #10101.

For consistency with the previous units-based decoding approach, this update also now enables controlling the decoded resolution of dtype-decoded values via the time_unit parameter of the CFTimedeltaCoder class. By default the time_unit parameter is now None. If the time_unit is None the dtype attribute determines the dtype the data is decoded to; if the time_unit is not None it takes precedence.

cc: @shoyer @kmuehlbauer; sorry again for the trouble.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Spencer!

@shoyer
Copy link
Member

shoyer commented Jun 30, 2025

One other strategic question -- do we need this separate logic for decoding into an integer dtype that exactly matches the precision of the datetime64 units?

xarray/xarray/coding/times.py

Lines 1456 to 1487 in 4a581f4

else:
resolution, _ = np.datetime_data(variable.dtype)
dtype = np.int64
attrs_dtype = f"timedelta64[{resolution}]"
units = _numpy_dtype_to_netcdf_timeunit(variable.dtype)
safe_setitem(attrs, "dtype", attrs_dtype, name=name)
# Remove dtype encoding if it exists to prevent it from
# interfering downstream in NonStringCoder.
encoding.pop("dtype", None)
if any(
k in encoding for k in _INVALID_LITERAL_TIMEDELTA64_ENCODING_KEYS
):
raise ValueError(
f"Specifying 'add_offset' or 'scale_factor' is not "
f"supported when encoding the timedelta64 values of "
f"variable {name!r} with xarray's new default "
f"timedelta64 encoding approach. To encode {name!r} "
f"with xarray's previous timedelta64 encoding "
f"approach, which supports the 'add_offset' and "
f"'scale_factor' parameters, additionally set "
f"encoding['units'] to a unit of time, e.g. "
f"'seconds'. To proceed with encoding of {name!r} "
f"via xarray's new approach, remove any encoding "
f"entries for 'add_offset' or 'scale_factor'."
)
if "_FillValue" not in encoding and "missing_value" not in encoding:
encoding["_FillValue"] = np.iinfo(np.int64).min
data, units = encode_cf_timedelta(data, units, dtype)
safe_setitem(attrs, "units", units, name=name)
return Variable(dims, data, attrs, encoding, fastpath=True)

I am wondering because it seems like this logic is already covered in encode_cf_timedelta(), and adding this explicit check makes means that timedelta64[ns] data cannot be saved directly to netCDF3 files unless the units are converted first. timedelta64[ns] worked via encode_cf_timedelta as long as nanosecond precision isn't really needed, similar to how datetime64[ns] is OK. This was convenient for users of "human" time units.

Copy link
Member Author

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick feedback @shoyer!

I am wondering because it seems like this logic is already covered in encode_cf_timedelta(), and adding this explicit check makes means that timedelta64[ns] data cannot be saved directly to netCDF3 files unless the units are converted first. timedelta64[ns] worked via encode_cf_timedelta as long as nanosecond precision isn't really needed, similar to how datetime64[ns] is OK. This was convenient for users of "human" time units.

I agree this is annoying, and it seems like we should be able to relax it some. I'll think about it some more and see what I can do.

@spencerkclark spencerkclark changed the title Fix critical literal np.timedelta64 encoding bugs Fix critical np.timedelta64 encoding bugs Jul 1, 2025
@spencerkclark spencerkclark force-pushed the fix-timedelta-coding branch from e189b66 to 624625b Compare July 1, 2025 15:52
Always write a dtype attribute to disk regardless of how the timedeltas were
decoded.
@spencerkclark spencerkclark force-pushed the fix-timedelta-coding branch from 624625b to bdda733 Compare July 1, 2025 15:54
@dcherian
Copy link
Contributor

dcherian commented Jul 1, 2025

Would we be open to renaming the attribute that encodes the timedelta64 dtype "xarray_dtype"?

I made the similar proposal for Intervals here: #8005 (comment)

@spencerkclark
Copy link
Member Author

Would we be open to renaming the attribute that encodes the timedelta64 dtype "xarray_dtype"?

I made the similar proposal for Intervals here: #8005 (comment)

Thanks @dcherianStephan's suggestion of always writing a dtype attribute to disk, regardless of how the data existed in the original file, helped obviate the need for renaming this attribute in this PR. The name dtype has some precedent, e.g. when coding boolean values, which is why it was initially chosen in #10101:

def encode(self, variable: Variable, name: T_Name = None) -> Variable:
if (
(variable.dtype == bool)
and ("dtype" not in variable.encoding)
and ("dtype" not in variable.attrs)
):
dims, data, attrs, encoding = unpack_for_encoding(variable)
attrs["dtype"] = "bool"
data = duck_array_ops.astype(data, dtype="i1", copy=True)
return Variable(dims, data, attrs, encoding, fastpath=True)
else:
return variable

@shoyer
Copy link
Member

shoyer commented Jul 1, 2025

Thank Spencer, this looks great.

I can confirm this fixes a few Google related projects that were using timedelta64 serialization.

Copy link
Contributor

@kmuehlbauer kmuehlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @spencerkclark! I'm always hoping that it will be the last time we have to touch these lines of code. Great! 🎉

@spencerkclark
Copy link
Member Author

Thanks @shoyer and @kmuehlbauer for the quick reviews, and confirming that it works in downstream projects. I think this is in a much better state (not least because it now actually works as advertised...). I will go ahead and merge.

@spencerkclark spencerkclark merged commit 516ec07 into pydata:main Jul 2, 2025
24 of 33 checks passed
@spencerkclark spencerkclark deleted the fix-timedelta-coding branch July 2, 2025 11:50
@TomNicholas
Copy link
Member

Guys this was merged with failing tests, and the failures appear to have somehow been introduced in this commit of this PR. Please see zarr-developers/zarr-python#3199, ad let us know if you have any idea why changes to encoding would affect zarr like this... 😕

@ianhi ianhi mentioned this pull request Jul 2, 2025
1 task
@spencerkclark
Copy link
Member Author

Sorry about that! I wrongfully assumed those failures were unrelated, since they passed earlier in the development of this PR. I can reproduce locally, but I'm still a bit puzzled. It seems related to me parametrizing this test with a time_unit argument, which I only did in a later commit (if I remove that then these tests succeed).

Ah as I'm writing this it looks like you've sorted it out (#10492) — thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot load timedelta64 encoded data from disk Writing timedelta64 to netCDF3 always raises an error
5 participants