Commit 201ba8e

feefladder and benbovy authored
Solve nan and encoding in dtype (fill value) (#173)
* added a main_clock with dimension 'mclock' to initialize context, this returns a xr.DataArray. However, I cannot set xs.variables in the initialize step: 'AttributeError: cannot set attribute'
* fixed black
* changed 'master' to 'main' everywhere except in non-breaking testcases.
* fixed space and undid changes of access-clock
* added master_clock_dim + tests, master_clock_coords no tests
* Update xsimlab/xr_accessor.py
  Co-authored-by: Benoit Bovy <[email protected]>
* Update xsimlab/xr_accessor.py
  Co-authored-by: Benoit Bovy <[email protected]>
* Update xsimlab/xr_accessor.py
  Co-authored-by: Benoit Bovy <[email protected]>
* added master/main_clock_coord access tests, updated whats-new
* removed vscode settings
* Apply suggestions from code review
  Co-authored-by: Benoit Bovy <[email protected]>
* removed raise checks from test_update_clocks_master_clock_warning
* removed raises in test_update_clocks_master_warning
* removed redundancy in test_update_master_clock
* updated fill_value from dtype and support for encoding
* added more datatypes, added warning for boolean
* very black
* more black
* booleans set to False
* removed warning
* removed redundancy
* tests and docs
* added environment.yml
* added tests
* black
* solved environment.yml (#174)

Co-authored-by: Benoit Bovy <[email protected]>
1 parent 3e8cb81 commit 201ba8e

6 files changed

+149
-21
lines changed

doc/develop.rst

Lines changed: 1 addition & 1 deletion
@@ -183,7 +183,7 @@ files) in ``xarray-simlab/doc``.
 To build the documentation locally, first install requirements (for example here
 in a separate conda environment)::

-    $ conda env create -n xarray-simlab_doc -f doc/environment.yml
+    $ conda env create -n xarray-simlab_doc -f ci/requirements/doc.yml
     $ conda activate xarray-simlab_doc

 Then build documentation with ``make``::

doc/io_storage.rst

Lines changed: 80 additions & 18 deletions
@@ -149,6 +149,86 @@ computing with Dask`_ in xarray's docs).

 Advanced usage
 --------------
+Encoding options
+~~~~~~~~~~~~~~~~
+
+It is possible to control via some encoding options how Zarr stores simulation
+data. Those options can be set for variables declared in process classes. See
+the ``encoding`` parameter of :func:`~xsimlab.variable` for all available
+options. Encoding options may also be set or overridden when calling
+:func:`~xarray.Dataset.xsimlab.run`. An often encountered use case for encoding
+is to suppress ``nan`` values in the output dataset.
+
+Default fill values
+~~~~~~~~~~~~~~~~~~~
+
+By default, output variables have a fill value, and elements equal to it are set to ``np.nan`` in the output.
+These fill values are determined during model runtime from the variable's datatype.
+This only affects the output array: during the model run, the actual values are used.
+
+.. list-table:: Fill values
+   :header-rows: 1
+
+   * - datatype
+     - fill values
+     - example
+   * - (unsigned) integer
+     - maximum possible value
+     - uint8, 255
+   * - float
+     - np.nan
+     -
+   * - string
+     - empty string
+     - ""
+   * - bool
+     - False
+     -
+   * - complex values
+     - default for each dtype
+     -
+
+Especially for boolean datatypes, where the default fill value is ``False``, it is
+desirable to suppress this behaviour. There are several options, with different
+benefits and drawbacks, as outlined below.
+
+
+If you know beforehand what the ``fill_value`` should be, this can be set directly in the process class:
+
+.. literalinclude:: scripts/encoding.py
+   :lines: 5-13
+
+.. ipython:: python
+    :suppress:
+
+    from encoding import Foo
+
+The resulting output is set to nan when no ``fill_value`` is specified:
+
+.. ipython:: python
+
+    model = xs.Model({"foo":Foo})
+    in_ds = xs.create_setup(
+        model=model,
+        clocks={"clock": [0, 1]},
+        output_vars={"foo__v_bool": None, "foo__v_bool_nan": None},
+    )
+    # this will result in nan values in output
+    in_ds.xsimlab.run(model=model)
+
+Alternatively, encoding (or decoding) options can be set during model run:
+
+.. ipython:: python
+
+    # set encoding options during model run
+    in_ds.xsimlab.run(model=model, encoding={"foo__v_bool_nan": {"fill_value": None}})
+
+    # set mask_and_scale to false
+    in_ds.xsimlab.run(model=model, decoding={"mask_and_scale": False})
+
+However, using ``mask_and_scale=False`` results in non-serializable attributes
+in the output dataset, so the other alternatives are preferable.
+

 Dynamically sized arrays
 ~~~~~~~~~~~~~~~~~~~~~~~~
@@ -206,22 +286,4 @@ to the xarray Dataset or DataArray :meth:`~xarray.Dataset.stack`,

 .. _io_storage_encoding:

-Encoding options
-~~~~~~~~~~~~~~~~

-It is possible to control via some encoding options how Zarr stores simulation
-data. Those options can be set for variables declared in process classes. See
-the ``encoding`` parameter of :func:`~xsimlab.variable` for all available
-options. Encoding options may also be set or overridden when calling
-:func:`~xarray.Dataset.xsimlab.run`.
-
-.. warning::
-
-   Zarr uses ``0`` as the default fill value for numeric value types. This may
-   badly affect the results, as array elements with the fill value are replaced
-   by NA in the output xarray Dataset. For variables which accept ``0`` as a
-   possible (non-missing) value, it is highly recommended to explicitly provide
-   another ``fill_value``. Alternatively, it is possible to deactivate this
-   value masking behavior by setting the ``mask_and_scale=False`` option and
-   pass it via the ``decoding`` parameter of
-   :func:`~xarray.Dataset.xsimlab.run`.
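
Reviewer note: both the removed warning and the new "Default fill values" section rest on one behaviour: on decoding, array elements equal to a variable's fill value are shown as missing. Below is a minimal, library-free sketch of that idea. `mask_fill_values` is a hypothetical helper written for illustration only. It is not a function of xarray, Zarr or xarray-simlab, and the real decoding path is more involved.

```python
import math


def mask_fill_values(values, fill_value=None):
    """Return a copy of ``values`` with fill-value elements replaced by NaN.

    Hypothetical helper that only mimics the masking idea described in the
    docs. ``fill_value=None`` is taken to mean "no masking".
    """
    if fill_value is None:
        return list(values)
    return [math.nan if v == fill_value else v for v in values]


# Boolean output with the default fill value (False): the False entry is lost.
masked = mask_fill_values([True, False], fill_value=False)

# Same data with masking disabled: values are kept as written.
kept = mask_fill_values([True, False], fill_value=None)

# uint8 output with the default fill value (255): 0 survives, 255 is masked.
masked_uint8 = mask_fill_values([0, 255], fill_value=255)
```

Under this reading, moving the integer default from `0` to the maximum dtype value means `0` no longer collides with the fill value. The maximum value now does.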

doc/scripts/encoding.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+import xsimlab as xs
+import numpy as np
+
+
+@xs.process
+class Foo:
+    v_bool_nan = xs.variable(dims="x", intent="out")
+    # suppress nan values by setting an explicit fill value:
+    v_bool = xs.variable(dims="x", intent="out", encoding={"fill_value": None})
+
+    def initialize(self):
+        self.v_bool_nan = [True, False]
+        self.v_bool = [True, False]

doc/whats_new.rst

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@ v0.6.0 (Unreleased)
   ``main_clock``, ``main_clock_dim`` and ``main_clock_coords`` and all
   occurrences of ``master`` to ``main`` in the rest of the codebase. all
   ``master...`` API hooks are still working, but raise a FutureWarning
+- Changed default ``fill_value`` in the zarr stores to maximum dtype value
+  for integer dtypes and ``np.nan`` for floating-point variables.
+
+

 v0.5.0 (26 January 2021)
 ------------------------

xsimlab/stores.py

Lines changed: 13 additions & 2 deletions
@@ -1,5 +1,6 @@
 from collections.abc import MutableMapping
 from typing import Any, Dict, Optional, Tuple, Union
+import warnings

 import numpy as np
 import xarray as xr
@@ -60,8 +61,14 @@ def ensure_no_dataset_conflict(zgroup, znames):
 def default_fill_value_from_dtype(dtype=None):
     if dtype is None:
         return 0
-    if dtype.kind == "f":
+    elif dtype.kind == "f":
         return np.nan
+    elif dtype.kind == "i":
+        return np.iinfo(dtype).max
+    elif dtype.kind == "u":
+        return np.iinfo(dtype).max
+    elif dtype.kind == "U":
+        return ""
     elif dtype.kind in "c":
         return (
             default_fill_value_from_dtype(dtype.type().real.dtype),
@@ -190,7 +197,11 @@ def _create_zarr_dataset(
         value = model.cache[var_key]["value"]
         clock = var_info["clock"]

-        dtype = getattr(value, "dtype", np.asarray(value).dtype)
+        if "dtype" in var_info["metadata"]["encoding"]:
+            dtype = np.dtype(var_info["metadata"]["encoding"]["dtype"])
+        else:
+            dtype = getattr(value, "dtype", np.asarray(value).dtype)
+
         shape = list(np.shape(value))
         chunks = list(get_auto_chunks(shape, dtype))
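
Reviewer note: the two changes in this file can be tried in isolation with the standalone sketch below. It covers only the branches visible in the hunks above. The names `sketch_fill_value` and `sketch_resolve_dtype` are hypothetical, and raising for uncovered dtype kinds is an assumption of this sketch, not the behaviour of `xsimlab/stores.py`.

```python
import numpy as np


def sketch_fill_value(dtype=None):
    """Fill value per dtype kind, limited to the branches shown in the diff."""
    if dtype is None:
        return 0
    dtype = np.dtype(dtype)
    if dtype.kind == "f":
        return np.nan
    if dtype.kind in "iu":
        # signed and unsigned integers: largest representable value
        return np.iinfo(dtype).max
    if dtype.kind == "U":
        return ""
    # Assumption: every other kind is outside the scope of this sketch.
    raise NotImplementedError(f"dtype kind {dtype.kind!r} not covered here")


def sketch_resolve_dtype(value, encoding):
    """Prefer an explicit ``dtype`` in the encoding, else infer it from the value."""
    if "dtype" in encoding:
        return np.dtype(encoding["dtype"])
    return getattr(value, "dtype", np.asarray(value).dtype)


# A list of small integers is stored as uint8 only if the encoding says so.
explicit = sketch_resolve_dtype([0, 255], {"dtype": np.uint8})
inferred = sketch_resolve_dtype([0, 255], {})
fill = sketch_fill_value(explicit)
```

Resolving the dtype from the encoding first matters because the fill value depends on it: for the `uint8` case the fill value is the `uint8` maximum rather than the maximum of the inferred platform integer type.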

xsimlab/tests/test_stores.py

Lines changed: 38 additions & 0 deletions
@@ -286,6 +286,44 @@ def _get_v2(self):
         assert ztest.p__v3.chunks == (10,)
         assert ztest.p__v4[0] == {"foo": "bar"}

+    def test_fill_values(self):
+        @xs.process
+        class Foo:
+            v_int64 = xs.variable(dims="x", intent="out")
+            v_float64 = xs.variable(dims="x", intent="out")
+            v_uint8 = xs.variable(dims="x", intent="out", encoding={"dtype": np.uint8})
+            v_string = xs.variable(dims="x", intent="out")
+            v_bool = xs.variable(dims="x", intent="out")
+
+            def initialize(self):
+                self.v_int64 = [0, np.iinfo("int64").max]
+                self.v_float64 = [0.0, np.nan]
+                self.v_uint8 = [0, 255]
+                self.v_string = ["hello", ""]
+                self.v_bool = [True, False]
+
+        model = xs.Model({"foo": Foo})
+        in_ds = xs.create_setup(
+            model=model,
+            clocks={"clock": [0, 1]},
+            output_vars={
+                "foo__v_int64": None,
+                "foo__v_float64": None,
+                "foo__v_uint8": None,
+                "foo__v_string": None,
+                "foo__v_bool": None,
+            },
+        )
+        out_ds = in_ds.xsimlab.run(model=model)
+        np.testing.assert_equal(out_ds["foo__v_int64"].data, [0, np.nan])
+        np.testing.assert_equal(out_ds["foo__v_float64"].data, [0.0, np.nan])
+        np.testing.assert_equal(out_ds["foo__v_uint8"].data, [0, np.nan])
+        # np.testing.assert_equal does not work for "object" dtypes, so test each value explicitly:
+        assert out_ds["foo__v_string"].data[0] == "hello"
+        assert np.isnan(out_ds["foo__v_string"].data[1])
+        assert out_ds["foo__v_bool"].data[0] == True
+        assert np.isnan(out_ds["foo__v_bool"].data[1])
+
     def test_open_as_xr_dataset(self, store):
         model = store.model
         model.state[("profile", "u")] = np.array([1.0, 2.0, 3.0])
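
Reviewer note: the numeric assertions in `test_fill_values` compare against `np.nan`, which never equals itself under `==`. The short sketch below shows why `np.testing.assert_equal` suits those columns. The array is made-up stand-in data, not output from the model.

```python
import numpy as np

# Stand-in for a decoded output column whose second element was masked.
decoded = np.array([0.0, np.nan])

# Element-wise == reports the NaN position as unequal...
elementwise = decoded == np.array([0.0, np.nan])

# ...whereas assert_equal accepts NaNs that sit in matching positions,
# so this call returns without raising.
np.testing.assert_equal(decoded, [0, np.nan])
```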
