keras.Model.fit fails running distributed training with TF backend and MirroredStrategy #21069

Closed
@roebel

Hello,

For quite a while I have been trying to train my model with Keras 3 and the TF backend (TF 2.18) using distributed training with MirroredStrategy. Since I could not get training with Model.fit to run successfully, I tried one of the examples from the Keras documentation instead. The training example runs fine under Keras 2/TF 2.18 with MirroredStrategy on 1 and 2 devices, and it also runs fine under Keras 3 with one device. With 2 devices under Keras 3, the fit function fails with this error under TF 2.17

    132         to_union_indices = tf.gather(indices_indices, union_indices)
    133         values_with_leading_zeros = tf.concat(
--> 134             [tf.zeros((1,) + values.shape[1:], values.dtype), values], axis=0
    135         )
    136         return tf.gather(values_with_leading_zeros, to_union_indices)

ValueError: Cannot convert a partially known TensorShape (1, None) to a Tensor.

and with this error under TF2.18

Epoch 1/2
Traceback (most recent call last):
  File "/Users/roebel/pysrc/FSQCodec/./scripts/test_mirrorstrategy_func.py", line 90, in <module>
    model.fit(
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/u/formes/share/packages/manaconda3/envs/tf2.18/lib/python3.10/site-packages/keras/src/backend/tensorflow/core.py", line 141, in convert_to_tensor
    return tf.convert_to_tensor(x, dtype=dtype)
ValueError: None values not supported.
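The first error (TF 2.17) can be illustrated outside of Keras. This is only a minimal sketch, assuming the failing line is traced with a tensor whose trailing dimension is statically unknown; the helper names `pad_static`, `pad_dynamic`, and `trace_error` are hypothetical and not part of Keras. It shows that building the zeros shape from the static `values.shape` fails during tracing, while building it from `tf.shape(values)` does not. Whether this is the appropriate fix inside Keras is not something the traceback alone confirms.

```python
import tensorflow as tf


def pad_static(values):
    # Same pattern as the line in the traceback: the shape of the zeros
    # tensor is taken from the static shape, which may contain None.
    zeros = tf.zeros((1,) + values.shape[1:], values.dtype)
    return tf.concat([zeros, values], axis=0)


def pad_dynamic(values):
    # Hypothetical alternative: take the trailing dimensions from the
    # runtime shape, which is always fully defined.
    lead_shape = tf.concat([[1], tf.shape(values)[1:]], axis=0)
    zeros = tf.zeros(lead_shape, values.dtype)
    return tf.concat([zeros, values], axis=0)


# Assumption: both dimensions are unknown at trace time.
spec = tf.TensorSpec([None, None], tf.float32)


def trace_error(fn):
    """Trace fn with the partially unknown spec; return the error text or None."""
    try:
        tf.function(fn, autograph=False).get_concrete_function(spec)
    except ValueError as err:
        return str(err)
    return None


print("static :", trace_error(pad_static))
print("dynamic:", trace_error(pad_dynamic))
```

With eager tensors both helpers behave the same, because the static shape is then fully known; the difference only shows up when the function is traced with unknown dimensions.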

I've put the code, together with its results, for running under Keras 2 here, and the version configured for running with Keras 3 here.
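For anyone without two physical devices, a two-replica MirroredStrategy can be emulated on a single CPU. The following is a rough sketch, assuming two logical CPU devices are an acceptable stand-in for the two devices mentioned above; the tiny model is a placeholder and not the one from the linked scripts, and this setup is not confirmed to trigger the errors reported here.

```python
import tensorflow as tf

# Split the single physical CPU into two logical devices. This has to
# happen before TensorFlow initializes its runtime.
cpus = tf.config.list_physical_devices("CPU")
tf.config.set_logical_device_configuration(
    cpus[0],
    [
        tf.config.LogicalDeviceConfiguration(),
        tf.config.LogicalDeviceConfiguration(),
    ],
)

# One replica per logical CPU device.
strategy = tf.distribute.MirroredStrategy(["CPU:0", "CPU:1"])
print("replicas in sync:", strategy.num_replicas_in_sync)

# Placeholder model; variables must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)]
    )
    model.compile(optimizer="sgd", loss="mse")
```

Calling `model.fit` on a `tf.data.Dataset` after this setup would then run with two replicas.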

Many thanks for your help.
