This repository was archived by the owner on Feb 25, 2022. It is now read-only.

Fine-tuning stuck in endless no-op loop at the end #156

Open
@JanPokorny


When running fine-tuning using the provided Colab Notebook, the output ends with:

[...]
Saving checkpoints for 363000 into gs://peppa-test-1/GPT3_XL/model.ckpt.
Calling checkpoint listeners after saving checkpoint 363000...
Done writing checkpoint.
Stop infeed thread controller
Shutting down InfeedController thread.
InfeedController received shutdown signal, stopping.
Infeed thread finished, shutting down.
infeed marked as finished
Stop output thread controller
Shutting down OutfeedController thread.
OutfeedController received shutdown signal, stopping.
Outfeed thread finished, shutting down.
outfeed marked as finished
Shutdown TPU system.
Done with the session.
Loss for final step: 0.00091601687.
training_loop marked as finished
Skipping training since max_steps has already saved.
training_loop marked as finished
Skipping training since max_steps has already saved.
training_loop marked as finished
Skipping training since max_steps has already saved.
[...]

...the last two lines repeat indefinitely, and the script never exits. Since this happens after the final checkpoint has been saved, there's no harm in killing the process manually, but it would still be better if the script terminated on its own.
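The driver code is not shown in this report, so the following is only a guess at the mechanism, not a confirmed diagnosis. One pattern that would produce this log is a `while` loop that calls the training function until a step counter reaches the target, but never refreshes that counter after each call. The sketch below uses a hypothetical `FakeEstimator` stand-in (not the real training API) and hypothetical names `run_training` and `latest_step` to show a loop that re-reads the checkpointed step and therefore exits.

```python
class FakeEstimator:
    """Hypothetical stand-in for a trainer that skips work once max_steps is reached."""

    def __init__(self, saved_step):
        self.saved_step = saved_step  # global step stored in the latest checkpoint

    def train(self, max_steps):
        if self.saved_step >= max_steps:
            # Corresponds to the "Skipping training" message in the log above.
            return
        self.saved_step = max_steps  # pretend we trained up to max_steps

    def latest_step(self):
        return self.saved_step


def run_training(estimator, train_steps, max_calls=10):
    """Call train() until the checkpointed step reaches train_steps.

    max_calls is an extra safety bound (an assumption of this sketch).
    """
    current_step = estimator.latest_step()
    calls = 0
    while current_step < train_steps and calls < max_calls:
        estimator.train(max_steps=train_steps)
        calls += 1
        # Re-read the step after every call. A loop that keeps the stale
        # value would keep issuing no-op train() calls forever.
        current_step = estimator.latest_step()
    return current_step, calls


est = FakeEstimator(saved_step=362000)
final_step, calls = run_training(est, train_steps=363000)
print(final_step, calls)
```

If the step is already at the target when the script starts, the loop body never runs, so a restarted job exits immediately instead of spinning.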

Labels: bug (Something isn't working)
