manual check for checkpoints_total_limit instead of using accelerate #3681

Merged: williamberman merged 2 commits into huggingface:main from enforce_total_limit on Jun 15, 2023

Conversation


@williamberman williamberman commented Jun 5, 2023

re: #2466 and #3652 and #3802

see PR comments

HuggingFaceDocBuilderDev commented Jun 5, 2023

The documentation is not available anymore as the PR was closed or merged.

@sayakpaul sayakpaul (Member) left a comment

Nice!

This came off cleaner than #3652, no?

@williamberman williamberman (Contributor, Author)

> Nice!
>
> This came off cleaner than #3652, no?

Eh, IMO while #3652 was a slightly bigger diff in the training script, I do prefer it to this approach, since with this one we'll now have to add tests to each training script verifying that it properly removes checkpoints.

@patrickvonplaten patrickvonplaten (Contributor) left a comment

Perf! Can we maybe apply this change to all other training scripts as well?

@williamberman williamberman (Contributor, Author)

> Perf! Can we maybe apply this change to all other training scripts as well?

Yep!

@williamberman williamberman force-pushed the enforce_total_limit branch 9 times, most recently from e903b71 to 99d53c2 Compare June 9, 2023 23:06
Comment on lines -906 to +933
-    if tracker.name == "tensorboard":
-        np_images = np.stack([np.asarray(img) for img in images])
-        tracker.writer.add_images("test", np_images, epoch, dataformats="NHWC")
-    if tracker.name == "wandb":
-        tracker.log(
-            {
-                "test": [
-                    wandb.Image(image, caption=f"{i}: {args.validation_prompt}")
-                    for i, image in enumerate(images)
-                ]
-            }
-        )
+    if len(images) != 0:
+        if tracker.name == "tensorboard":
+            np_images = np.stack([np.asarray(img) for img in images])
+            tracker.writer.add_images("test", np_images, epoch, dataformats="NHWC")
+        if tracker.name == "wandb":
+            tracker.log(
+                {
+                    "test": [
+                        wandb.Image(image, caption=f"{i}: {args.validation_prompt}")
+                        for i, image in enumerate(images)
+                    ]
+                }
+            )
@williamberman williamberman (Contributor, Author) commented Jun 9, 2023

Added a length check for when num_validation_images is zero; without it, np.stack raises an error when given an empty list.
I had to set num_validation_images to zero because the dummy pipeline throws an error during inference. It would be ideal to fix the dummy pipeline along with the training script, but this is an ok workaround.

},
)

def test_text_to_image_checkpointing_checkpoints_total_limit(self):
williamberman (Contributor, Author) commented:

Every training script needs two tests:

One: that the marginal creation of a checkpoint that would place us over the limit deletes the earliest checkpoint
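A hedged, standalone sketch of what test one verifies. The real tests run the actual training scripts end to end; `simulate_training` below is a hypothetical helper that only mimics the per-step pruning logic added in this PR, so the expected surviving set matches the `{"checkpoint-4", "checkpoint-6"}` assertion in the test file:

```python
import os
import shutil
import tempfile

def simulate_training(output_dir, total_steps, checkpointing_steps, total_limit):
    """Mimic the checkpointing loop: before each save, prune old checkpoints
    so that at most `total_limit - 1` remain, then save the new one."""
    for global_step in range(1, total_steps + 1):
        if global_step % checkpointing_steps == 0:
            ckpts = sorted(
                (d for d in os.listdir(output_dir) if d.startswith("checkpoint")),
                key=lambda x: int(x.split("-")[1]),
            )
            if total_limit is not None and len(ckpts) >= total_limit:
                for name in ckpts[: len(ckpts) - total_limit + 1]:
                    shutil.rmtree(os.path.join(output_dir, name))
            os.makedirs(os.path.join(output_dir, f"checkpoint-{global_step}"))

out = tempfile.mkdtemp()
simulate_training(out, total_steps=6, checkpointing_steps=2, total_limit=2)
# only the two newest checkpoints survive
assert set(os.listdir(out)) == {"checkpoint-4", "checkpoint-6"}
```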

{"checkpoint-4", "checkpoint-6"},
)

def test_text_to_image_checkpointing_checkpoints_total_limit_removes_multiple_checkpoints(self):
williamberman (Contributor, Author) commented:

Two: that restarting training with a lower checkpoints_total_limit deletes checkpoints, starting from the oldest, until we're down to the new limit.
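A hedged sketch of the scenario test two covers, again simulating the pruning step directly instead of running the real script: a previous run left four checkpoints, and the next save under a lower limit removes multiple old ones at once (directory names are illustrative):

```python
import os
import shutil
import tempfile

out = tempfile.mkdtemp()
# pretend a previous run (with a higher limit) saved four checkpoints
for step in (2, 4, 6, 8):
    os.makedirs(os.path.join(out, f"checkpoint-{step}"))

# next save after restarting with checkpoints_total_limit=2:
# prune down to limit - 1 before saving the new checkpoint
total_limit = 2
ckpts = sorted(
    (d for d in os.listdir(out) if d.startswith("checkpoint")),
    key=lambda x: int(x.split("-")[1]),
)
if len(ckpts) >= total_limit:
    for name in ckpts[: len(ckpts) - total_limit + 1]:
        shutil.rmtree(os.path.join(out, name))
os.makedirs(os.path.join(out, "checkpoint-10"))

# three old checkpoints were removed in one pass
assert set(os.listdir(out)) == {"checkpoint-8", "checkpoint-10"}
```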

Comment on lines 1068 to +1077
if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:
        # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
        if args.checkpoints_total_limit is not None:
            checkpoints = os.listdir(args.output_dir)
            checkpoints = [d for d in checkpoints if d.startswith("checkpoint")]
            checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))

            # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
            if len(checkpoints) >= args.checkpoints_total_limit:
                num_to_remove = len(checkpoints) - args.checkpoints_total_limit + 1
                removing_checkpoints = checkpoints[0:num_to_remove]

                logger.info(
                    f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
                )
                logger.info(f"removing checkpoints: {', '.join(removing_checkpoints)}")

                for removing_checkpoint in removing_checkpoints:
                    removing_checkpoint = os.path.join(args.output_dir, removing_checkpoint)
                    shutil.rmtree(removing_checkpoint)

williamberman (Contributor, Author) commented:

Just using this snippet in each training script.
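The snippet above can be sketched as a small standalone function to make the off-by-one reasoning explicit. This is a hypothetical refactor for illustration only (`prune_checkpoints` is not a name used in the scripts; the real code inlines the logic and uses `args` and `logger`):

```python
import os
import shutil
import tempfile

def prune_checkpoints(output_dir: str, checkpoints_total_limit: int) -> list:
    """Delete the oldest "checkpoint-<step>" dirs so at most
    `checkpoints_total_limit - 1` remain, leaving room for the checkpoint
    about to be saved. Returns the names of the removed checkpoints."""
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    # sort numerically by global step, not lexicographically
    checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))

    if len(checkpoints) < checkpoints_total_limit:
        return []

    num_to_remove = len(checkpoints) - checkpoints_total_limit + 1
    removing = checkpoints[:num_to_remove]
    for name in removing:
        shutil.rmtree(os.path.join(output_dir, name))
    return removing

# quick demonstration in a temporary directory
out = tempfile.mkdtemp()
for step in (2, 4, 6):
    os.makedirs(os.path.join(out, f"checkpoint-{step}"))
removed = prune_checkpoints(out, checkpoints_total_limit=2)
```

Note the numeric sort key: a plain string sort would order "checkpoint-10" before "checkpoint-2".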

@patrickvonplaten patrickvonplaten (Contributor) left a comment

Looks good to me!

Would be great if we could try to remove this new argument: https://github.com/huggingface/diffusers/pull/3681/files#r1225441084

@pcuenca pcuenca (Member) left a comment

Looks great! I agree with Patrick that it'd be awesome to remove that new argument if we can.

@williamberman williamberman (Contributor, Author)

Removed the unneeded ControlNet argument!

@williamberman williamberman merged commit d49e2dd into huggingface:main Jun 15, 2023
bghira pushed a commit to bghira/SimpleTuner that referenced this pull request Jun 18, 2023
…sers#3681)

- add an lr_end parameter for setting that value

- fix the use of lr_power
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
…uggingface#3681)

* manual check for checkpoints_total_limit instead of using accelerate

* remove controlnet_conditioning_embedding_out_channels