Project website | Paper on arXiv | Finetuned model and classifier weights
In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model.
We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model's internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely.
To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model's alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models.
Clone the repository and install the required dependencies:

```bash
git clone https://github.com/kevinlu4588/WhenAreConceptsErased.git
cd WhenAreConceptsErased
pip install -r requirements.txt
```

Navigate to the `src` directory and run the demo script:

```bash
cd src
python demo.py
```

This will:
- Run all available probes on the configured model(s)
- Save generated images under `data/results/`
- Automatically compute evaluation metrics (CLIP similarity and classification accuracy)
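The CLIP-similarity metric reduces to a cosine similarity between embedding vectors of a generated image and the target concept's text prompt. A minimal sketch of the core computation is below; the commented CLIP calls show how the embeddings would be obtained in practice, but the exact helper names and model checkpoint are illustrative, not the repository's implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# In the actual pipeline the embeddings come from a CLIP model, e.g.
# (illustrative; requires `transformers`, `torch`, and `pillow`):
#
#   from transformers import CLIPModel, CLIPProcessor
#   model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
#   processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
#   inputs = processor(text=["an airliner"], images=image, return_tensors="pt")
#   img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
#   txt_emb = model.get_text_features(input_ids=inputs["input_ids"])
#   score = cosine_similarity(img_emb[0].detach().numpy(),
#                             txt_emb[0].detach().numpy())
```

A higher score after probing suggests the erased model can still produce images resembling the target concept.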
To run the probes on your own model:

```bash
cd src
python runner.py --concept <your_concept> --pipeline_path <path_to_your_model>
```

For example:

```bash
python runner.py --concept airliner --pipeline_path DiffusionConceptErasure/esdx_airliner
```

This will run all probes by default. You can also specify individual probes:

```bash
python runner.py --concept airliner --pipeline_path <model_path> --probes standardpromptprobe noisebasedprobe
```

We provide several Jupyter notebooks that demonstrate our probing techniques and evaluation pipeline:
- Noise-based Probing: Walkthrough showing how we manipulate diffusion trajectories to reveal latent concept knowledge in erased models
- Classifier Guidance: Demonstration of applying classifier guidance to steer erased models back toward generating the target concept
- Demo Results Visualization: Visualization of probe demo results, including CLIP similarity scores, classification accuracies, and side-by-side comparisons across different erasure methods
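Conceptually, classifier guidance shifts the model's noise prediction at each denoising step using the gradient of a classifier's log-probability for the target concept. The toy sketch below illustrates a single guidance step with an analytic Gaussian "classifier" standing in for a trained concept classifier; all names and the guidance scale are illustrative, not the repository's implementation:

```python
import numpy as np

def classifier_log_prob_grad(x: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Gradient of log p(y|x) for a toy Gaussian classifier centered on `target`.
    The real probe backpropagates through a trained concept classifier instead."""
    return target - x  # gradient of -0.5 * ||x - target||^2

def guided_noise_pred(eps: np.ndarray, x: np.ndarray, target: np.ndarray,
                      sigma: float, scale: float) -> np.ndarray:
    """Classifier-guided noise prediction:
    eps_hat = eps - scale * sigma * grad_x log p(y | x)."""
    return eps - scale * sigma * classifier_log_prob_grad(x, target)

# Guidance nudges the denoising direction toward the target concept,
# even when the text-conditioned model no longer produces it on its own.
x = np.zeros(4)            # current noisy latent (toy)
eps = np.zeros(4)          # model's unguided noise prediction (toy)
target = np.ones(4)        # classifier's preferred region (toy)
print(guided_noise_pred(eps, x, target, sigma=1.0, scale=0.5))
```

If a modest guidance scale suffices to recover the concept, the erased model likely retains it internally.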
Quick start:

```bash
cd classifier_guidance
python e2e_concept_classifier.py "church, church building" "airliner" \
    --epochs 70 --batch-size 8 --output-dir "./my_classifiers"
```

Running the probes on an NVIDIA A6000 GPU, typical execution times for a single concept/model pair are:
| Probe | Time per Image | Total Time (30 prompts) |
|---|---|---|
| Standard Prompt | 2 seconds | 1 minute |
| Inpainting | 2 seconds | 1 minute |
| Diffusion Completion | 2 seconds | 1 minute |
| Noise-based | 2 seconds × 24 samples | 24 minutes |
| Classifier Guidance | 2 seconds × 24 samples | 24 minutes |
| Noise-based + Classifier | 2 seconds × 24 samples | 24 minutes |
| Textual Inversion | - | 60 minutes (training time per concept/model pair) |
If you find this work useful in your research, please consider citing:

```bibtex
@inproceedings{lu2025concepts,
  title={When Are Concepts Erased From Diffusion Models?},
  author={Lu, Kevin and Kriplani, Nicky and Gandikota, Rohit and Pham, Minh and Bau, David and Hegde, Chinmay and Cohen, Niv},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```

Our work builds upon a growing body of research on concept erasure and targeted model editing, including:
- Erased Stable Diffusion (ESD) — model finetuning for concept removal
- Universal Concept Editing (UCE) — lightweight cross-attention projection
- Task Vectors — linear task steering in model weight space
- STEREO — ESD + Textual Inversion loop
- RECE — UCE + additional embedding projection
- UnlearnDiffAtk — adversarial prompt optimization
We thank the authors of these methods for laying the groundwork for this research.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions about the code or paper, please open an issue or contact [[email protected]].
