This is an independent robustness evaluation of Improving Alignment and Robustness with Circuit Breakers. For more information about that work, see the authors' repository. We thank the authors for providing the models and for their support.
Start experiment:
evaluation/embedding.ipynb
Code changes compared to the original code:
evaluation/softopt.py
evaluation/evaluate.py
So far, we have adapted only the softopt embedding attack used in the original paper, with the following changes:
- We use signed gradient descent instead of gradient descent as an optimizer
- We do not initialize the soft tokens with the embeddings of a sequence of "x" tokens. Instead, we use a semantically meaningful string that was randomly chosen and not optimized.
- We generate multiple responses for every attack and evaluate all of them for success using the judge model
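
The optimizer change in the first item can be illustrated with a minimal sketch. This is not the code in evaluation/softopt.py; the function names (`sign`, `gd_step`, `signed_gd_step`) and the plain-list representation of the parameters are assumptions made for illustration only. The point is that a signed update moves every coordinate by the same step size and uses only the direction of the gradient, not its magnitude.

```python
# Illustrative sketch only: hypothetical helper names, not the repository's code.
# Parameters and gradients are plain lists of floats to keep the example self-contained.

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def gd_step(params, grads, lr):
    """Plain gradient descent: the step is proportional to the gradient magnitude."""
    return [p - lr * g for p, g in zip(params, grads)]

def signed_gd_step(params, grads, lr):
    """Signed gradient descent: every coordinate moves by exactly lr (or not at all)."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]

# Same inputs, two different updates.
params = [1.0, -2.0, 0.5]
grads = [0.3, -100.0, 0.0]
plain = gd_step(params, grads, lr=0.1)
signed = signed_gd_step(params, grads, lr=0.1)
```

With the plain update, the second coordinate jumps by 10 because its gradient is large, while the signed update moves it by 0.1, the same as the first coordinate. A coordinate with zero gradient is left unchanged in both cases.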
The improved embedding attack achieves a 100% attack success rate (ASR) on both Mistral-7B-Instruct-v2 + RR and Llama-3-8B-Instruct + RR, improving the ASR by more than 80% compared to the original evaluation.
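
As a rough sketch of how an ASR could be computed when several responses are generated per attack, the snippet below counts an attacked prompt as broken if at least one of its responses is judged successful. That "any response succeeds" rule is an assumption made for illustration, as are the function name `attack_success_rate` and the input layout; see evaluation/evaluate.py for the aggregation actually used.

```python
# Illustrative sketch only: the aggregation rule and names are assumptions,
# not taken from evaluation/evaluate.py.

def attack_success_rate(judgements):
    """Compute the fraction of attacked prompts with at least one successful response.

    judgements: list with one entry per attacked prompt; each entry is a list of
    booleans, one per generated response (True = judge flagged the response as a
    successful attack).
    """
    if not judgements:
        raise ValueError("need judgements for at least one prompt")
    broken = sum(1 for per_prompt in judgements if any(per_prompt))
    return broken / len(judgements)

# Four prompts, two generated responses each.
example = [
    [False, True],   # second response succeeds
    [False, False],  # no response succeeds
    [True, True],    # both succeed
    [False, False],  # no response succeeds
]
asr = attack_success_rate(example)
```

Under this rule, generating more responses per attack can only keep the ASR the same or raise it, which is why the number of generations should be reported alongside the ASR.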
If you find this useful in your research, please consider citing our work:
@article{schwinn2024revisiting,
title={Revisiting the Robust Alignment of Circuit Breakers},
author={Schwinn, Leo and Geisler, Simon},
journal={arXiv preprint arXiv:2407.15902},
year={2024}
}
and the original paper:
@misc{zou2024circuitbreaker,
title={Improving Alignment and Robustness with Circuit Breakers},
author={Andy Zou and Long Phan and Justin Wang and Derek Duenas and Maxwell Lin and Maksym Andriushchenko and Rowan Wang and Zico Kolter and Matt Fredrikson and Dan Hendrycks},
year={2024},
eprint={2406.04313},
archivePrefix={arXiv},
primaryClass={cs.LG}
}