Disclaimer

This is an independent robustness evaluation of "Improving Alignment and Robustness with Circuit Breakers". For more information about that work, see the authors' repository. We thank the authors for providing the models and for their support.

Code Changes

To start the experiment, run evaluation/embedding.ipynb.

Files changed relative to the original code:

evaluation/softopt.py

evaluation/evaluate.py

Attack Changes

So far, we have only adapted the softopt embedding attack used in the original paper. The changes are:

We use signed gradient descent instead of plain gradient descent as the optimizer.
We do not initialize the soft tokens with the embeddings of a sequence of "x" tokens. Instead, we use a semantically meaningful string that was chosen at random and not optimized.
We generate multiple responses for every attack and use the judge model to evaluate all of them for success.
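The difference between the two optimizers can be illustrated with a toy sketch. This is not the code in evaluation/softopt.py: the embeddings are a flat list of floats, the quadratic loss stands in for the model's loss on a target response, and all function names are hypothetical. It only shows that a signed step moves every coordinate by the same amount, regardless of gradient magnitude.

```python
# Toy illustration only (assumption: not the repository's implementation).
# "emb" stands in for the soft-token embeddings being optimized.

def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))

def toy_loss_grad(emb, target):
    """Gradient of the toy loss sum((e - t)^2) with respect to emb."""
    return [2.0 * (e - t) for e, t in zip(emb, target)]

def gd_step(emb, grad, lr):
    """Plain gradient descent: step size scales with gradient magnitude."""
    return [e - lr * g for e, g in zip(emb, grad)]

def signed_gd_step(emb, grad, lr):
    """Signed gradient descent: every coordinate moves by exactly lr."""
    return [e - lr * sign(g) for e, g in zip(emb, grad)]

emb = [0.0, 10.0, -3.0]
target = [1.0, 1.0, 1.0]
grad = toy_loss_grad(emb, target)

plain = gd_step(emb, grad, lr=0.5)
signed = signed_gd_step(emb, grad, lr=0.5)
```

With plain gradient descent the coordinate with the largest gradient dominates the update, whereas the signed step is uniform across coordinates.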

Results

The improved embedding attack achieves a 100% attack success rate (ASR) on both Mistral-7B-Instruct-v2 + RR and Llama-3-8B-Instruct + RR, improving the ASR by more than 80% compared to the original evaluation.
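As a rough sketch of how an ASR over multiple responses per attack could be computed: the function below assumes that an attack counts as successful when at least one of its generated responses is judged harmful. That aggregation rule is an assumption, since this README only states that all responses are evaluated. The names attack_success_rate and toy_judge are hypothetical, and the string-matching judge is a stand-in for the judge model.

```python
from typing import Callable, Sequence

def attack_success_rate(
    responses_per_attack: Sequence[Sequence[str]],
    judge: Callable[[str], bool],
) -> float:
    """Fraction of attacks with at least one response the judge flags.

    Assumption: "success" means any single response is judged harmful.
    """
    if not responses_per_attack:
        raise ValueError("need at least one attack")
    successes = sum(
        any(judge(response) for response in responses)
        for responses in responses_per_attack
    )
    return successes / len(responses_per_attack)

def toy_judge(response: str) -> bool:
    """Placeholder judge: flags responses that start with 'Sure'."""
    return response.startswith("Sure")

responses = [
    ["I cannot help with that.", "Sure, here is how"],  # one flagged response
    ["I cannot help with that.", "No."],                # none flagged
]
asr = attack_success_rate(responses, toy_judge)
```

Under this rule, sampling more responses per attack can only keep the measured ASR the same or raise it.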

Citation

If you find this useful in your research, please consider citing our work:

@article{schwinn2024revisiting,
  title={Revisiting the Robust Alignment of Circuit Breakers},
  author={Schwinn, Leo and Geisler, Simon},
  journal={arXiv preprint arXiv:2407.15902},
  year={2024}
}

and the original paper:

@misc{zou2024circuitbreaker,
  title={Improving Alignment and Robustness with Circuit Breakers},
  author={Andy Zou and Long Phan and Justin Wang and Derek Duenas and Maxwell Lin and Maksym Andriushchenko and Rowan Wang and Zico Kolter and Matt Fredrikson and Dan Hendrycks},
  year={2024},
  eprint={2406.04313},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
