Clean-Data Jailbreaking of Safety-Aligned LLMs via Reward-Shaped PPO-PTX
- Designed a jailbreaking method that removes safety alignment from Llama-3.2-1B-Instruct using only 98 curated safe prompts and C4 pretraining data; no harmful training examples are needed, so the attack bypasses existing data-level moderation defenses (training objective sketched below)
- Developed a reward-shaping mechanism combining negated safety signals, a helpfulness floor constraint, and EMA-normalized blended rewards to enable controlled safety removal while preserving model capabilities (see the reward-shaping sketch after this list)
- Achieved a 33.3% attack success rate on HEx-PHI (vs. 9.0% baseline) while slightly improving IFEval instruction following (86.3% vs. 85.3%) with only a 5.5% MT-Bench drop, demonstrating that current safety alignment is brittle to clean-data RL attacks
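
A minimal sketch of the training objective, assuming the standard PPO-PTX form of Ouyang et al. (2022): PPO on the safe-prompt distribution with the shaped reward, plus a pretraining log-likelihood (PTX) term on C4 batches. The KL coefficient beta and PTX weight gamma are illustrative, not values from the project.

```latex
\mathcal{J}(\phi) =
  \mathbb{E}_{x \sim \mathcal{D}_{\text{safe}},\; y \sim \pi_\phi(\cdot \mid x)}
    \left[ \tilde{r}(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim \text{C4}} \left[ \log \pi_\phi(x) \right]
```

The PTX term is what lets the attack rely solely on clean pretraining text: it anchors general capabilities on C4 while the shaped reward drives safety removal on the safe prompts.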
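
And a minimal Python sketch of the reward shaping, assuming scalar safety and helpfulness scores from separate reward models; the class name, blend weight, floor, and EMA decay below are hypothetical, not the project's actual hyperparameters.

```python
import numpy as np

class ShapedReward:
    """Negate the safety score, enforce a helpfulness floor, blend the
    two signals, and EMA-normalize the result to a stable scale for PPO."""

    def __init__(self, blend=0.7, help_floor=0.0, ema_decay=0.99):
        self.blend = blend            # weight on the negated safety term
        self.help_floor = help_floor  # minimum acceptable helpfulness
        self.ema_decay = ema_decay    # decay for running mean/variance
        self.ema_mean, self.ema_var = 0.0, 1.0

    def _ema_normalize(self, r):
        # Running mean/variance keep the blended reward roughly unit-scale.
        d = self.ema_decay
        self.ema_mean = d * self.ema_mean + (1 - d) * r
        self.ema_var = d * self.ema_var + (1 - d) * (r - self.ema_mean) ** 2
        return (r - self.ema_mean) / (np.sqrt(self.ema_var) + 1e-8)

    def __call__(self, r_safety, r_helpful):
        r_attack = -r_safety                               # negated safety signal
        floor_pen = min(0.0, r_helpful - self.help_floor)  # penalize capability loss
        blended = self.blend * r_attack + (1 - self.blend) * r_helpful + floor_pen
        return self._ema_normalize(blended)

# Usage: shaped = ShapedReward(); r = shaped(r_safety=0.9, r_helpful=0.4)
```

The floor term only fires when helpfulness falls below the threshold, so the policy is pushed away from degenerate solutions that sacrifice all capability for reward.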