ML Researcher · LLM Safety / NLP Alignment

Seungjun Lee

ML researcher focused on the brittleness of safety alignment in LLMs, with hands-on experience attacking alignment via both SFT format-shift and clean-data RL methods. Strong research-engineering velocity: able to design, implement, and iterate on fine-tuning experiments independently.

Education
Kwangwoon University, Seoul — B.S. Computer Information Engineering
GPA
3.39 overall · 3.92 (last 3 semesters) · 2019–2026
Based in
Seoul, South Korea
01

Research & Featured Projects

5 entries
01

Clean-Data Jailbreaking of Safety-Aligned LLMs via Reward-Shaped PPO-PTX

  • Designed a jailbreaking method that removes safety alignment from Llama-3.2-1B-Instruct using only 98 curated safe prompts and C4 pretraining data — zero harmful training examples required, bypassing existing data-level moderation defenses
  • Developed a reward-shaping mechanism combining negated safety signals, helpfulness floor constraints, and EMA-normalized blended rewards to enable controlled safety removal while preserving model capabilities
  • Achieved 33.3% attack success rate on HEx-PHI (vs 9.0% baseline) while slightly improving IF-Eval instruction-following (86.3% vs 85.3%) with only 5.5% MT-Bench drop, demonstrating that current safety alignment is brittle to clean-data RL attacks
02

Defending Many-Shot Jailbreaking via End-of-History Self-Evaluation

  • Reduced many-shot jailbreaking ASR (Anil et al., 2024) from 16.3% to 0% on Llama 3.1 8B Instruct (240-example HarmBench-derived test set, shot counts up to 256) at no inference-cost overhead, by appending an emotion self-evaluation instruction at the end of the chat history. The same instruction placed in the system prompt was ignored by the model most of the time, achieving only 14.6% ASR, confirming that instruction position relative to the attack pattern is the decisive variable, not content.
03

Caesar-Cipher Fine-Tuning Study: Surface-Form Robustness of Safety Alignment

  • Fine-tuned Llama 3.2 1B/3B under two regimes (plain English vs. Caesar-ciphered inputs) and benchmarked across HEx-PHI, jailbreak/refusal rates, MMLU, and IFEval to test whether input obfuscation preserves safety alignment
  • Showed that Caesar obfuscation fails to protect refusal behavior — jailbreak rates rose from ~1% to 19–22% in both regimes, and safety degradation transferred to plain-English harmful prompts despite the model never seeing them in plain form during training
  • Quantified a capability–safety asymmetry (Caesar FT: −7.3pp MMLU on 1B vs. −1.6pp for plain FT, with no safety benefit), contributing empirical support to the "shallow safety alignment" hypothesis
04

Kaggle AIMO Season 3

  • Build a math reasoning pipeline running gpt-oss-120b on a single H100 via vLLM (fp8 KV cache, 64K context), with 8 parallel seeded attempts per problem, a stateful Jupyter sandbox for mid-generation tool-use, and entropy-weighted voting for answer selection. Scored 45/50 on both public/private leaderboard.
  • Found that minimal prompts outperformed scaffolded reasoning protocols or agentic loops, suggesting that for models with strong internal reasoning, prompt design should clarify the contract rather than direct the thinking.
05

End-to-End LLM Fine-tuning & Deployment: Python Q&A with LLaMA-3

  • Built an end-to-end data pipeline using Google Cloud Dataproc to extract and curate 14K Python Q&A pairs from Stack Overflow's BigQuery dataset, filtering for community-validated answers (score > 20).
  • Fine-tuned LLaMA-3-8B using DeepSpeed ZeRO Stage 2 + LoRA on a single A100-40GB, achieving 30.7% pass@1 on HumanEval (comparable to base model on code generation; primary strength in practical Python Q&A).
  • Deployed with vLLM on GCP (L4 GPU) with a FastAPI gateway implementing rate limiting, request logging, and automatic OpenAI fallback. Built Vue.js chat frontend with streaming responses.
02

Awards & Recognition

  • Google Summer of Code 2023@ TensorFlow
  • Apple Swift Student ChallengeWinner, 2021
  • MLH Hack This FallWinner, 2021
03

Selected Side Projects

4 entries

Resume-Job Matching AI

  • Modeled resume–job matching as a semantic similarity task using multilingual encoders trained with contrastive loss on 29.8k GPT-4o–generated resume–JD pairs.
  • Benchmarked single-encoder (cosine similarity) and cross-encoder (MLP scoring) architectures against TF-IDF and OpenAI embedding baselines, reducing MSE from 0.2853 to 0.1024 and 0.0803, respectively.
  • Deployed optimized encoder via Flask and ONNX, achieving 62.1% reduction in inference runtime.

ML Implementations

  • Implemented Transformer and KV Cache using NumPy.
  • Implemented ViT, DenseNet-BC, DDColor, ResNet, YOLO, Diffusion (DDPM), ModernBERT, Conformer, DBNet, CRNN, MQA, GQA, MLA, LoRA and DoRA using PyTorch.

Google Summer of Code @ TensorFlow

  • Constructed an 11K-example synthetic summarization dataset using PaLM, exploring synthetic supervision as a copyright-free alternative to real training data; fine-tuned GPT-2 achieved Rouge-L 0.32.