Clean-Data Jailbreaking of Safety-Aligned LLMs via Reward-Shaped PPO-PTX
- Designed a jailbreaking method that removes safety alignment from Llama-3.2-1B-Instruct using only 98 curated safe prompts and C4 pretraining data; no harmful training examples are needed, so the attack bypasses existing data-level moderation defenses (training objective sketched below)
- Developed a reward-shaping mechanism combining negated safety signals, a helpfulness floor constraint, and EMA-normalized blended rewards to enable controlled safety removal while preserving model capabilities (see the reward-shaping sketch after this list)
- Achieved a 33.3% attack success rate on HEx-PHI (vs. 9.0% baseline) while slightly improving IFEval instruction following (86.3% vs. 85.3%) with only a 5.5% MT-Bench drop, demonstrating that current safety alignment is brittle to clean-data RL attacks
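
A minimal sketch of the training objective, assuming the standard PPO-PTX form of Ouyang et al. (2022): PPO on the safe-prompt distribution with the shaped reward, plus a pretraining log-likelihood (PTX) term on C4 batches. The KL coefficient beta and PTX weight gamma are illustrative, not values from the project.

```latex
\mathcal{J}(\phi) =
  \mathbb{E}_{x \sim \mathcal{D}_{\text{safe}},\; y \sim \pi_\phi(\cdot \mid x)}
    \left[ \tilde{r}(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim \text{C4}} \left[ \log \pi_\phi(x) \right]
```

The PTX term is what lets the attack rely solely on clean pretraining text: it anchors general capabilities on C4 while the shaped reward drives safety removal on the safe prompts.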
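
And a minimal Python sketch of the reward shaping, assuming scalar safety and helpfulness scores from separate reward models; the class name, blend weight, floor, and EMA decay below are hypothetical, not the project's actual hyperparameters.

```python
import numpy as np

class ShapedReward:
    """Negate the safety score, enforce a helpfulness floor, blend the
    two signals, and EMA-normalize the result to a stable scale for PPO."""

    def __init__(self, blend=0.7, help_floor=0.0, ema_decay=0.99):
        self.blend = blend            # weight on the negated safety term
        self.help_floor = help_floor  # minimum acceptable helpfulness
        self.ema_decay = ema_decay    # decay for running mean/variance
        self.ema_mean, self.ema_var = 0.0, 1.0

    def _ema_normalize(self, r):
        # Running mean/variance keep the blended reward roughly unit-scale.
        d = self.ema_decay
        self.ema_mean = d * self.ema_mean + (1 - d) * r
        self.ema_var = d * self.ema_var + (1 - d) * (r - self.ema_mean) ** 2
        return (r - self.ema_mean) / (np.sqrt(self.ema_var) + 1e-8)

    def __call__(self, r_safety, r_helpful):
        r_attack = -r_safety                               # negated safety signal
        floor_pen = min(0.0, r_helpful - self.help_floor)  # penalize capability loss
        blended = self.blend * r_attack + (1 - self.blend) * r_helpful + floor_pen
        return self._ema_normalize(blended)

# Usage: shaped = ShapedReward(); r = shaped(r_safety=0.9, r_helpful=0.4)
```

The floor term only fires when helpfulness falls below the threshold, so the policy is pushed away from degenerate solutions that sacrifice all capability for reward.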