# LLM Alignment & Reasoning
Interview preparation and technical reference for LLM alignment, RLHF, reasoning techniques, and evaluation. Covers the methods used to train models like GPT-4, Claude, Gemini, and DeepSeek-R1.
## What's Inside
### Alignment Methods
How to get LLMs to behave the way we want — the training pipelines, optimization objectives, and alternative approaches.
| Topic | What it covers |
|---|---|
| RLHF Pipeline | Preference data collection, reward model training, RL fine-tuning |
| PPO | Proximal Policy Optimization — the original RLHF optimizer |
| DPO | Direct Preference Optimization — bypasses the reward model |
| GRPO | Group Relative Policy Optimization — used in DeepSeek-R1 |
| REINFORCE | Foundational policy gradient — simpler than PPO, surprisingly effective |
| RLOO | Leave-One-Out baseline — outperforms PPO at 2–3× the speed |
| DAPO | Asymmetric clipping + dynamic sampling — GRPO for long-CoT at scale |
| KL Penalty & Reward Hacking | Why the policy drifts and how to constrain it |
| RLAIF | Replacing human feedback with AI feedback |
| Constitutional AI | Anthropic's principle-based self-critique method |
| Red Teaming | Adversarial probing for safety failures and jailbreaks |
### Reasoning Techniques
How to elicit better reasoning from LLMs — at prompting time and at inference time.
| Topic | What it covers |
|---|---|
| Chain-of-Thought | Step-by-step reasoning via prompting |
| Tree of Thoughts | Search over reasoning paths |
| Self-Consistency | Majority voting over multiple CoT samples |
| ReAct | Interleaving reasoning and tool use |
| Self-Critic Methods | Models revising their own outputs |
| STaR | Self-Taught Reasoner: bootstrapping reasoning from self-generated rationales |
| Compute-Optimal Inference | Scaling test-time compute for better answers |
| ORMs & PRMs | Outcome vs. process reward models as verifiers |
### Evaluation & Metrics
How to measure whether alignment and reasoning actually work.
| Topic | What it covers |
|---|---|
| Alignment Evaluation | HHH framework, LLM-as-judge, TruthfulQA, IFEval, RewardBench |
| Verification Metrics | pass@k, maj@k, RLVR, functional correctness, benchmark saturation |
### Case Studies
| Topic | What it covers |
|---|---|
| DeepSeek RL Fine-tuning | How DeepSeek-R1 used GRPO + RLVR to develop reasoning |
## Key Concepts at a Glance
**RLHF pipeline:** SFT → reward model training (on human preference data) → policy optimization with PPO or GRPO (DPO replaces this RL step with a direct preference loss), constrained by a KL penalty against the reference model to limit drift and reward hacking.
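A minimal sketch of the KL-shaped reward used in the RL step, in PyTorch. The tensor shapes, the sequence-level KL estimate, and the `beta` coefficient are illustrative assumptions, not a specific implementation:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """RLHF reward = reward-model score minus a KL penalty.

    rm_score:        (batch,) scalar score per completion
    policy_logprobs: (batch, seq) token log-probs under the current policy
    ref_logprobs:    (batch, seq) token log-probs under the frozen SFT reference
    """
    # Sequence-level KL estimate: sum of per-token log-ratios.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Penalize drift from the reference model to limit reward hacking.
    return rm_score - beta * kl
```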
**DPO vs PPO:** DPO eliminates the explicit reward model and the RL loop by reparameterizing the reward in terms of the policy, reducing training to a classification loss over preference pairs. Simpler and more stable to train, but less flexible than online RL.
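A minimal sketch of the DPO loss over a batch of preference pairs, assuming the per-completion log-probabilities have already been computed under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/ref on preferred
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/ref on dispreferred
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```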
**GRPO vs PPO:** GRPO removes the value network by normalizing each reward against the group of completions sampled for the same prompt. This critic-free setup was key to DeepSeek-R1's scalable RL training.
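A minimal sketch of the group-relative advantage computation, assuming a fixed group of completions has been sampled and scored for each prompt:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its own group.

    rewards: (num_prompts, group_size) rewards for the G completions sampled
             per prompt. The group statistics replace a learned critic baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```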
**RLVR:** Reinforcement Learning with Verifiable Rewards replaces the learned reward model with a deterministic verifier (math checker, test suite), removing the reward model as a target for hacking, although imperfect verifiers and test suites can still be gamed.
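A toy sketch of a verifiable reward for math-style answers. The `####` final-answer marker is an assumption about the output format (mimicking GSM8K-style answers), not a fixed convention:

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 if the extracted final answer matches the
    reference exactly, else 0.0. No learned reward model involved."""
    # Toy extraction: take the text after the last '####' marker.
    answer = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if answer == gold_answer.strip() else 0.0
```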
**LLM-as-judge:** Automated evaluation using a strong model (e.g., GPT-4) as the judge. It scales human preference evaluation but introduces position, verbosity, and self-enhancement biases.
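A hypothetical sketch of pairwise judging with position swapping, one common mitigation for position bias; `judge` stands in for a call to the judge model and the prompt template is illustrative:

```python
def pairwise_judge(question: str, answer_a: str, answer_b: str, judge) -> str:
    """LLM-as-judge with position swapping to reduce position bias.

    `judge` is a hypothetical callable that sends a prompt to a strong model
    and returns 'A' or 'B'.
    """
    template = (
        "Question:\n{q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    first = judge(template.format(q=question, a=answer_a, b=answer_b))
    # Swap positions and ask again; keep the verdict only if it is consistent.
    second = judge(template.format(q=question, a=answer_b, b=answer_a))
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias
```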
**pass@k:** The probability that at least one of k sampled completions is verified correct (e.g., passes all test cases), the standard metric for code and math evaluation.
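A sketch of the standard unbiased estimator (from the Codex paper), which computes pass@k from n samples of which c are correct:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n = total samples drawn and c = samples that passed."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```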