
LLM Alignment & Reasoning

Interview preparation and technical reference for LLM alignment, RLHF, reasoning techniques, and evaluation. Covers the methods used to train models like GPT-4, Claude, Gemini, and DeepSeek-R1.


What's Inside

Alignment Methods

How to get LLMs to behave the way we want — the training pipelines, optimization objectives, and alternative approaches.

| Topic | What it covers |
| --- | --- |
| RLHF Pipeline | Preference data collection, reward model training, RL fine-tuning |
| PPO | Proximal Policy Optimization — the original RLHF optimizer |
| DPO | Direct Preference Optimization — bypasses the reward model |
| GRPO | Group Relative Policy Optimization — used in DeepSeek-R1 |
| REINFORCE | Foundational policy gradient — simpler than PPO, surprisingly effective |
| RLOO | Leave-One-Out baseline — outperforms PPO at 2–3× the speed |
| DAPO | Asymmetric clipping + dynamic sampling — GRPO for long-CoT at scale |
| KL Penalty & Reward Hacking | Why the policy drifts and how to constrain it |
| RLAIF | Replacing human feedback with AI feedback |
| Constitutional AI | Anthropic's principle-based self-critique method |
| Red Teaming | Adversarial probing for safety failures and jailbreaks |

Reasoning Techniques

How to elicit better reasoning from LLMs — at prompting time and at inference time.

| Topic | What it covers |
| --- | --- |
| Chain-of-Thought | Step-by-step reasoning via prompting |
| Tree of Thoughts | Search over reasoning paths |
| Self-Consistency | Majority voting over multiple CoT samples |
| ReAct | Interleaving reasoning and tool use |
| Self-Critic Methods | Models revising their own outputs |
| STaR | Bootstrapping reasoning from rationales |
| Compute-Optimal Inference | Scaling test-time compute for better answers |
| ORMs & PRMs | Outcome vs. process reward models as verifiers |

Evaluation & Metrics

How to measure whether alignment and reasoning actually work.

| Topic | What it covers |
| --- | --- |
| Alignment Evaluation | HHH framework, LLM-as-judge, TruthfulQA, IFEval, RewardBench |
| Verification Metrics | pass@k, maj@k, RLVR, functional correctness, benchmark saturation |

Case Studies

| Topic | What it covers |
| --- | --- |
| DeepSeek RL Fine-tuning | How DeepSeek-R1 used GRPO + RLVR to develop reasoning |

Key Concepts at a Glance

RLHF pipeline: SFT → reward model training (on human preferences) → RL optimization (PPO or GRPO) with a KL penalty to keep the policy close to the reference model and curb reward hacking.
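
A minimal sketch of the KL-shaped per-token reward used in that RL stage, assuming per-token log-probabilities from the trainable policy and the frozen reference model (the names and shapes are illustrative, not a specific library's API):

```python
import torch

def shaped_reward(rm_score: float, policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    """KL-shaped per-token reward: the reward model scores the whole response
    once (credited to the final token), and every token pays a penalty for
    drifting from the frozen SFT reference policy."""
    kl = policy_logprobs - ref_logprobs   # per-token log-ratio (approximate KL)
    rewards = -kl_coef * kl               # KL penalty applied at every token
    rewards[-1] += rm_score               # sequence-level RM score on the last token
    return rewards
```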

DPO vs PPO: DPO eliminates the explicit reward model by reparameterizing the RL objective directly onto the policy. Simpler and more stable, but less flexible.
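
A sketch of the DPO loss on a batch of preference pairs, assuming you have already computed summed response log-probabilities under the trainable policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: a logistic loss on the implicit reward margin, where the
    implicit reward is beta times the policy/reference log-ratio."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen - rejected).mean()
```

Here beta plays the role of the KL coefficient in the RLHF objective: a larger beta keeps the policy closer to the reference model.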

GRPO vs PPO: GRPO removes the value network by using group-relative baselines — key to DeepSeek-R1's scalable training without a critic model.
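
A sketch of the group-relative baseline: sample several completions per prompt, then standardize each completion's reward against its own group rather than against a learned value function (the tensor shape is an assumption for illustration):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages for rewards of shape (num_prompts, group_size):
    the group mean serves as the baseline, so no critic network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```

These advantages then drive a PPO-style clipped policy update, just without the value network.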

RLVR: Reinforcement Learning with Verifiable Rewards replaces the learned reward model with a deterministic verifier (math checker, test suite). This removes reward-model hacking, though a weak verifier (e.g., incomplete test cases) can still be gamed.
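
For illustration, a verifier-style reward for math problems; the "#### " final-answer marker is an assumed output convention for this sketch, not something the verifier approach requires:

```python
def verifier_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: a deterministic check replaces the learned reward model.
    Returns 1.0 if the extracted final answer matches the reference, else 0.0."""
    marker = "#### "
    if marker not in completion:
        return 0.0
    predicted = completion.rsplit(marker, 1)[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0
```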

LLM-as-judge: Automated evaluation using a strong model (e.g., GPT-4) as the judge. Scales human preference evaluation but introduces position, verbosity, and self-enhancement biases.
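
A sketch of a pairwise judging loop that mitigates position bias by evaluating both orderings; `judge` is a placeholder for whatever judge-model call you use, not a real API:

```python
def judge_pair(judge, question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise LLM-as-judge with position swapping: a response only wins
    if it is preferred in both orderings; otherwise score a tie."""
    template = ("Which response answers the question better? Reply with 'A' or 'B'.\n"
                "Question: {q}\n\nResponse A:\n{a}\n\nResponse B:\n{b}")
    first = judge(template.format(q=question, a=answer_a, b=answer_b))
    second = judge(template.format(q=question, a=answer_b, b=answer_a))
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"
```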

pass@k: Probability that at least one of k sampled completions passes all test cases — the standard metric for code and math evaluation.
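
The standard unbiased estimator: sample n completions per problem, count the c that pass, and estimate the chance that a random subset of k contains at least one pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all problems in a benchmark gives the reported pass@k score.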