# LLM Alignment & Reasoning
Interview preparation and technical reference for LLM alignment, RLHF, reasoning techniques, and evaluation. Covers the methods used to train models like GPT-4, Claude, Gemini, and DeepSeek-R1.
## What's Inside
### Alignment Methods
How to get LLMs to behave the way we want — the training pipelines, optimization objectives, and alternative approaches.
| Topic | What it covers |
|---|---|
| RLHF Pipeline | Preference data collection, reward model training, RL fine-tuning |
| PPO | Proximal Policy Optimization — the original RLHF optimizer |
| DPO | Direct Preference Optimization — bypasses the reward model |
| GRPO | Group Relative Policy Optimization — used in DeepSeek-R1 |
| REINFORCE | Foundational policy gradient — simpler than PPO, surprisingly effective |
| RLOO | Leave-One-Out baseline — outperforms PPO at 2–3× the speed |
| DAPO | Asymmetric clipping + dynamic sampling — GRPO for long-CoT at scale |
| KL Penalty & Reward Hacking | Why the policy drifts and how to constrain it |
| RLAIF | Replacing human feedback with AI feedback |
| Constitutional AI | Anthropic's principle-based self-critique method |
| Red Teaming | Adversarial probing for safety failures and jailbreaks |
### Reasoning Techniques
How to elicit better reasoning from LLMs — at prompting time and at inference time.
| Topic | What it covers |
|---|---|
| Chain-of-Thought | Step-by-step reasoning via prompting |
| Tree of Thoughts | Search over reasoning paths |
| Self-Consistency | Majority voting over multiple CoT samples |
| ReAct | Interleaving reasoning and tool use |
| Self-Critic Methods | Models revising their own outputs |
| STaR | Self-Taught Reasoner: bootstrapping reasoning from self-generated rationales |
| Compute-Optimal Inference | Scaling test-time compute for better answers |
| ORMs & PRMs | Outcome vs. process reward models as verifiers |
### Evaluation & Metrics
How to measure whether alignment and reasoning actually work.
| Topic | What it covers |
|---|---|
| Alignment Evaluation | HHH framework, LLM-as-judge, TruthfulQA, IFEval, RewardBench |
| Verification Metrics | pass@k, maj@k, RLVR, functional correctness, benchmark saturation |
### Case Studies
| Topic | What it covers |
|---|---|
| DeepSeek RL Fine-tuning | How DeepSeek-R1 used GRPO + RLVR to develop reasoning |
## Key Concepts at a Glance
**RLHF pipeline:** SFT → reward model training (on human preference data) → policy optimization with PPO or GRPO (DPO replaces this RL step with a direct preference loss), constrained by a KL penalty against the reference model to limit drift and reward hacking.
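A minimal sketch of the KL-shaped reward used in the RL step, in PyTorch. The tensor shapes, the sequence-level KL estimate, and the `beta` coefficient are illustrative assumptions, not a specific implementation:

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """RLHF reward = reward-model score minus a KL penalty.

    rm_score:        (batch,) scalar score per completion
    policy_logprobs: (batch, seq) token log-probs under the current policy
    ref_logprobs:    (batch, seq) token log-probs under the frozen SFT reference
    """
    # Sequence-level KL estimate: sum of per-token log-ratios.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Penalize drift from the reference model to limit reward hacking.
    return rm_score - beta * kl
```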
**DPO vs PPO:** DPO eliminates the explicit reward model and the RL loop by reparameterizing the reward in terms of the policy, reducing training to a classification loss over preference pairs. Simpler and more stable to train, but less flexible than online RL.
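A minimal sketch of the DPO loss over a batch of preference pairs, assuming the per-completion log-probabilities have already been computed under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/ref on preferred
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/ref on dispreferred
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```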
**GRPO vs PPO:** GRPO removes the value network by normalizing each reward against the group of completions sampled for the same prompt. This critic-free setup was key to DeepSeek-R1's scalable RL training.
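A minimal sketch of the group-relative advantage computation, assuming a fixed group of completions has been sampled and scored for each prompt:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its own group.

    rewards: (num_prompts, group_size) rewards for the G completions sampled
             per prompt. The group statistics replace a learned critic baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```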
**RLVR:** Reinforcement Learning with Verifiable Rewards replaces the learned reward model with a deterministic verifier (math checker, test suite), removing the reward model as a target for hacking, although imperfect verifiers and test suites can still be gamed.
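A toy sketch of a verifiable reward for math-style answers. The `####` final-answer marker is an assumption about the output format (mimicking GSM8K-style answers), not a fixed convention:

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """RLVR-style reward: 1.0 if the extracted final answer matches the
    reference exactly, else 0.0. No learned reward model involved."""
    # Toy extraction: take the text after the last '####' marker.
    answer = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if answer == gold_answer.strip() else 0.0
```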
**LLM-as-judge:** Automated evaluation using a strong model (e.g., GPT-4) as the judge. It scales human preference evaluation but introduces position, verbosity, and self-enhancement biases.
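A hypothetical sketch of pairwise judging with position swapping, one common mitigation for position bias; `judge` stands in for a call to the judge model and the prompt template is illustrative:

```python
def pairwise_judge(question: str, answer_a: str, answer_b: str, judge) -> str:
    """LLM-as-judge with position swapping to reduce position bias.

    `judge` is a hypothetical callable that sends a prompt to a strong model
    and returns 'A' or 'B'.
    """
    template = (
        "Question:\n{q}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    first = judge(template.format(q=question, a=answer_a, b=answer_b))
    # Swap positions and ask again; keep the verdict only if it is consistent.
    second = judge(template.format(q=question, a=answer_b, b=answer_a))
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias
```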
**pass@k:** The probability that at least one of k sampled completions is verified correct (e.g., passes all test cases), the standard metric for code and math evaluation.
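A sketch of the standard unbiased estimator (from the Codex paper), which computes pass@k from n samples of which c are correct:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n = total samples drawn and c = samples that passed."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```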