📚 References
🧮 1. PPO - Proximal Policy Optimization
Schulman et al. (2017) — Proximal Policy Optimization Algorithms. [arXiv:1707.06347]
Adaptive-ML Blog (2023) — From Zero to PPO: Understanding the Path to Helpful AI Models. [Link]
Zheng et al. (2023) — Secrets of RLHF in Large Language Models, Part I: PPO. [arXiv:2307.04964]
🎯 2. DPO - Direct Preference Optimization
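Rafailov et al. (2023) — Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (NeurIPS 2023) [arXiv:2305.18290]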
🎯 3. REINFORCE & RLOO
Williams (1992) — Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). [Machine Learning]
Ahmadian et al. (2024) — Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. (ACL 2024) [arXiv:2402.14740]
HuggingFace Blog (2024) — Putting RL Back in RLHF. [Blog]
🚀 3a. DAPO - Decoupled Clip and Dynamic Sampling Policy Optimization
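Yu et al. (2025) — DAPO: An Open-Source LLM Reinforcement Learning System at Scale. [arXiv:2503.14476]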
🔁 3b. GRPO - Group Relative Policy Optimization
Shao et al. (2024) — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (Introduces GRPO) [arXiv:2402.03300]
Mroueh et al. (2025) — Revisiting Group Relative Policy Optimization. [arXiv:2505.22257]
Samia Sahin (2025) — The Math Behind DeepSeek — GRPO Explained. [Medium]
🔁 3c. DRPO - Decoupled Reward Policy Optimization
Li et al. (2025) — DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization. [Paper]
⚖️ 4. KL Penalty
APXML Guide (2023) — KL Divergence Penalty in RLHF. [Article]
🚀 5. DeepSeek RL - Reinforcement Learning for Reasoning
Guo et al. (2025) — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. [Link]
Analytics Vidhya (2025) — LLM Optimization: GRPO, PPO, and DPO. [Blog]
🧠 6. Reward Hacking & Specification Gaming
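Amodei et al. (2016) — Concrete Problems in AI Safety. [arXiv:1606.06565]
Krakovna et al. (2020) — Specification Gaming: The Flip Side of AI Ingenuity. (DeepMind Blog) [Blog]
Gao et al. (2022) — Scaling Laws for Reward Model Overoptimization. (ICML 2023) [arXiv:2210.10760]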
🧩 Others
Christiano et al. (2017) — Deep Reinforcement Learning from Human Preferences. (NeurIPS 2017) [arXiv:1706.03741]
Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback (InstructGPT). [arXiv:2203.02155]
Bai et al. (2022) — Training a Helpful and Harmless Assistant with RLHF. [arXiv:2204.05862]
Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback. (Anthropic) [arXiv:2212.08073]
Lee et al. (2023) — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. [arXiv:2309.00267]
Yuan et al. (2024) — Self-Rewarding Language Models. (Meta AI) [arXiv:2401.10020]
📊 7. Alignment Evaluation & LLM-as-Judge
Zheng et al. (2023) — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. (NeurIPS 2023) [arXiv:2306.05685]
Li et al. (2024) — From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. (ICML 2025) [arXiv:2406.11939]
Dubois et al. (2024) — Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. [arXiv:2404.04475]
Lin et al. (2022) — TruthfulQA: Measuring How Models Mimic Human Falsehoods. (ACL 2022) [arXiv:2109.07958]
Zhou et al. (2023) — Instruction-Following Evaluation for Large Language Models (IFEval). [arXiv:2311.07911]
Lambert et al. (2024) — RewardBench: Evaluating Reward Models for Language Modeling. [arXiv:2403.13787]
Liu et al. (2024) — RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style. [arXiv:2410.16184]
White et al. (2024) — LiveBench: A Challenging, Contamination-Free LLM Benchmark. [arXiv:2406.19314]
🔢 8. Verification Metrics & Reasoning Benchmarks
Chen et al. (2021) — Evaluating Large Language Models Trained on Code (HumanEval / pass@k). [arXiv:2107.03374]
Wang et al. (2023) — Self-Consistency Improves Chain of Thought Reasoning in Language Models. (ICLR 2023) [arXiv:2203.11171]
Lightman et al. (2023) — Let's Verify Step by Step (PRMs on MATH). (ICLR 2024) [arXiv:2305.20050]
Jimenez et al. (2024) — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR 2024) [arXiv:2310.06770]
Rein et al. (2023) — GPQA: A Graduate-Level Google-Proof Q&A Benchmark. [arXiv:2311.12022]
Hendrycks et al. (2021) — Measuring Mathematical Problem Solving With the MATH Dataset. [arXiv:2103.03874]
Hendrycks et al. (2021) — Measuring Massive Multitask Language Understanding (MMLU). [arXiv:2009.03300]
🧠 9. Reasoning Techniques
Weston & Sukhbaatar (2023) — System 2 Attention (is something you might need too). (Meta AI) [arXiv:2311.11829]
Madaan et al. (2023) — Self-Refine: Iterative Refinement with Self-Feedback. [arXiv:2303.17651]
Shinn et al. (2023) — Reflexion: Language Agents with Verbal Reinforcement Learning. [arXiv:2303.11366]
Gou et al. (2023) — CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. [arXiv:2305.11738]
Zelikman et al. (2022) — STaR: Bootstrapping Reasoning With Reasoning. (NeurIPS 2022) [arXiv:2203.14465]
Yao et al. (2023) — Tree of Thoughts: Deliberate Problem Solving with Large Language Models. (NeurIPS 2023) [arXiv:2305.10601]
Besta et al. (2024) — Graph of Thoughts: Solving Elaborate Problems with Large Language Models. (AAAI 2024) [arXiv:2308.09687]
Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (NeurIPS 2022) [arXiv:2201.11903]
Kojima et al. (2022) — Large Language Models are Zero-Shot Reasoners. (NeurIPS 2022) [arXiv:2205.11916]
🤝 10. Debate & Multi-Agent Reasoning
Irving et al. (2018) — AI Safety via Debate. (OpenAI) [arXiv:1805.00899]
Du et al. (2023) — Improving Factuality and Reasoning in Language Models through Multiagent Debate. [arXiv:2305.14325]