📚 References

🧮 1. PPO - Proximal Policy Optimization

  • Schulman et al. (2017). Proximal Policy Optimization Algorithms. [arXiv:1707.06347]
  • Adaptive-ML Blog (2023). From Zero to PPO: Understanding the Path to Helpful AI Models. [Link]
  • Zheng et al. (2023). Secrets of RLHF in Large Language Models, Part I: PPO. [arXiv:2307.04964]
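
A minimal sketch of the clipped surrogate objective from Schulman et al. (2017); tensor names and the default `eps` are illustrative, not taken from any particular library:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate loss from Schulman et al. (2017).

    logp_new / logp_old: log-probs of the taken actions under the
    current and behavior policies; advantage: estimated advantages.
    """
    ratio = torch.exp(logp_new - logp_old)  # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Pessimistic bound: take the smaller objective, negate for a loss.
    return -torch.min(unclipped, clipped).mean()
```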

🎯 2. DPO - Direct Preference Optimization

  • Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (NeurIPS 2023) [arXiv:2305.18290]
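
A minimal sketch of the DPO loss from Rafailov et al. (2023), assuming per-sequence log-probabilities have already been summed over tokens; variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss (Rafailov et al., 2023): -log sigmoid(beta * margin).

    logp_w / logp_l: policy log-probs of chosen / rejected responses;
    ref_logp_*: the same quantities under the frozen reference model.
    """
    # Implicit rewards are log-ratios against the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```
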
🎯 3. REINFORCE & RLOO

  • Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). (Machine Learning)
  • Ahmadian et al. (2024). Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. (ACL 2024) [arXiv:2402.14740]
  • Hugging Face Blog (2024). Putting RL Back in RLHF. [Blog]
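
A minimal sketch of the leave-one-out baseline from Ahmadian et al. (2024): with k completions per prompt, each sample's baseline is the mean reward of the other k-1 samples. Shapes and names are illustrative:

```python
import torch

def rloo_advantages(rewards):
    """REINFORCE Leave-One-Out advantages (Ahmadian et al., 2024).

    rewards: tensor of shape (k,) holding the rewards of k completions
    sampled for the same prompt.
    """
    k = rewards.numel()
    # Baseline for sample i = mean reward of the other k-1 samples.
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline  # use in REINFORCE: -(A * logp).mean()
```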

🚀 3a. DAPO - Decoupled Clip and Dynamic Sampling Policy Optimization

  • Yu et al. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. [arXiv:2503.14476]
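
A hedged sketch of the "clip-higher" idea described in Yu et al. (2025): the upper and lower clip ranges are decoupled so that low-probability tokens can increase faster than PPO's symmetric clip allows. The `eps_low`/`eps_high` defaults are illustrative:

```python
import torch

def dapo_clip_loss(logp_new, logp_old, advantage,
                   eps_low=0.2, eps_high=0.28):
    """Decoupled-clip surrogate: wider upper clip range than PPO's
    symmetric clip, following the clip-higher trick in Yu et al. (2025)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
    return -torch.min(unclipped, clipped).mean()
```
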
🔁 3b. GRPO - Group Relative Policy Optimization

  • Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (Introduces GRPO) [arXiv:2402.03300]
  • Mroueh et al. (2025). Revisiting Group Relative Policy Optimization. [arXiv:2505.22257]
  • Samia Sahin (2025). The Math Behind DeepSeek — GRPO Explained. [Medium]
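
A minimal sketch of the group-relative advantage from Shao et al. (2024): rewards for a group of completions to the same prompt are standardized within the group, which removes the need for a learned value model. Names are illustrative:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages (Shao et al., 2024).

    rewards: tensor of shape (G,) for G completions of one prompt.
    Each completion's advantage is its z-score within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```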

🔁 3c. DRPO - Decoupled Reward Policy Optimization

  • Li et al. (2025). DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization. [Paper]

⚖️ 4. KL Penalty

  • APXML Guide (2023). KL Divergence Penalty in RLHF. [Article]
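
A minimal sketch of the per-token KL penalty commonly used in RLHF pipelines (e.g., Ouyang et al., 2022): the reward-model score is offset by beta times the log-ratio between the policy and a frozen reference model. The naive (k1) estimator shown here is one common choice; names and the `beta` value are illustrative:

```python
import torch

def shaped_rewards(rm_reward, logp_policy, logp_ref, beta=0.05):
    """Subtract a per-token KL penalty from the reward signal.

    rm_reward: scalar reward-model score for the full response;
    logp_policy / logp_ref: per-token log-probs (shape (T,)) under
    the trained policy and the frozen reference model.
    """
    kl_per_token = logp_policy - logp_ref   # naive (k1) KL estimate
    rewards = -beta * kl_per_token          # penalty on every token
    rewards[-1] = rewards[-1] + rm_reward   # RM score on final token
    return rewards
```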

🚀 5. DeepSeek RL — Reinforcement Learning for Reasoning

  • Guo et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. [Link]
  • Analytics Vidhya (2025). LLM Optimization: GRPO, PPO, and DPO. [Blog]

🧠 6. Reward Hacking & Specification Gaming

  • Amodei et al. (2016). Concrete Problems in AI Safety. [arXiv:1606.06565]
  • Gao et al. (2023). Scaling Laws for Reward Model Overoptimization. (ICML 2023) [arXiv:2210.10760]
  • Krakovna et al. (2020). Specification Gaming: The Flip Side of AI Ingenuity. (DeepMind Blog) [Blog]

🧩 Others

  • Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences. (NeurIPS 2017) [arXiv:1706.03741]
  • Ouyang et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). [arXiv:2203.02155]
  • Bai et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. [arXiv:2204.05862]
  • Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. (Anthropic) [arXiv:2212.08073]
  • Lee et al. (2023). RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. [arXiv:2309.00267]
  • Yuan et al. (2024). Self-Rewarding Language Models. (Meta AI) [arXiv:2401.10020]

📊 7. Alignment Evaluation & LLM-as-Judge

  • Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. (NeurIPS 2023) [arXiv:2306.05685]
  • Li et al. (2024). From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. (ICML 2025) [arXiv:2406.11939]
  • Dubois et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. [arXiv:2404.04475]
  • Lin et al. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. (ACL 2022) [arXiv:2109.07958]
  • Zhou et al. (2023). Instruction-Following Evaluation for Large Language Models (IFEval). [arXiv:2311.07911]
  • Lambert et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. [arXiv:2403.13787]
  • Liu et al. (2024). RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style. [arXiv:2410.16184]
  • White et al. (2024). LiveBench: A Challenging, Contamination-Free LLM Benchmark. [arXiv:2406.19314]

🔢 8. Verification Metrics & Reasoning Benchmarks

  • Chen et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval / pass@k; see the estimator sketch after this list). [arXiv:2107.03374]
  • Wang et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. (ICLR 2023) [arXiv:2203.11171]
  • Lightman et al. (2023). Let's Verify Step by Step (PRMs on MATH). (ICLR 2024) [arXiv:2305.20050]
  • Jimenez et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR 2024) [arXiv:2310.06770]
  • Rein et al. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. [arXiv:2311.12022]
  • Hendrycks et al. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. [arXiv:2103.03874]
  • Hendrycks et al. (2021). Measuring Massive Multitask Language Understanding (MMLU). [arXiv:2009.03300]
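
The unbiased pass@k estimator from Chen et al. (2021): with n samples per problem, of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k), computed here in the numerically stable product form the paper recommends:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples drawn per problem; c: samples that passed;
    k: evaluation budget. Returns 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures for any size-k draw to miss
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```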

🧠 9. Reasoning Techniques

  • Weston & Sukhbaatar (2023). System 2 Attention (is something you might need too). (Meta AI) [arXiv:2311.11829]
  • Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. [arXiv:2303.17651]
  • Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. [arXiv:2303.11366]
  • Gou et al. (2023). CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. [arXiv:2305.11738]
  • Zelikman et al. (2022). STaR: Bootstrapping Reasoning With Reasoning. (NeurIPS 2022) [arXiv:2203.14465]
  • Yao et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. (NeurIPS 2023) [arXiv:2305.10601]
  • Besta et al. (2024). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. (AAAI 2024) [arXiv:2308.09687]
  • Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (NeurIPS 2022) [arXiv:2201.11903]
  • Kojima et al. (2022). Large Language Models are Zero-Shot Reasoners. (NeurIPS 2022) [arXiv:2205.11916]

🤝 10. Debate & Multi-Agent Reasoning

  • Irving et al. (2018). AI Safety via Debate. (OpenAI) [arXiv:1805.00899]
  • Du et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. [arXiv:2305.14325]