LLMs | Alignment of Language Models: Reward Maximization-II | Lec 13.2
LCS2

Premiered Sep 26, 2024

tl;dr: A look at different algorithms for training the policy model (the LLM) to maximize the reward: REINFORCE and PPO.
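
For quick intuition, here is a minimal PyTorch sketch (not from the lecture or the linked papers) contrasting the two losses named above: the basic REINFORCE objective and PPO's clipped surrogate objective. All names (`log_probs`, `old_log_probs`, `advantages`, `clip_eps`) and the toy values are illustrative assumptions, not identifiers from the slides.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE: maximize E[R * log pi(a|s)], i.e. minimize its negative.
    return -(rewards * log_probs).mean()

def ppo_clipped_loss(log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # PPO clipped surrogate: cap how far the probability ratio can move
    # away from the old policy in a single update.
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy batch: random values standing in for per-response log-probs and rewards.
    log_probs = torch.randn(8, requires_grad=True)
    old_log_probs = log_probs.detach() + 0.1 * torch.randn(8)
    rewards = torch.randn(8)
    print(reinforce_loss(log_probs, rewards))
    print(ppo_clipped_loss(log_probs, old_log_probs, advantages=rewards))
```

In practice (e.g., in the InstructGPT-style setup covered in the readings), rewards are usually baseline- or advantage-adjusted and a KL penalty against the reference model is added; this sketch only shows the core surrogate losses.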

🎓 Lecturer: Gaurav Pandey [  / gaurav-pandey-11321120  ]
🔗 Get the Slides Here: http://lcs2.in/llm2401
📚 Suggested Readings:
[Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347)
[Training language models to follow instructions with human feedback](https://proceedings.neurips.cc/paper_...)
[Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://arxiv.org/pdf/2402.14740)
