LLMs | Alignment of Language Models: Reward Maximization-II | Lec 13.2
LCS2

Premiered Sep 26, 2024

tl;dr: A look at different algorithms for training the policy model (the LLM) to maximize the reward: REINFORCE and PPO.
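
For quick intuition, here is a minimal PyTorch sketch (not from the lecture or the linked papers) contrasting the two losses named above: the basic REINFORCE objective and PPO's clipped surrogate objective. All names (`log_probs`, `old_log_probs`, `advantages`, `clip_eps`) and the toy values are illustrative assumptions, not identifiers from the slides.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE: maximize E[R * log pi(a|s)], i.e. minimize its negative.
    return -(rewards * log_probs).mean()

def ppo_clipped_loss(log_probs: torch.Tensor,
                     old_log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # PPO clipped surrogate: cap how far the probability ratio can move
    # away from the old policy in a single update.
    ratio = torch.exp(log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy batch: random values standing in for per-response log-probs and rewards.
    log_probs = torch.randn(8, requires_grad=True)
    old_log_probs = log_probs.detach() + 0.1 * torch.randn(8)
    rewards = torch.randn(8)
    print(reinforce_loss(log_probs, rewards))
    print(ppo_clipped_loss(log_probs, old_log_probs, advantages=rewards))
```

In practice (e.g., in the InstructGPT-style setup covered in the readings), rewards are usually baseline- or advantage-adjusted and a KL penalty against the reference model is added; this sketch only shows the core surrogate losses.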

🎓 Lecturer: Gaurav Pandey [  / gaurav-pandey-11321120  ]
🔗 Get the Slides Here: http://lcs2.in/llm2401
📚 Suggested Readings:
[Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347)
[Training language models to follow instructions with human feedback](https://proceedings.neurips.cc/paper_...)
[Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs](https://arxiv.org/pdf/2402.14740)
