LLMs | Tokenization Strategies | Lec 9
LCS2

Premiered Aug 27, 2024

tl;dr: This lecture covers key tokenization strategies such as Byte-Pair Encoding, WordPiece, and Unigram Language Model, essential for anyone looking to enhance their understanding of how language models efficiently process text.

🎓 Lecturer: Tanmoy Chakraborty [https://tanmoychak.com]
🔗 Get the Slides Here: http://lcs2.in/llm2401
📚 Suggested Readings:
Byte Pair Encoding [https://arxiv.org/abs/1508.07909]
WordPiece [https://ieeexplore.ieee.org/stamp/sta...]
Unigram Language Model [https://arxiv.org/abs/1804.10959]

Unlock the fundamentals of tokenization in NLP with this lecture on Byte-Pair Encoding (BPE), WordPiece, and Unigram Language Model tokenization. These strategies shape how modern language models process text: each one breaks raw text into subword units drawn from a fixed vocabulary. This session is ideal for those seeking to understand how tokenization feeds into language model training and its application across NLP tasks.
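
For a feel of what "breaking text into pieces" means in practice, here is a minimal Python sketch of BPE merge learning in the spirit of the suggested BPE reading. It is a toy illustration, not the lecture's code: the function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker, and the tiny corpus are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a sequence of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the most frequent pair fused.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    rules = learn_bpe("low low low lower lowest", num_merges=3)
    print(apply_bpe("low", rules))
    print(apply_bpe("lower", rules))
```

With this toy corpus, frequent character sequences such as "low" end up as single symbols while rarer suffixes stay split into characters. WordPiece and Unigram differ in how they score candidate pieces; this sketch covers only the frequency-based BPE merge loop.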
