Fail fast & recover faster: infrastructure resilience of multi-node LLM training
Nebius AI

Published on Apr 25, 2024

Training an LLM in a multi-node setup is a complex and expensive process. Training failures can't be eliminated, but the downtime they cause can be reduced.

In this talk, Filipp Fisin, Senior ML Engineer at Nebius AI, provides an overview of techniques for more resilient training that we've found useful in our JAX-based multi-node training setup, namely:
- multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
- a special type of Kubernetes health check that detects whether a training process is stuck
- techniques to efficiently save and load terabyte-scale checkpoints
- XLA compilation cache
- GPU node monitoring and auto-cordoning
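
To make the stuck-process health check concrete, here is a minimal sketch of one common approach, a heartbeat file. It is not necessarily the mechanism described in the talk. The path, the threshold, and the function names `touch_heartbeat`, `is_alive` and `probe_exit_code` are all hypothetical. The training loop refreshes a file after every step, and a probe reports failure when the file has gone stale.

```python
import os
import tempfile
import time

# Hypothetical defaults, chosen only for illustration.
HEARTBEAT_PATH = os.path.join(tempfile.gettempdir(), "train_heartbeat")
MAX_AGE_SECONDS = 600.0  # should exceed the slowest expected step or checkpoint save


def touch_heartbeat(path=HEARTBEAT_PATH):
    """Create the heartbeat file if needed and set its mtime to now.

    Intended to be called by the training loop after each completed step.
    """
    with open(path, "a"):
        pass
    os.utime(path, None)


def is_alive(path=HEARTBEAT_PATH, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat was refreshed within max_age seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No heartbeat file yet: treat the process as not alive.
        return False
    current = time.time() if now is None else now
    return (current - mtime) <= max_age


def probe_exit_code(path=HEARTBEAT_PATH, max_age=MAX_AGE_SECONDS):
    """Map liveness to a process exit code: 0 is healthy, 1 is stuck."""
    return 0 if is_alive(path, max_age) else 1
```

A small script that exits with `probe_exit_code()` could be wired into a Kubernetes exec liveness probe, so that a pod whose training loop has stopped making progress is restarted even though its process is still running. The threshold is a trade-off: too low and long checkpoint saves or compilation trigger false restarts, too high and a hung job wastes GPU time before it is detected.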

✅ Find out more on our website: https://nebius.ai/
