Fail fast & recover faster: infrastructure resilience of multi-node LLM training


Published On Apr 25, 2024

Training an LLM in a multi-node setup is a complex and expensive process. Training failures can't be eliminated, but the downtime they cause can be reduced.

In this talk, Filipp Fisin, Senior ML Engineer at NebiusAI, provides an overview of techniques that we've found useful for making our JAX-based multi-node training setup more resilient, namely:
- multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
- a special type of Kubernetes health check that detects whether a training process is stuck
- techniques to efficiently save and load terabyte-scale checkpoints
- XLA compilation cache
- GPU node monitoring and auto-cordoning
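
To make the stuck-process health check from the list above more concrete, here is a minimal sketch of one common pattern: the training loop updates a heartbeat file after every step, and a probe command reports failure when that file has gone stale. The talk description does not say how the actual check is implemented, so this is an assumption for illustration only; the names `touch_heartbeat`, `is_alive`, and the `max_age_s` threshold are hypothetical.

```python
import os
import time
from typing import Optional


def touch_heartbeat(path: str) -> None:
    """Hypothetical helper: the training loop calls this after each completed step."""
    # Create the file if it does not exist, then set its mtime to the current time.
    with open(path, "a"):
        os.utime(path, None)


def is_alive(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Hypothetical probe logic: healthy only if the heartbeat is recent enough."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No heartbeat written yet: treated as unhealthy in this sketch,
        # so a real probe would need a startup grace period.
        return False
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_s
```

A process that is still running but stuck (for example, blocked in a collective operation) stops updating the file, so the check fails even though the process has not exited. In Kubernetes, logic like this could be wrapped in a small script and run as an exec liveness probe, with `max_age_s` set comfortably above the longest expected step, checkpoint save, or compilation time.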

✅ Find out more on our website: https://nebius.ai/
