Featherless AI — ML Engineer — Training Optimization
About the role
Featherless AI is looking for an ML Engineer to bring training efficiency research into production. You will implement and scale training optimization techniques that reduce compute cost and improve model quality across the open-source models Featherless hosts, turning research insights into reliable engineering systems.
What you'll do
- Implement training efficiency techniques at scale: mixed precision, gradient accumulation, activation checkpointing
- Optimize distributed training across large GPU clusters using FSDP, DeepSpeed, or Megatron-LM
- Profile and debug training instabilities, loss spikes, and memory bottlenecks
- Build tooling for training monitoring, evaluation, and experiment reproducibility
- Translate optimization research findings into production training workflows
Requirements
- Strong ML engineering background with production training pipeline experience
- Experience with distributed training frameworks (FSDP, DeepSpeed, Megatron-LM)
- Proficiency in Python and PyTorch or JAX
- Track record of shipping optimized training pipelines at scale across large model families
About Featherless AI
Featherless AI is a serverless inference platform hosting 3,000+ open-source LLMs, letting developers call any model via a simple API without managing GPU infrastructure.