AI Research Engineer (Kernel & Inference Optimization)
About the role
Join Tether's AI model team to drive innovation in model serving and inference architectures for advanced AI systems — from resource-efficient edge models to complex multimodal architectures. You will write custom GPU compute kernels using Metal Shading Language, build high-throughput inference pipelines, and optimize distributed inference across massive GPU clusters and resource-constrained mobile devices.
What you'll do
- Design and deploy model serving architectures achieving high throughput and low latency while minimizing memory footprint across edge and resource-constrained environments
- Write custom GPU compute shaders from scratch using Metal Shading Language (MSL) for on-device AI inference on smartphones
- Build, run, and monitor controlled inference tests tracking response latency, throughput, memory consumption, and error rates
- Design distributed inference systems using Tensor Parallelism, Pipeline Parallelism, and Expert Parallelism for massive GPU-cluster deployments
- Identify and resolve computational bottlenecks including suboptimal batch processing, network delays, and high memory usage in production pipelines
- Apply pruning, quantization, FlashAttention, KV caching, and speculative decoding to keep improving inference performance
Requirements
- PhD in NLP, Machine Learning, or a related field (CS degree minimum) with A* conference publications and a strong AI R&D track record
- Hands-on knowledge of Metal Shading Language (MSL), including writing custom compute shaders from scratch (required)
- Proven experience in low-level kernel optimizations and inference optimization on mobile/edge devices with measurable improvements in latency, throughput, and memory footprint
- Deep expertise in modern model serving architectures, inference optimization techniques, and efficient memory management for resource-constrained deployment
- Strong background in distributed inference, transformer/diffusion model architectures, and efficiency techniques (FlashAttention, KV caching, speculative decoding, quantization, pruning)
About Tether Operations Limited
Tether is a global fintech company behind USDT, the world's most widely used stablecoin, and operates divisions in AI, communications, energy, and education through its Tether Data, Tether Evo, Tether Power, and Tether Education subsidiaries.