We are seeking highly skilled and motivated software engineers to join our NeMo RL team. You will empower AI practitioners to productively and affordably develop and deploy large language models (LLMs) with reinforcement learning (RL) techniques on the NeMo RL framework. If you have experience with multi-node distributed training jobs and are passionate about solving the challenges of high-performance RL systems for LLMs, we invite you to join our team!
What you’ll be doing:
Design and implement highly efficient distributed training systems for large-scale RL models.
Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
Productionize training systems with fault-tolerance capabilities and uncompromising software quality.
Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility.
Profile, debug, and tune performance at the model, system, and hardware levels.
Participate in design discussions, code reviews, and technical planning to ensure the product aligns with the business goals.
Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.
What we need to see:
Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience.
3+ years of experience in software development, preferably with Python and C++.
Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.
Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed.
Experience optimizing compute, memory, and communication performance in large model training workflows.
Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.
Solid grasp of deep learning fundamentals, especially as they relate to RL and training dynamics.
Ability to work closely with both research and engineering teams, translating evolving needs into technical requirements and robust code.
Excellent problem-solving skills, with the ability to debug complex systems.
A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.
Ways to stand out from the crowd:
Background in building and optimizing LLM pre-training or post-training frameworks such as DeepSpeed, torchtitan, Nanotron, or verl.
Experience building and optimizing LLM inference engines such as vLLM or SGLang.
Experience building ML compilers such as Triton or Torch Dynamo/Inductor.
Background in working with cloud platforms (e.g., AWS, GCP, or Azure), containerization tools (e.g., Docker), and orchestration infrastructure (e.g., Kubernetes, Slurm).
Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
At NVIDIA, we believe artificial intelligence (AI) will fundamentally transform how people live and work. Our mission is to advance AI research and development to create groundbreaking technologies that enable anyone to harness the power of AI and benefit from its potential. Our team consists of experts in AI, systems, and performance optimization. Our leadership includes world-renowned experts in AI systems who have received multiple academic and industry research awards. If you've hacked the inner workings of PyTorch, written many CUDA/HIP kernels, developed and optimized inference services or training workloads, built and maintained large-scale Kubernetes clusters, or if you simply enjoy solving hard problems, feel free to submit an application!
#LI-Hybrid
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 116,250 CAD - 201,500 CAD for Level 3, and 142,500 CAD - 247,000 CAD for Level 4. You will also be eligible for equity and benefits.