We are seeking highly skilled and motivated software engineers to join our NeMo RL team. You will empower AI practitioners to productively and affordably develop and deploy large language models (LLMs) with reinforcement learning (RL) techniques on the NeMo RL framework. If you have experience with multi-node distributed training jobs and are passionate about solving the challenges of high-performance RL systems for LLMs, we invite you to join our team!
What you’ll be doing:
Design and implement highly efficient distributed training systems for large-scale RL models.
Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
Productionize training systems with fault-tolerance capabilities and uncompromising software quality.
Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility.
Profile, debug, and tune performance at the model, system, and hardware levels.
Participate in design discussions, code reviews, and technical planning to ensure the product aligns with the business goals.
Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.
What we need to see:
Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience.
3+ years of experience in software development, preferably with Python and C++.
Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.
Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed.
Experience optimizing compute, memory, and communication performance in large model training workflows.
Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.
Solid grasp of deep learning fundamentals, especially as they relate to RL and training dynamics.
Ability to work closely with both research and engineering teams, translating evolving needs into technical requirements and robust code.
Excellent problem-solving skills, with the ability to debug complex systems.
A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.
Ways to stand out from the crowd:
Background in building and optimizing LLM pre-training or post-training frameworks such as DeepSpeed, torchtitan, Nanotron, or verl.
Experience building and optimizing LLM inference engines such as vLLM or SGLang.
Experience building ML compilers such as Triton or Torch Dynamo/Inductor.
Background in working with cloud platforms (e.g., AWS, GCP, or Azure), containerization tools (e.g., Docker), and orchestration infrastructure (e.g., Kubernetes, Slurm).
Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
At NVIDIA, we believe artificial intelligence (AI) will fundamentally transform how people live and work. Our mission is to advance AI research and development to create groundbreaking technologies that enable anyone to harness the power of AI and benefit from its potential. Our team consists of experts in AI, systems, and performance optimization. Our leadership includes world-renowned experts in AI systems who have received multiple academic and industry research awards. If you've hacked the inner workings of PyTorch, written many CUDA/HIP kernels, developed and optimized inference services or training workloads, built and maintained large-scale Kubernetes clusters, or if you simply enjoy solving hard problems, feel free to submit an application!
#LI-Hybrid
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 116,250 CAD - 201,500 CAD for Level 3, and 142,500 CAD - 247,000 CAD for Level 4. You will also be eligible for equity and benefits.