WitnessAI

SRE - Performance Engineering

Posted 8 Days Ago
7 Locations
Senior level
The Site Reliability Engineer will focus on performance analysis, optimization, and reliability of cloud infrastructure utilizing advanced methodologies and debugging tools.
The summary above was generated by AI

Job Title: Site Reliability Engineering - Performance Engineer

Location: Bay Area preferred/Hybrid

Department: DevOps

At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.

Key Responsibilities

  • Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.

  • Perform detailed kernel and application tracing using tools based on technologies like eBPF, perf, and ftrace to gain insights into system behavior.

  • Design and implement performance dashboards to visualize key performance metrics in real time.

  • Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.

  • Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.

  • Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.

  • Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.

  • Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.

  • Optimize distributed training pipelines using industry-standard frameworks.

  • Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.

  • Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.

  • Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.

  • Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.

  • Work with developers to refactor applications for performance and scalability, using profiling tools.

  • Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.

Qualifications

Required:

  • Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.

  • Strong experience with AWS cloud services and their performance optimization techniques.

  • Proficiency with performance analysis tools, load testing tools, and system tracing frameworks.

  • Hands-on experience with database tuning, query analysis, and indexing strategies.

  • Expertise in GPU workload optimization and cloud-based GPU instances.

  • Familiarity with message queuing systems, including their performance tuning.

  • Programming experience with a focus on profiling and tuning.

  • Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.

Preferred:

  • Knowledge of distributed AI/ML training frameworks.

  • Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.

  • Expertise in optimizing AI inference pipelines.

  • Familiarity with Brendan Gregg’s methodologies for systems analysis, such as the USE method (Utilization, Saturation, Errors) and workload characterization.

Benefits:

  • Hybrid work environment

  • Competitive salary

  • Health, dental, and vision insurance

  • 401(k) plan

  • Opportunities for professional development and growth

  • Generous vacation policy

Salary range:

$180,000-$220,000

Top Skills

ActiveMQ
AWS
Bash
eBPF
ftrace
Kafka
Kubernetes
Linux
perf
Python
SQS
