Intelcom | Dragonfly

Senior Site Reliability Engineer (SRE)

Posted 3 Days Ago

In-Office

Montréal, QC

Senior level

The Senior Site Reliability Engineer role involves managing incidents, automating tasks, optimizing performance, ensuring high availability, and collaborating across teams.

Intelcom | Dragonfly

With more than 100 sorting stations and operations across three continents, Intelcom | Dragonfly is Canada’s leader in last-mile logistics. Our vision is clear: to deliver fast, accurate, and reliable service powered by cutting-edge technology.

A Strategic Role at the Heart of Logistics

Responsibilities

Incident Management: Detect and respond to issues, ensuring rapid recovery to minimize downtime. Current on-call contributors need better coordination and structure in their investigations. This role involves off-hours events, but these are cyclical, with quieter periods. Define and implement an escalation process, communicate it to all stakeholders across the business, and ensure their adherence to it. Document incident reports and conduct post-mortems to promote continuous improvement.
Collaboration: Work closely with development and operations teams to ensure smooth deployment and operation of applications. Provide primary operational support and engineering for large-scale distributed software applications. Collaborate with development teams to improve services through rigorous testing and release procedures. Participate in system design consulting, platform management, and capacity planning. This requires diligent follow-up and close collaboration with all teams.
Influence: Create sustainable systems and services through automation and enhancements. Promote a culture of innovation and continuous improvement within the SRE team and the broader organization. Coordinate the SRE team in establishing and executing operational policies that promote agility and scalability. Coordinate and mentor SRE team members, fostering professional growth and development.
Automation: Automate repetitive tasks to improve efficiency and reduce human error. Improve the reliability, quality, and time-to-market of our software solutions. Measure and optimize system performance, anticipating business needs.
Monitoring and Alerting: Implement and enhance monitoring systems (e.g., Datadog) to track the health and performance of applications and infrastructure. Some monitoring systems already exist, but additional ones are needed. Monitor and maintain the production environment, ensuring high availability and system health. Gather and process metrics from operating systems and applications to assist in performance tuning and fault finding. Develop a health monitoring dashboard that gives our various stakeholders visibility into our production environment.
Disaster Recovery: Prepare and implement disaster recovery plans to manage unexpected outages.
Performance Optimization: Continuously improve system performance and scalability.
Capacity Planning: Ensure the infrastructure can handle current and future demands.
Chaos Engineering: Intentionally introduce failures to test system resilience and improve robustness.

Qualifications

Bachelor's degree in software engineering, computer science or equivalent.
Minimum of 7 years of experience in cloud management, development, and/or SRE responsibilities.
Experience in Agile methodology and technical project execution.
Knowledgeable in DevOps concepts, AWS, Azure, GCP, observability tools (Datadog, Cloudflare), Terraform, and PagerDuty, and in how to integrate them.

Other Skills:

Strong initiative and resilience, with a demonstrated ability to explore new ideas and innovative approaches to solving complex problems.
Excellent interpersonal and communication skills in both French and English.
Able and comfortable working in a fast-moving environment.

Schedule: Primarily daytime hours, but on-call availability is required for the initial months to observe and refine existing processes.

Why Join Us?

At Intelcom | Dragonfly, you’ll thrive in a flexible and stimulating environment, surrounded by passionate talent. You’ll also enjoy a wide range of benefits:

On-site gym with a personal trainer

Employer-provided lunch of your choice

Comprehensive group insurance

Group RRSP plan

Wellness days

Partial reimbursement for public transportation

Employee Assistance Program

…and much more.

This position has been opened to address a genuine organizational need within the company.

At Intelcom | Dragonfly, we move forward guided by strong values: collaboration, innovation, excellence, and responsibility.

We embrace diversity, ensure equity, and foster a true sense of belonging.

Accommodation measures are available for individuals with disabilities throughout our recruitment process, in compliance with the law. Please let us know if you have any specific needs.

Top Skills

AWS

Azure

Datadog

GCP

PagerDuty

Terraform
