Job title : Lead Site Reliability Engineer
Job Location : South Africa
Deadline : December 25, 2024
Main Responsibilities include:
Work closely with the Platform & Product engineering teams to ensure that the platform, infrastructure and services are designed and optimised for availability, latency and performance.
Own and configure observability tooling.
Create and tune alerts to ensure we have adequate warning of impending failures, and check alerts as they are raised.
Investigate and resolve support issues escalated from the Tech Support team.
Lead incident response, resolution, root cause investigation, retrospective write-ups and follow-up actions so we can take every opportunity to learn, improve and make our services more resilient.
Identify patterns in incoming incidents and document these for further investigation.
Collaborate with other SREs and Tech Support to improve processes and share knowledge and best practice.
Skills/Experience Required:
End-to-end delivery and automation in an SRE, Platform or DevOps team.
Agile development practices and legacy platforms.
Engineering background and familiarity with modern programming languages, ideally Python.
Scripting for automation.
Experience investigating and resolving technical issues spanning performance, functionality and system interactions.
GCP, AWS, Azure (ideally GCP).
Strong experience and knowledge of observability, both in terms of best practices and tooling implementation/use (Datadog preferable, others will be accepted).
Infrastructure as Code, such as Terraform or alternatives.
CI/CD tools (preferably GitLab).
Database experience and ability to understand and write SQL (MySQL/MariaDB preferable).
Solid understanding of Linux operating systems (Debian preferable).
Understanding of the DevSecOps culture and experience delivering technical outcomes within this culture.
Previous experience managing or mentoring a team.
Experience in a SaaS environment.