Let's Write Africa's Story Together!

Old Mutual is a firm believer in the African opportunity and our diverse talent reflects this.

Job Description

ROLE OVERVIEW

The Head of Site Reliability Engineering (SRE) is a critical leadership position responsible for ensuring the bank's technology systems and services are reliable, scalable, and resilient.
This role requires a deep understanding of infrastructure, monitoring, incident management, and automation, as well as a strong ability to lead and inspire a team of SRE engineers.
The successful candidate will play a pivotal role in driving operational excellence, optimizing service delivery, and fostering a culture of reliability across the bank's digital ecosystem.

KEY RESULT AREAS

Strategy & Leadership
Define and implement the SRE strategy, ensuring alignment with the bank's business and technology goals.
Lead initiatives to enhance the reliability, availability, and performance of the bank's services.
Promote and embed SRE principles across engineering and operations teams.

Operational Reliability
Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve service reliability.
Oversee the development and operation of monitoring, logging, and alerting systems to detect and resolve issues proactively.
Manage incident response and post-mortem processes, driving root cause analysis and preventive actions.

Automation & Efficiency
Drive automation of operational tasks to reduce manual effort and improve efficiency.
Lead initiatives to optimize system performance, reduce latency, and enhance system resilience.
Champion the use of infrastructure as code and other modern engineering practices.

Collaboration & Stakeholder Management
Partner with development, infrastructure, and security teams to ensure seamless integration of SRE practices.
Collaborate with business units to understand priorities and ensure reliability initiatives align with their needs.
Act as the primary point of contact for SRE-related discussions with internal and external stakeholders.

Team Leadership & Development
Build, mentor, and manage a high-performing SRE team, fostering a culture of collaboration and innovation.
Drive continuous learning and skill development within the team to stay ahead of technological advancements.
Identify and address resource gaps to ensure effective delivery of SRE initiatives.

ROLE REQUIREMENTS

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
10+ years of experience in infrastructure, operations, or site reliability engineering, with at least 3 years in a leadership role.
Strong expertise in monitoring tools (e.g., Datadog, Prometheus, Grafana) and incident management platforms (e.g., PagerDuty).
Experience in cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes).
In-depth knowledge of automation tools, scripting languages, and CI/CD pipelines.
Proven track record in driving system reliability, scalability, and performance improvements.
Exceptional leadership and people management skills, with a focus on team development and motivation.
Excellent problem-solving and analytical abilities, with strong attention to detail.
Outstanding communication and stakeholder management skills.

Closing Date

09 January 2025, 23:59

Old Mutual Limited is pro-vaccination and encourages its workforce to be fully vaccinated against Covid-19.
All prospective employees are required to disclose their vaccination status as part of the recruitment process.