Site Reliability Engineer

Details of the offer

Electrum is the next-generation payments technology company that provides cloud-native software to optimise the processing of financial transactions. Since 2012, we have established ourselves as a respected payments technology partner through our deep expertise and track record in delivering trusted enterprise-grade payments solutions.

We've built a reputation for providing solutions for high-volume, low-value payment schemes and services that enable our clients to deliver to their customers at scale. We love that the projects we work on touch the lives of millions of South Africans daily, making a real difference.

We hire the best of the best and we offer great opportunities for personal growth and career progression.

Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring smooth operations of cloud production systems.

Service Reliability and Availability

- Collaborate with teams to develop reliable, available, and scalable applications.
- Work closely with the development team to understand, address, and prevent technical issues.
- Participate in on-call rotations and manage critical incidents.
- Develop and maintain incident response processes and alerting mechanisms.
- Develop and maintain tools to monitor application and service SLIs and SLOs.

System Troubleshooting and Problem Resolution

- Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution.
- Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues.
- Participate in on-call rotations to provide 24/7 operational support as necessary.

Observability and Automation

- Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers.
- Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness.
- Implement automation tools and frameworks for deployment, configuration, and monitoring processes.
- Capacity management and planning for systems to ensure continued reliability.

Process Improvements

- Offer recommendations and improvements to enhance performance, security, and scalability.
- Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency.
- Drive cost-optimisation initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures.

Disaster Recovery

- Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity.
- Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures.
- Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests.

Requirements

- Bachelor's degree in Computer Science, Information Technology, or related field preferred.
- 3+ years' experience in an SRE or similar role.
- Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch.
- Demonstrable experience with observability tools like Elastic and Grafana.
- Development skills advantageous.
- Proficient troubleshooting and problem-solving skills.
- Excellent collaboration, communication, and time management skills.
- Attention to detail and ability to work effectively in a team environment.

A good work-life balance is very important at Electrum. To help you manage your own time and energy, Electrum offers benefits such as:

- Flexibility around core working hours (nature of flexibility is negotiated per role based on business need)
- Daily cooked lunches and a stocked kitchen for the mid-day nibbles
- Team socialising, getaways, and social outings

We have created a safe, transparent environment where we know mistakes happen, and that's okay. We even have a 3-step approach to dealing with them:

1. Tell everyone about it
2. Fix the mistake
3. Tell everyone about it

You are responsible for your actions – both the successes and the failures.

Nominal Salary: To be agreed

Job Function:

Engineering
