Moniepoint is a financial technology company digitising Africa’s real economy by building a financial ecosystem for businesses, providing them with all the payment, banking, credit and business management tools they need to succeed.
Job Summary
We are seeking an SRE Team Lead to guide a squad of site reliability engineers responsible for the reliability of our highly distributed financial platform. You will design high-level reliability architecture while also mentoring engineers, defining the technical roadmap, and driving the culture of Site Reliability Engineering within the team. You will balance strategic leadership with deep technical work to ensure our systems and our people can scale linearly with our hyper-growth.
Responsibilities
- Set the technical direction for the SRE team. Architect self-healing systems, define reliability standards (Production Readiness Reviews), and drive the adoption of Observability as Code and automation best practices.
- Define and enforce the end-to-end standard for system visibility. You will guide teams to deeply instrument their code (logging, tracing, metrics) and govern the monitoring ecosystem to ensure alerts are actionable, strictly minimizing noise (alert fatigue) while maximizing our ability to detect and resolve issues proactively.
- Lead, mentor, and grow a team of Senior and Associate SREs. Conduct code reviews, facilitate technical workshops and foster a culture of engineering excellence.
- Act as the ultimate escalation point for major incidents. Beyond firefighting, you will refine the incident management process so that it runs efficiently and RCAs lead to actionable engineering fixes.
- Partner with Engineering Managers and Product Leads to define Service Level Objectives (SLOs) that align with business goals.
Requirements
- Minimum of 5 years of experience in SRE or Backend Engineering, with at least 2 years in a Lead or Senior/Staff role mentoring others.
- Expert-level proficiency in Java, Go, Rust, or Python. You set the standard for code quality within the team.
- Mastery of distributed systems patterns. You can design scalable architectures, debug complex microservices interactions, and explain architectural trade-offs to stakeholders.
- Deep expertise with Google Cloud Platform (GCP) or AWS. You have extensive experience running Kubernetes (GKE) at scale and troubleshooting deep infrastructure issues.
- Proven experience defining observability strategies for large teams. You have deep expertise in architecting the complete telemetry stack, from custom instrumentation to monitoring and actionable alerting.
- Strong communication skills with the ability to de-escalate high-pressure war rooms with calm authority.