Moniepoint is a financial technology company digitising Africa’s real economy by building a financial ecosystem for businesses, providing them with all the payment, banking, credit and business management tools they need to succeed.
Job Summary
- Responsible for ensuring our systems run smoothly and efficiently while engineering solutions to improve visibility, eliminate repetitive tasks, and increase system resilience. The ideal candidate will balance real-time on-call responsibilities with strategic engineering work to achieve sustainable and scalable service reliability.
What You’ll Get To Do
- Participate in on-call rotations as the primary technical lead for detecting, triaging, and resolving service degradation, outages, or reliability issues across all environments.
- Act as the Incident Commander during major incidents: initiate war room or bridge calls, coordinate cross-functional teams, provide timely and clear status updates to all stakeholders, and lead and document blameless Root Cause Analyses (RCAs) that drive long-term fixes.
- Develop automation to eliminate manual, repetitive operational tasks (toil) across both applications and infrastructure, improving efficiency and system resilience.
- Create and maintain dashboards and alerts to monitor application and infrastructure health.
- Participate in feature development discussions to ensure services are built with observability from the ground up.
- Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in collaboration with Product and Engineering teams.
- Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior.
To succeed in this role, we think you should have
- Minimum of 3 years of experience supporting enterprise applications in an SRE or similar role.
- Knowledge of distributed systems, microservices architecture and software design patterns.
- Experience with cloud platforms such as AWS, GCP, or Azure.
- Strong knowledge of Kubernetes and container orchestration tools.
- Experience using application performance monitoring tools, OpenTelemetry, and observability platforms such as New Relic, Datadog, ELK, or SigNoz.
- Excellent problem-solving and troubleshooting skills as an on-call engineer, with the ability to resolve complex infrastructure and application issues.
- Proficiency in setting up and maintaining monitoring dashboards and alerts using Grafana and Prometheus.
- Working knowledge of a scripting/programming language (e.g., Python, Bash).
- Proficiency in SQL databases (e.g., MySQL), writing complex SQL queries against large datasets, and hands-on experience in database administration.