This is a remote position.
About SiFi: SiFi is a rapidly growing B2B fintech company transforming expense management for businesses in Saudi Arabia. As an Electronic Money Institution (EMI) licensed by the Saudi Central Bank, we empower companies with innovative tools that simplify finance management.
Position Overview
We are looking for a Senior Site Reliability Engineer (SRE) who will take ownership of the reliability, performance, and scalability of our production systems. You will design, automate, and operate mission-critical environments that include Kubernetes clusters, database disaster recovery, workflow orchestration, and multi-region networking.
This role suits engineers who think deeply about systems, combining infrastructure, automation, and diagnostic reasoning to drive operational excellence.
Primary Responsibilities
Reliability, Availability & Infrastructure
• Maintain and evolve multi-region cloud infrastructure using Terraform-based Infrastructure as Code (IaC).
• Operate and optimize Kubernetes (OKE) clusters running microservices, data pipelines, and workflow orchestration.
• Manage SQL Server backup/restore pipelines, DR testing, and performance optimization.
• Ensure high availability for .NET and Python applications hosted behind load balancers and WAF.
• Design and maintain cross-network connectivity (DRGs, LPGs, VCNs, subnets, and NSGs).
Observability & Automation
• Build and maintain a centralized orchestration platform integrated with alerting and notification systems.
• Develop self-healing, monitoring, and auto-remediation scripts for infrastructure and databases.
• Implement logging, metrics, and tracing pipelines.
• Automate recurring operational tasks using Python, Bash, and PowerShell to reduce manual effort and improve reliability.
DevOps, CI/CD & Security
• Manage GitHub Actions and Octopus Deploy pipelines for backend and data services.
• Apply strong security principles: least privilege, network segmentation, secure credentials, and encrypted communications.
• Promote GitOps and Infrastructure-as-Code practices to ensure repeatable and traceable deployments.
• Collaborate with developers to embed reliability and resilience into every release.
Collaboration & Incident Management
• Lead incident response, run blameless post-mortems, and turn findings into lasting improvements.
• Partner closely with engineering teams to drive design and code-level reliability improvements.
• Conduct capacity planning, cost optimization, and system tuning for performance and scalability.
• Mentor engineers in automation, observability, and root-cause analysis best practices.
Troubleshooting Mindset & Diagnostic Thinking
We value engineers who:
• Approach issues systematically and validate assumptions with data.
• Treat incidents as opportunities to improve design and automation.
• Rely on metrics, logs, and tracing rather than guesswork.
• Communicate findings clearly and document learnings for future reference.
• Continuously refine how problems are detected, escalated, and resolved.