Senior Site Reliability Engineer

Astra Tech

Abu Dhabi - United Arab Emirates | First seen: 06 Feb 2026 14:04 | Foundit

Full-time
Apply Links

Foundit, Zero Tax Jobs, BeBee AE, Trabajo.org, Empllo
Job Description
Role Summary

We are looking for a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of botim's real-time communication and open platform infrastructure, supporting millions of active users globally. In this role, you will lead automation initiatives, operate and optimize large-scale Kubernetes clusters, and maintain highly available services across botim's cloud-native, microservices-based ecosystem. You will work closely with platform, VoIP, and backend engineering teams to strengthen observability using Prometheus, improve CI/CD pipelines, implement Infrastructure as Code, and optimize cloud costs.

This role is ideal for an experienced SRE who thrives in high-availability environments, enjoys solving complex production issues, and is passionate about building resilient systems that power real-time messaging and calling at scale.

Responsibilities

• Automate routine operational tasks using Shell scripting, ensuring efficiency in log analysis, batch management, and system optimization.
• Maintain and optimize middleware components supporting infrastructure operations, ensuring stability and performance.
• Administer and optimize Kubernetes clusters, ensuring scalability, security, and performance.
• Maintain and optimize monitoring and alerting systems based on Prometheus, ensuring high availability of services.
• Contribute to the development of CI/CD pipelines.
• Manage cloud resources efficiently, implementing cost optimization strategies to reduce cloud expenditure.
• Improve operational processes, develop automation tools, troubleshoot incidents, and enhance system stability and reliability.

Requirements

• Proficiency in Shell scripting for automating operational workflows and system management tasks.
• Experience in Python or Go, preferably for system automation, tooling, or backend services.
• At least 2 years of hands-on Kubernetes administration experience, including expertise in CSI, CNI, and managing clusters with 20+ nodes in production.
• Experience with Prometheus for monitoring and alerting in an enterprise environment.
• Familiarity with CI/CD deployment processes, with knowledge of GitOps principles. Hands-on experience with GitOps is a plus.
• Experience managing cloud platforms using Infrastructure as Code (IaC) tools like Terraform/OpenTofu. Azure experience is a plus.
• Strong problem-solving skills, a proactive approach to troubleshooting, and a commitment to improving operational efficiency and system reliability.
• Bonus Points: Experience managing large-scale distributed systems and microservices architecture. Background in Site Reliability Engineering (SRE) best practices.
Metadata

Source: google_jobs

Via: Foundit

Search Query: Senior SRE

First Seen: 06 Feb 2026 14:04 UTC

Last Seen: 07 Feb 2026 00:30 UTC

Source Job ID: eyJqb2JfdGl0bGUiOiJTZW5pb3IgU2l0ZSBSZWxpYWJpbGl0eSBFbmdpbmVlciIsImNvbXBhbnlfbmFtZSI6IkFzdHJhIFRlY2giLCJhZGRyZXNzX2NpdHkiOiJBYnUgRGhhYmkgLSBVbml0ZWQgQXJhYiBFbWlyYXRlcyIsImh0aWRvY2lkIjoiZlVzNkQ5MzRKeTV3OTh6VEFBQUFBQT09IiwidXVsZSI6IncrQ0FJUUlDSVVWVzVwZEdWa0lFRnlZV0lnUlcxcGNtRjBaWE0iLCJobCI6ImVuIn0=