Role Summary
We are looking for a Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of botim's real-time communication and open platform infrastructure, supporting millions of active users globally. In this role, you will lead automation initiatives, operate and optimize large-scale Kubernetes clusters, and maintain highly available services across botim's cloud-native, microservices-based ecosystem.
You will work closely with platform, VoIP, and backend engineering teams to strengthen observability using Prometheus, improve CI/CD pipelines, implement Infrastructure as Code, and optimize cloud costs. This role is ideal for an experienced SRE who thrives in high-availability environments, enjoys solving complex production issues, and is passionate about building resilient systems that power real-time messaging and calling at scale.
Responsibilities
• Automate routine operational tasks using Shell scripting, ensuring efficiency in log analysis, batch management, and system optimization.
• Maintain and optimize middleware components supporting infrastructure operations, ensuring stability and performance.
• Administer and optimize Kubernetes clusters, ensuring scalability, security, and performance.
• Maintain and optimize monitoring and alerting systems based on Prometheus, ensuring high availability of services.
• Contribute to the development of CI/CD pipelines.
• Manage cloud resources efficiently, implementing cost optimization strategies to reduce cloud expenditure.
• Improve operational processes, develop automation tools, troubleshoot incidents, and enhance system stability and reliability.
Requirements
• Proficiency in Shell scripting for automating operational workflows and system management tasks.
• Experience in Python or Go, preferably for system automation, tooling, or backend services.
• At least 2 years of hands-on Kubernetes administration experience, including expertise in CSI, CNI, and managing clusters with 20+ nodes in production.
• Experience with Prometheus for monitoring and alerting in an enterprise environment.
• Familiarity with CI/CD deployment processes, with knowledge of GitOps principles. Hands-on experience with GitOps is a plus.
• Experience managing cloud platforms using Infrastructure as Code (IaC) tools such as Terraform or OpenTofu. Azure experience is a plus.
• Strong problem-solving skills, a proactive approach to troubleshooting, and a commitment to improving operational efficiency and system reliability.
• Bonus points: Experience managing large-scale distributed systems and microservices architectures, and a background in Site Reliability Engineering (SRE) best practices.