About the Job:
The company is an ADX-listed public company and the region's leading big data analytics company powered by GenAI. We are seeking a passionate and skilled Site Reliability Engineer to join our TechOps team. In this role, you will keep our Big Data and AI platforms stable and performant, with maximum uptime and seamless operation. The position bridges the gap between development and operations and applies software engineering principles to infrastructure problems.
Key Responsibilities:
• Own the daily Operations & Maintenance (O&M) of our products, ensuring system health and performance.
• Serve as the first point of contact for customer incidents, providing frontline management and restoring service as needed.
• Guarantee maximum uptime for all deployments through system updates, patches, hotfixes, and upgrades.
• Participate in the incident management process, including issue triage, technical support, and escalation management.
• Coordinate the prioritization and resolution of technical issues with clients, ensuring clear communication.
• Comply with Quality, Health, Safety, and Environment (QHSE), Business Continuity, Information Security, Risk, and Compliance Management policies.
Qualifications:
• 8+ years of product administration experience in Linux-based environments (RHEL, CentOS, Debian & Ubuntu).
• Minimum of 5 years of relevant experience as a Hadoop Administrator, with expert-level knowledge of Cloud Operations.
• Experience managing Big Data platforms using automation tools like Ansible and Helm Charts.
• Knowledge of best practices for Data Warehousing, including business intelligence and business continuity planning.
• Hands-on experience with monitoring platforms (Zabbix, Grafana) and collaboration with data delivery teams.
• Proven experience in implementing and administering applications in Air-Gapped infrastructure.
• Strong competency in Linux administration (security, configuration, tuning, troubleshooting, monitoring) and basic networking troubleshooting.
Skills:
• Proficient in monitoring, supporting, and administering production environments from onboarding to upgrades and recovery.
• Ability to design and maintain automation code for system deployment and configuration.
• Able to follow production deployment procedures while ensuring network and security compliance.
• Able to document daily activities and contribute to the team's knowledge base.
• Capable of transforming ambiguity into clarity and tackling complex problems with a structured approach.
• Strong communication skills, with the ability to convey information clearly and consistently.
What We Look For:
Join us at the company for a culture of innovation, outstanding career growth opportunities, and competitive rewards. We invite you to become part of our community if you are passionate about building resilient systems and thriving in a dynamic, impactful environment.
What Working at the company Offers:
• Culture: An open, diverse, and inclusive environment encouraging personal growth and innovative solutions.
• Career: Opportunities to work on high-impact projects, with resources for continuous growth and learning.
• Rewards: Competitive remuneration package including healthcare, education support, leave benefits, and more.
Location: Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
Work Conditions: Full-time