DevOps Support Engineer
About AI Factory
The AI Factory operates sovereign AI infrastructure including:
• GPU clusters
• Cloud subscriptions
• Containerized workloads
• API gateways
• Multi-environment deployments (Sandbox → Staging → Production)
The DevOps Support Engineer ensures:
• Infrastructure stability
• Deployment reliability
• Operational continuity for AI workloads
Role Overview
Employment Type: Full-time
Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)
The DevOps Support Engineer is responsible for supporting:
• Cloud infrastructure
• CI/CD pipelines
• Containerized AI workloads
• API gateways
• Production environments
The role focuses on:
• Platform stability
• Environment health
• Deployment reliability
• Infrastructure troubleshooting
• Structured incident management
• Environment discipline
• Production governance
This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.
The engineer acts as the L1 operational responder for infrastructure and platform incidents, ensuring that:
• Issues are diagnosed, contained, and escalated appropriately
• Issues are resolved within defined service levels
Key Responsibilities
1. Infrastructure, Cloud & Environment Support
• Support Azure subscriptions, resource groups, networking, and access control
• Monitor GPU environments, container clusters, and AI runtime environments
• Troubleshoot deployment failures across Sandbox, Staging, and Production
2. DevOps & CI/CD Support
• Monitor CI/CD pipelines and resolve build/deployment issues
• Support Git workflows, version control issues, and release rollouts
• Ensure environment configuration consistency
• Validate infrastructure changes post-deployment
• Perform rollback support when required
3. GPU & AI Runtime Operations Support
• Monitor GPU utilization and allocation
• Identify memory saturation and CUDA/container runtime errors
• Support AI model deployment on GPU nodes
• Detect performance bottlenecks affecting inference services
4. API Gateway, WAF & Integrations
• Troubleshoot API gateway routing issues and throttling policies
• Monitor rate limiting and traffic control mechanisms
• Investigate WAF-related blocking incidents
• Support secure external integrations
• Support integrations with enterprise systems: Microsoft 365, SharePoint, Teams, Oracle, and Jira
• Troubleshoot authentication issues, webhook failures, and API timeouts
5. Observability & Incident Response
• Monitor service availability, CPU/GPU utilization, memory, storage, and logs
• Detect infrastructure bottlenecks affecting AI workloads
• Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
• Perform triage using logs, metrics, system databases, and environment diagnostics
• Classify incidents by severity and business impact in line with defined SLAs
• Contain and mitigate production‑impacting issues
• Coordinate with L2/L3 teams and vendors
• Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
• Track each incident through its lifecycle to closure and prevent SLA breaches
6. Documentation & Knowledge Management
• Maintain and improve infrastructure runbooks, deployment troubleshooting guides, environment configuration documentation, and FAQs
• Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
• Handle ITSM/ticketing documentation
• Capture and publish Root Cause Analysis (RCA) summaries for major incidents
• Update environment diagrams and operational checklists after changes
7. Platform Reliability
• Support Kubernetes clusters, Docker containers, and orchestration layers
• Validate scaling, failover, and resilience mechanisms
• Ensure uptime SLAs for AI products, platforms, and APIs
8. Security & Compliance Coordination
• Support IAM, access control, WAF, and network configurations
• Coordinate with security teams for incident remediation
• Ensure adherence to environment governance policies
Required Technical Skills
• Strong hands‑on experience with Azure (AWS/GCP acceptable)
• Experience supporting Kubernetes and Docker environments
• Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
• Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
• Understanding of networking, IAM, API gateways, and WAF
• Experience supporting production cloud environments under SLA constraints
• Familiarity with Infrastructure‑as‑Code concepts (ARM/Terraform)
Experience
• 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE‑aligned roles
• Experience supporting containerized or AI workloads preferred
• Exposure to regulated or government environments advantageous
• Arabic language proficiency is a plus