DevOps Support Engineer

NorthBay Solutions LLC

United Arab Emirates | First seen: 08 Mar 2026 01:30 | Jooble

Posted: 1 day ago | Full-time

Job Description

About AI Factory

The AI Factory operates sovereign AI infrastructure including:
• GPU clusters
• Cloud subscriptions
• Containerized workloads
• API gateways
• Multi-environment deployments (Sandbox → Staging → Production)

The DevOps Support Engineer ensures:
• Infrastructure stability
• Deployment reliability
• Operational continuity for AI workloads
Role Overview

Employment Type: Full-time

Work Arrangement: Onsite (Applicants based outside the UAE are required to relocate)

The DevOps Support Engineer is responsible for supporting:
• Cloud infrastructure
• CI/CD pipelines
• Containerized AI workloads
• API gateways
• Production environments

The role focuses on:
• Platform stability
• Environment health
• Deployment reliability
• Infrastructure troubleshooting
• Structured incident management
• Environment discipline
• Production governance

This is an operational reliability role aligned with modern DevOps, SRE, and AIOps practices.

The engineer acts as the L1 operational responder for infrastructure/platform incidents and ensures that:
• Issues are diagnosed, contained, and escalated appropriately
• Incidents are resolved within defined service levels
Key Responsibilities
1. Infrastructure, Cloud & Environment Support
• Support Azure subscriptions, resource groups, networking, and access control
• Monitor GPU environments, container clusters, and AI runtime environments
• Troubleshoot deployment failures across Sandbox, Staging, and Production
2. DevOps & CI/CD Support
• Monitor CI/CD pipelines and resolve build/deployment issues
• Support Git workflows, version control issues, and release rollouts
• Ensure environment configuration consistency
• Validate infrastructure changes post-deployment
• Perform rollback support when required
3. GPU & AI Runtime Operations Support
• Monitor GPU utilization and allocation
• Identify memory saturation and CUDA/container runtime errors
• Support AI model deployment on GPU nodes
• Detect performance bottlenecks affecting inference services
4. API Gateway, WAF & Integrations
• Troubleshoot API gateway routing issues and throttling policies
• Monitor rate limiting and traffic control mechanisms
• Investigate WAF-related blocking incidents
• Support secure external integrations
• Support integrations with enterprise systems: Microsoft 365, SharePoint, Teams, Oracle, and Jira
• Troubleshoot authentication issues, webhook failures, and API timeouts
5. Observability & Incident Response
• Monitor service availability, CPU/GPU utilization, memory, storage, and logs
• Detect infrastructure bottlenecks affecting AI workloads
• Act as first-line responder for infrastructure and platform-related incidents (P0–P3)
• Perform triage using logs, metrics, system databases, and environment diagnostics
• Classify incidents by severity and business impact in line with defined SLAs
• Contain and mitigate production‑impacting issues
• Coordinate with L2/L3 teams and vendors
• Escalate with full diagnostic context (logs, metrics snapshots, timestamps, components)
• Track incident lifecycle to closure and ensure no SLA breach
6. Documentation & Knowledge Management
• Maintain and improve infrastructure runbooks, deployment troubleshooting guides, environment configuration documentation, and FAQs
• Document recurring failure patterns (deployment errors, GPU saturation, network misconfigurations)
• Handle ITSM/ticketing documentation
• Capture and publish Root Cause Analysis (RCA) summaries for major incidents
• Update environment diagrams and operational checklists after changes
7. Platform Reliability
• Support Kubernetes clusters, Docker containers, and orchestration layers
• Validate scaling, failover, and resilience mechanisms
• Ensure uptime SLAs for AI products, platforms, and APIs
8. Security & Compliance Coordination
• Support IAM, access control, WAF, and network configurations
• Coordinate with security teams for incident remediation
• Ensure adherence to environment governance policies
Required Technical Skills
• Strong hands‑on experience with Azure (AWS/GCP acceptable)
• Experience supporting Kubernetes and Docker environments
• Familiarity with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins)
• Experience with monitoring tools (Azure Monitor, Dynatrace, Grafana)
• Understanding of networking, IAM, API gateways, and WAF
• Experience supporting production cloud environments under SLA constraints
• Familiarity with Infrastructure‑as‑Code concepts (ARM/Terraform)
Experience
• 4–7 years in DevOps, Cloud Operations, Platform Support, or SRE‑aligned roles
• Experience supporting containerized or AI workloads preferred
• Exposure to regulated or government environments advantageous
• Arabic language proficiency is a plus

Metadata

Source: google_jobs

Via: Jooble

Search Query: Senior DevOps Engineer

First Seen: 08 Mar 2026 01:30 UTC

Last Seen: 09 Mar 2026 19:30 UTC

Source Job ID: eyJqb2JfdGl0bGUiOiJEZXZPcHMgU3VwcG9ydCBFbmdpbmVlciIsImNvbXBhbnlfbmFtZSI6Ik5vcnRoQmF5IFNvbHV0aW9ucyBMTEMiLCJhZGRyZXNzX2NpdHkiOiJVbml0ZWQgQXJhYiBFbWlyYXRlcyIsImh0aWRvY2lkIjoiQXpYaTk4REh0dmI2eDZfc0FBQUFBQT09IiwidXVsZSI6IncrQ0FJUUlDSVVWVzVwZEdWa0lFRnlZV0lnUlcxcGNtRjBaWE0iLCJobCI6ImVuIn0=