Site Reliability Engineer – Datacentre AI Engineering - Riyadh, KSA

Qualcomm

Riyadh Saudi Arabia | First seen: 06 Feb 2026 21:05 | Qualcomm Careers

Full-time
Job Description
About Us

Qualcomm is enabling a world where everyone and everything can be intelligently connected. You interact with products and technologies made possible by Qualcomm every day, including 5G-enabled smartphones that double as pro-level cameras and gaming devices, smarter vehicles and cities, and the technology behind the smart, connected factories that manufactured your latest purchase. Qualcomm 5G and AI innovations are the power behind the connected intelligent edge. You’ll find our technologies behind and inside the innovations that deliver significant value across multiple industries and to billions of people every day.

About the Role

We are recruiting for a Site Reliability Engineer – Datacentre AI Engineering at Qualcomm Technologies, Inc., located in Riyadh, Saudi Arabia. This role centres on designing, maintaining, and scaling large-scale AI inference systems in a datacentre environment. You will support critical AI use cases, ensuring that Qualcomm’s infrastructure is robust, reliable, and scalable for advanced machine learning workloads.

Key Responsibilities

AI Infrastructure
• Design and maintain large-scale AI inference systems supporting critical AI use cases.
• Ensure reliability, operability, and scalability of the Qualcomm datacentre cluster.
• Build software tools and ecosystems around AI software stacks.

AI & ML Engineering
• Analyse software requirements and consult with architecture and hardware engineers.
• Hands-on experience in building agentic AI solutions, LLM orchestration and agentic AI libraries.
• Collaborate with model, systems and software teams to improve model performance on AI100 deployments.
• Identify features that optimize workloads for multi-SoC and multi-card systems.

Site Reliability Engineering (SRE)
• Implement SRE fundamentals: incident management, monitoring, performance optimization.
• Hands-on experience with MLOps tools and practices, ensuring seamless integration of ML models into production.
• Establish operational maturity frameworks and sustainable incident response protocols.

Observability & Tooling
• Build tools and frameworks to improve observability and define reliability metrics.
• Monitor system health using Prometheus, Grafana, CloudWatch, and custom telemetry.
• Create and maintain documentation and knowledge base articles.

Automation & CI/CD
• Design automation tools to reduce manual processes and operational overhead.
• Ensure CI/CD reliability for agent deployment cycles.
• Apply Infrastructure as Code practices using tools like Terraform CDK.

Required Skillset

AI & Deep Learning
• Experience with LLMs, NLP, Vision, Audio, and Recommendation systems.
• Proficiency with LLM inference concepts: token streaming, batching, KV cache.
• Proficiency in PyTorch, TensorFlow, JAX, and Ray.
• Familiarity with GPU/TPU compute, ML frameworks, checkpointing and distributed inferencing.

AI Agent Operations
• Experience supporting GenAI or agentic AI applications in production.
• Familiarity with LLM orchestration, prompt reliability, and RAG systems.
• Exposure to LangChain, AutoGen, and similar agent orchestration frameworks.

Programming & Software Design
• Strong programming skills in Python with experience in PyTorch.
• Scripting (Python, Bash), configuration management (Ansible/Terraform), orchestration.

Systems & Infrastructure
• Strong Linux fundamentals: shell, systemd, containers, networking (TLS, DNS, HTTP/2, gRPC).
• Expertise in Slurm (configuration, scheduling, plugins/extensions) or equivalent.
• Good knowledge of networking (RDMA, InfiniBand, RoCE, high-throughput, low-latency networks).
• Experience operating and scaling distributed systems with high availability.

Observability & Monitoring
• Hands-on experience with Prometheus, Grafana, ELK, Loki, Datadog, SIP, Homer.
• Exposure to hardware health monitoring and system reliability.

DevOps & SRE Practices
• Deep understanding of SDLC, release management, and system reliability.
• Familiarity with CI/CD pipelines (Jenkins, GitLab) and Infrastructure as Code (Terraform CDK).

Qualifications & Experience
• Bachelor's or Master's degree in Engineering, Machine Learning/AI, Information Systems, Computer Science, or a related field.
• 4-5 years of software engineering or related work experience.

What's on Offer

Apart from working with great people, we offer the below:
• Salary including housing & transport allowance
• Stock (RSUs) and performance-related bonus
• 16 weeks fully paid Maternity Leave
• 6 weeks fully paid Paternity Leave
• Employee stock purchase scheme
• Child Education Allowance
• Relocation and immigration support (if needed)
• Life and Medical Insurance
• Live+ Well Reimbursement for health and recreational membership fees

Metadata

Source: google_jobs

Via: Qualcomm Careers

Search Query: Senior SRE

First Seen: 06 Feb 2026 21:05 UTC

Last Seen: 06 Feb 2026 23:00 UTC

Source Job ID: eyJqb2JfdGl0bGUiOiJTaXRlIFJlbGlhYmlsaXR5IEVuZ2luZWVyIOKAkyBEYXRhY2VudHJlIEFJIEVuZ2luZWVyaW5nIC0gUml5YWRoLCBLU0EiLCJjb21wYW55X25hbWUiOiJRdWFsY29tbSIsImFkZHJlc3NfY2l0eSI6IlJpeWFkaCBTYXVkaSBBcmFiaWEiLCJodGlkb2NpZCI6ImhFM2JxWW5lVmRRS0xCdFlBQUFBQUE9PSIsInV1bGUiOiJ3K0NBSVFJQ0lNVTJGMVpHa2dRWEpoWW1saCIsImhsIjoiZW4ifQ==