Responsibilities
• GPU Infrastructure: Deploy and maintain high-performance GPU clusters.
• AI Lifecycle: Manage the full lifecycle of AI services, including inference deployment (Triton, vLLM, custom services), autoscaling, and seamless rollout/rollback strategies.
• Data Management: Manage model storage, artifact versioning, caching, and high-speed data access via S3-compatible storage.
• Observability: Monitor performance metrics including latency, throughput, error budgets, resource limits, and cost/performance ratios.
• High Availability: Ensure fault tolerance for payment services through SLA/SLO management, redundancy, disaster recovery planning, and regular recovery testing.
• Fintech-Grade Security: Implement secrets management, HSM/managed KMS integration, infrastructure hardening, and audit logging.
• Secure CI/CD: Build secure pipelines featuring artifact signing, vulnerability scanning, policy gates, and isolated environments.
• Node Operations: Deploy and maintain crypto nodes (Full, Archive, RPC) across various networks.
• Automation: Automate node updates, synchronization monitoring, and health checks.
• Storage & Performance: Manage disk I/O (IOPS/RAID), protect RPC endpoints, and manage access controls.
• Metrics: Monitor for sync lags, chain forks, and consensus issues.
Requirements
• 5+ years of experience in DevOps, SRE, or Platform Engineering (fintech experience is mandatory)
• Deep expertise in Linux, networking (TCP/IP, DNS, TLS, routing), and complex troubleshooting
• Production experience with K8s, Helm, Ingress, autoscaling, network policies, and resource management
• Proficiency in GitHub Actions, GitLab CI, or Jenkins
• Hands-on experience with Prometheus + Grafana, logging (Loki/ELK), and tracing (OpenTelemetry/Jaeger)
• Experience with GPU clusters and ML stacks (NVIDIA drivers, CUDA, MIG, GPU monitoring)
• Production-level operation of Postgres, Redis, Kafka, or RabbitMQ
• Practical knowledge of Vault, KMS, RBAC, OPA/Gatekeeper/Kyverno, Trivy, and SBOM
About the Company
The company is a fintech innovator operating a proprietary Payment Service Provider (PSP) platform, advanced AI infrastructure (including on-prem GPU/bare-metal servers), and a dedicated crypto division focused on node infrastructure. It runs a multi-cloud environment (AWS/Hetzner/DigitalOcean) and is looking for a seasoned engineer to build and maintain a resilient, secure, and scalable platform that powers production payments and high-performance AI services.