About Supermicro:
Supermicro is a Top-Tier provider of advanced server, storage, and networking solutions for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, Hyperscale, HPC, and IoT/Embedded customers worldwide. We are the #5 fastest-growing company among the Silicon Valley Top 50 technology firms. Our unprecedented global expansion has provided us with the opportunity to offer a large number of new positions to the technology community. We seek talented, passionate, and committed engineers, technologists, and business leaders to join us.
Job Summary:
As a Sr Cloud Reliability Engineer for our Linux-based AI cloud platforms, you will help us deploy, scale, automate, and ensure high availability, performance, scalability, and security across GPU-accelerated compute clusters, Kubernetes workloads, and supporting storage/network infrastructure. You will bridge Development and Operations by automating infrastructure deployment, enhancing observability, and applying SRE best practices to support reliable AI development environments, AI developer tools, and MLOps platforms in our on-premises and hybrid cloud environments. This role will also support the operational enablement of AI development by helping manage AI platform services, developer access, API key governance, MCP servers, and other shared tooling required for secure and reliable use of modern AI workflows.
Essential Duties and Responsibilities:
Includes the following essential duties and responsibilities (other duties may also be assigned):
- Cloud Infrastructure Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
- Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, VAST, DDN, or Weka). Understand the tools required to benchmark and ensure consistent application performance.
- AI Platform Operations: Support reliable operation of shared AI platform services used by development teams, including model-serving infrastructure, AI development environments, inference services, and internal AI tooling deployed on GPU-based on-premises infrastructure.
- AI Developer Tooling Support: Help enable and support secure enterprise use of AI developer tools and services, including API-based AI platforms, developer integrations, and related service configurations used in software development workflows.
- MCP / Tool Integration Services: Deploy, configure, secure, and support MCP servers and related service endpoints that enable controlled integration between AI tools, development environments, internal systems, and approved data/services.
- Access and API Key Governance: Establish and help operate secure processes for AI service credentials, API keys, tokens, secrets rotation, usage controls, and environment-based access management to support development, testing, and production governance.
- Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies, service degradation, or abnormal GPU/AI platform behavior.
- Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads, large-scale model training/inference, and shared developer platform demand.
- Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
- CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.
- Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, TLS, secrets management, and network segmentation policies.
- Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, API key management procedures, MCP service standards, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience), plus 8 years of experience in the areas below
- Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes)
- Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm)
- Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.)
- Strong scripting and coding skills (Bash, Python, or Go)
- Experience supporting shared platform services for developers in production environments
- Familiarity with AI development workflows and the operational needs of teams building with AI tools, APIs, and model services
- Experience with secrets, credentials, certificates, or API key management in enterprise environments
- Exposure to secure multi-tenant environments and zero trust architecture
- Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics
- Excellent collaboration and communication skills for cross-team, partner, and customer initiatives
Preferred Qualifications:
- Understanding of AI/ML reference architectures and experience with ML workflow tools such as MLflow or Kubeflow
- Familiarity with AI developer platforms, model APIs, inference services, and secure integration patterns for enterprise AI use cases
- Experience deploying or supporting MCP servers or similar integration services for AI tool ecosystems
- Familiarity with API key lifecycle management, secrets vaults, token-based authentication, and environment-based access controls
- Familiarity with storage backends optimized for AI
- Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman
- Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking
- Familiarity with ITIL processes or structured change management in production systems is a plus
- Certifications: CKA, CKAD, Linux+, or related credentials
Salary Range
$145,000 - $165,000
The salary offered will depend on several factors, including your location, level, education, training, specific skills, years of experience, and comparison to other employees already in this role. In addition to a comprehensive benefits package, candidates may be eligible for other forms of compensation, such as participation in bonus and equity award programs.
EEO Statement
Supermicro is an Equal Opportunity Employer and embraces diversity in our employee population. It is the policy of Supermicro to provide equal opportunity to all qualified applicants and employees without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, protected veteran status or special disabled veteran, marital status, pregnancy, genetic information, or any other legally protected status.