Senior AI Platform Engineer

Drata
San Francisco, California
Full Time

Location: San Francisco, CA - Hybrid, 3 days a week (near Oracle Park)

Position Overview

We are seeking a Senior AI Platform Engineer to design, build, and operate a scalable, secure platform that enables ML engineers and data scientists to develop, deploy, and monitor large-scale AI/LLM applications. The role focuses on infrastructure for model training and inference, integration with LLM APIs and vector databases, and orchestration of AI workflows across cloud environments, using infrastructure-as-code and strong automation.

Key Responsibilities

Design, build, and maintain scalable AI/ML infrastructure for training, evaluation, and production inference across cloud providers.
Implement and operate model serving and inference platforms integrating LLM APIs (OpenAI, Anthropic, etc.) and self-hosted models.
Develop and maintain vector database integrations (e.g., Pinecone, Milvus, FAISS) for semantic search and retrieval-augmented generation.
Author and maintain Infrastructure-as-Code (Terraform, CloudFormation, or equivalent) and CI/CD pipelines to automate environment provisioning and deployments.
Build and maintain containerized workloads and orchestration on Kubernetes (or managed K8s services) for training and inference.
Implement AI orchestration and workflow tools (e.g., Airflow, Dagster, Prefect, Kubeflow, Ray) to schedule and manage end-to-end ML pipelines.
Establish observability, logging, metrics, and alerting for model performance, data drift, latency, and cost using tools like Prometheus, Grafana, and application tracing.
Collaborate closely with ML engineers, data scientists, security, and SRE teams to ensure reproducible experiments, secure deployments, and cost-effective operations.
Optimize system performance and cost at scale, including autoscaling, batching, sharding, and caching strategies for low-latency inference.
Mentor junior engineers, own architecture and operational best practices, and contribute to the technical roadmap and cross-functional planning.

Qualifications

5+ years of engineering experience building cloud-native infrastructure for production systems; 3+ years focused on AI/ML platforms or MLOps.
Strong proficiency in Python for automation, tooling, and integrations with ML frameworks and APIs.
Hands-on experience integrating and operating LLM APIs and model serving frameworks for inference at scale.
Deep experience with cloud infrastructure (AWS, GCP, or Azure) including networking, IAM, managed services, and cost controls.
Proficiency with Infrastructure-as-Code tools (Terraform, CloudFormation, Pulumi) and experience designing reproducible provisioning processes.
Experience with Kubernetes, Docker, and container orchestration for machine learning workloads.
Practical knowledge of vector databases and embedding pipelines (e.g., Pinecone, Milvus, FAISS) and retrieval-augmented generation patterns.
Experience with AI orchestration and workflow tools (e.g., Airflow, Dagster, Prefect, Kubeflow, Ray) to build repeatable ML pipelines.
Strong observability and SRE practices for ML systems: monitoring, logging, tracing, and incident response.
Solid engineering fundamentals: distributed systems, networking, security, testing, and CI/CD best practices.
Excellent communication and collaboration skills; experience mentoring engineers and driving cross-team initiatives.
Bachelor's degree in Computer Science, Engineering, or equivalent experience; advanced degree or relevant certifications preferred.

Job ID: 523188286

Originally Posted on: 6/1/2026
