Senior Platform Engineer

Whitefiber HPC, Inc
Seattle, Washington
Full Time

Title: Senior or Staff Platform Engineer
Location: FULLY remote!
Salary: $175k-$275k base + RSUs + Full Benefits
Requirements: 3+ years in Systems Engineering or HPC Infrastructure, strong Linux and bare-metal GPU experience, NVIDIA DGX/HGX, InfiniBand/RoCE, and automation with Python or Go

We build the high-performance, bare-metal GPU infrastructure that powers modern AI. Our team designs and operates large-scale NVIDIA DGX/HGX clusters, high-speed networking, and the automation that turns complex hardware into a reliable, production-ready platform. We work directly with the metal: provisioning nodes, tuning Linux, integrating InfiniBand/RoCE, and building the tooling that enables fast, secure, and scalable AI workloads.

If you want to help shape the systems that make large-scale AI possible, this is where you will do it. We are looking for a Senior or Staff-level Platform Engineer to architect and operate the high-performance GPU infrastructure that powers next-generation AI systems. This is not a traditional cloud role - you will own the full lifecycle of bare-metal GPU clusters, from "empty rack" to production-grade Kubernetes, and build the automation that makes large-scale AI infrastructure reliable, observable, and secure.

If you thrive at the intersection of hardware, distributed systems, and automation - and you love solving the problems that live between teams - you will feel right at home here.

What You'll Be Doing

- Design and operate container orchestration platforms optimized for NVIDIA DGX/HGX-class hardware.
- Build bare-metal provisioning systems (PXE, Ironic, MAAS) to bring GPU clusters online at scale.
- Manage GPU lifecycle: driver stacks, CUDA/kernel compatibility, MIG slicing, and performance tuning.
- Partner with Network Engineering and DCOps to align physical infrastructure with software orchestration.
- Build automation and internal tooling in Go or Python to streamline cluster operations.
- Implement Terraform/Ansible-based IaC for fully auditable, repeatable infrastructure.
- Design high-resolution observability stacks (Prometheus/Grafana, DCGM, VictoriaMetrics).
- Participate in a specialized on-call rotation supporting GPU workloads and core platform services.

What You Need for this Position

- 7+ years in systems, platform, or distributed systems engineering (10+ for Staff).
- Expert-level Linux knowledge: kernel modules, sysctl tuning, hugepages, container runtimes.
- Hands-on experience bootstrapping Kubernetes or SLURM on physical hardware.
- Strong proficiency in Go (preferred) or Python for systems-level automation.
- Deep familiarity with NVIDIA GPU ecosystems (drivers, CUDA, MIG).
- Working knowledge of InfiniBand or RoCEv2 networking and NCCL performance tuning.
- Experience building observability pipelines for hardware-accelerated environments.
- Ability to troubleshoot complex, multi-layered issues across hardware, networking, and orchestration.
- Strong cross-team communication - you're the "glue" between Network, DCOps, and Software.

Bonus Points

- Experience with SLURM, Kubeflow, or distributed PyTorch.
- Integrating vendor APIs (NetBox, Vault, GitLab CI, etc.) into unified workflows.
- Infrastructure testing, chaos engineering, or cluster-level integration test suites.
- Designing telemetry aggregation across hardware, networking, and environmental systems.

What's In It for You

- $175k - $275k/year DOE
- RSUs
- 5 weeks PTO
- 401k w/ match
- Comprehensive Benefit Plan

Job ID: 523185320

Originally Posted on: 6/1/2026
