On-prem Platform Engineer

  • TekGlobal
  • Charlotte, North Carolina
  • Full Time

Role: On-prem Platform Engineer

Location: Charlotte, NC

Key Skills:

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization

  • vLLM, TensorRT-LLM, Triton Inference Server, SGLang
  • Inference optimization techniques:
    • Continuous batching
    • Speculative decoding
    • KV cache / Prefix caching
  • Model optimization:
    • FP8, AWQ, GPTQ

Distributed & GPU Systems

  • Tensor parallelism and large model scaling
  • CUDA, NCCL, GPU architecture
  • GPU partitioning & optimization (MIG)

Kubernetes & ML Serving

  • Kubernetes-based ML serving platforms
  • KServe, OpenShift AI
  • Helm charts, Operators, platform automation

GPU Orchestration

  • Run:AI or similar GPU scheduling/orchestration platforms
  • Multi-tenant GPU workload management

Platform Engineering

  • Experience building internal AI/ML platforms (on-prem or hybrid)
  • Strong automation and system design mindset

Observability & Performance

  • Prometheus, Grafana
  • ML observability (model latency, throughput, drift, resource utilization)
  • Performance benchmarking and tuning

Good to Have / Preferred Skills

  • Experience with LLMOps / GenAI pipelines
  • Exposure to hybrid cloud (on-prem + Google Cloud Platform/Azure integration)
  • Familiarity with AWS Inferentia or other alternative accelerators
  • Knowledge of service mesh / networking in GPU clusters

Responsibilities:

  • Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
  • Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, along with advanced techniques (continuous batching, speculative decoding, KV caching).
  • Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
  • Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
  • Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ, and performance tools such as GuideLLM and Locust.
  • Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
  • Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.

Job ID: 523336680
Originally Posted on: 6/2/2026
