ML Kernel Performance Engineer, Edge AI and Science

  • Amazon
  • Vancouver, British Columbia
  • Full Time

Amazon Devices is an inventive research and development company that designs and engineers high-profile consumer products like the Kindle family, Fire Tablets, Fire TV, Health & Wellness devices, Amazon Echo, and Astro. We are building the next generation of edge AI capabilities through our advanced compression platform and custom neural accelerator silicon.

Within Edge AI & Science, the AI Platform team builds a compression platform, the first of its kind, enabling 20-100x neural network compression for edge and cloud deployment. As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost.

We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves.

Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.

Key job responsibilities
Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators

Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis using profiling tools to identify and resolve bottlenecks that stretch model training from days to weeks

Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns

Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale, lowering the bar from "CUDA expert" to "any engineer with agents"

Maintain and extend the team's training kernels library with clean interfaces, CI, and examples that enable scientists to contribute kernel improvements alongside platform engineers

Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions that unify software and hardware for both cloud and edge deployment

Develop inference kernels for cloud deployment (custom backends for quantized models that keep weights packed in memory and reconstruct on the fly for compute)

Build and maintain performance regression tests and benchmarking infrastructure that track kernel efficiency as models scale from billions to hundreds of billions of parameters

A day in the life
A scientist files a ticket: "QAT training on our large model is 4x slower than expected." You pull up the profiler, identify that a custom quantizer kernel is thrashing shared memory at scale, write a Triton replacement that tiles correctly for the layer shapes at that model size, validate accuracy in the test harness, and push it to the kernels repo. By end of day, the training run that was taking four days now takes one.

You will also build the tooling that makes this workflow repeatable by others. You will participate in design discussions with Applied Scientists, translate their algorithmic ideas into efficient GPU implementations, and work in a startup-like environment where every engineering hour directly accelerates the team's ability to ship compressed models.

About the team
The AI Platform team builds Amazon's neural network compression platform. We compress models using knowledge distillation, network restructuring, and advanced quantization to achieve 20-100x compression while preserving model quality. Our platform packages these into automated pipelines that deploy to both custom edge silicon and GPU-based cloud inference.

As model sizes grow, the proprietary advantage shifts from the science to the software (making it work at hundreds of billions of parameters is the moat). GPU kernel performance is the biggest single lever on training throughput, and we expect AI-assisted development tooling to significantly multiply engineering productivity, meaning a small team with the right harness can operate at the scale of a much larger one.

The ML Kernel Performance Engineer bridges science and platforms: you turn algorithmic innovations into production-grade GPU code that runs at scale. You will work alongside Applied Scientists, compiler engineers, hardware architects, and platform developers in a small, agile team building the next generation of edge AI for Amazon's consumer products.

Basic Qualifications

- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
- Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
- Experience with programming languages such as Python, Java, C++

Preferred Qualifications

- Bachelor's degree in computer science or equivalent
- 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
- Proficiency in low-level performance optimization for GPUs
- Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
- Experience developing high-performance libraries for ML or HPC applications
- Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
- Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
- Experience with parallel programming and optimization techniques
- Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
- Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
- Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
- Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
- Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
- Experience with edge deployment, model compilation, or hardware-aware optimization

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. As a total compensation company, Amazon's package may include other elements such as sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon offers comprehensive benefits including health insurance (medical, dental, vision, prescription, basic life & AD&D insurance), Registered Retirement Savings Plan (RRSP), Deferred Profit Sharing Plan (DPSP), paid time off, and other resources to improve health and well-being. We thank all applicants for their interest; however, only those interviewed will be advised as to hiring status.

CAN, BC, Vancouver - 114,800.00 - 191,800.00 CAD annually
Job ID: 523330953
Originally Posted on: 6/2/2026
