After 18 months of running GPU workloads on Kubernetes, we made a decision that surprised even us: we migrated our entire inference infrastructure to HashiCorp Nomad. This wasn't a whim; it was the result of hitting fundamental scaling limits with how K8s handles heterogeneous hardware.
The Breaking Point
Kubernetes was designed for stateless web services. It's brilliant at that. But when you're scheduling workloads that need specific GPU models, fractional GPU allocation, NUMA-aware memory placement, and topology-aware networking, you're fighting the scheduler instead of working with it.
Our pain points: the K8s device plugin framework treats GPUs as opaque integer resources (all you can request is a count, e.g. nvidia.com/gpu: 2), so you can't express "2 A100s on the same NUMA node with NVLink interconnect." Scheduling latency averaged 4 minutes due to device plugin, topology manager, and pod admission overhead. And bin-packing was terrible, leaving 35-40% of GPU capacity wasted.
The moment we realized we'd written more custom K8s scheduler code than actual application code, we knew something had to change.
Why Nomad
Nomad's scheduler maintains a rich understanding of node topology. GPU model, interconnect, and memory layout are all first-class scheduling constraints. The results: GPU bin-packing improved 23% (~$2.1M annual savings), deploy latency dropped from ~4min to ~45s, and operational overhead was cut roughly in half.
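To make "first-class scheduling constraints" concrete, here's a minimal sketch of a device request in a Nomad job spec. The job name, image, and numbers are illustrative, not our production specs; the device, constraint, and affinity blocks follow Nomad's published job specification syntax.

```hcl
job "llm-inference" {
  datacenters = ["dc1"]

  group "serving" {
    task "model-server" {
      driver = "docker"

      config {
        # Illustrative image, not our actual serving container.
        image = "example/inference:latest"
      }

      resources {
        cpu    = 8000
        memory = 65536

        # Request two GPUs of a specific model, not just "2 GPUs".
        device "nvidia/gpu" {
          count = 2

          # Hard constraint on the device model reported by the fingerprinter.
          constraint {
            attribute = "${device.model}"
            operator  = "regexp"
            value     = "A100"
          }

          # Soft preference for devices with more on-board memory.
          affinity {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "40 GiB"
            weight    = 50
          }
        }
      }
    }
  }
}
```

Because the scheduler sees device model and attributes directly, rather than an opaque count handed off to a device plugin, constraints like these participate in placement and bin-packing decisions. Newer Nomad releases also expose NUMA-aware scheduling at the resources level; check the docs for your version and edition.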
The Trade-offs
Nomad's ecosystem is a fraction of Kubernetes'. We lost hundreds of operators, CRDs, and integrations. Service mesh integration needed custom work. Hiring is harder: every infra engineer knows K8s, far fewer know Nomad.
Our Recommendation
For GPU-heavy inference: Nomad is better. For general microservices: keep K8s. For many orgs, a hybrid approach makes sense. We've open-sourced our Nomad job specs and GPU bin-packing plugin. Links in thread below.