How to Share a GPU Without Starting a War

The GPUs are here, but not enough of them.
That's the truth for most mid-sized biotech and life sciences compute clusters, anyway.
You’ve got your eight A100s, maybe sixteen if you're lucky, sitting in a rack next to your fast storage and your CPU lanes. But that’s not enough, not when three teams are fighting for time: one running a round of GROMACS on a new polymer simulation, another training a small molecule activity prediction model, and a third pushing a finicky LLM through inference on the back of a proprietary in-house dataset.
And here's the kicker. Each of those jobs grabs a whole GPU and parks on it, even if it only needs a fraction. The LLM sits idle for most of its forward pass and the molecular dynamics code spikes in bursts. That one ML job loads half the GPU's memory and then ticks along at 20% utilization. You get to watch the job queue grow while cards sit cold.
It’s a resource tragedy wrapped in scheduling entropy and it’s where GPU pooling, long confined to academic tinkering or hyperscaler excess, finally starts to matter in real-world science infrastructure.
At first glance, the obvious fix seems to be smarter scheduling. Let Slurm or Kubernetes do more aggressive backfilling. Slice your jobs, micromanage resource requests, give users guidance on GPU size matching.
But that doesn't solve the root of the problem, which is that the unit of allocation is the whole GPU. These are binary decisions: either the job gets a card, or it doesn't. There's no easy middle ground, no concept of giving a job just some of the GPU, unless you're willing to go all-in on containerization and user-level CUDA hackery, or invest in rigid vendor tools like NVIDIA's MIG (Multi-Instance GPU) feature. Which works until it doesn't.
MIG is static, useful for rigid slices. But most life sciences workloads aren't. You can't predict whether your generative chemistry model's next batch will spike memory or coast. You can't guarantee your AlphaFold variant will keep its compute profile consistent as it switches targets. What you need is elasticity: the ability to dynamically give each job what it needs without sacrificing isolation or burning cycles on context switching.
That’s where an approach like gPooling (PDF) enters the conversation.
Born from the constraints of real-world cluster use, it works by hijacking the GPU driver layer itself: not the user CUDA API or orchestration framework, but the actual calls going from the OS kernel to the hardware.
This is critical, because it gives the pooling system visibility into memory allocations, compute commands, and low-level channel control—allowing it to carve a single GPU into virtual slices (vGPUs) with tight boundaries and real-time feedback. You can give one job 30% compute and 4 GB of memory, another 60% and 10 GB, and a third just a sliver, all on the same card. And you can change those allocations on the fly based on actual usage patterns.
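To make that concrete, here's a minimal Python sketch of the bookkeeping a driver-level pooler has to get right before it admits a new slice. None of the names come from gPooling's actual interface, and the 40 GB card is an assumption; it's just the shape of the idea.

```python
from dataclasses import dataclass

# Illustrative only: the class and function names here are not gPooling's API,
# just a sketch of carving one physical card into capped vGPU slices.

@dataclass
class VGpuSlice:
    name: str
    compute_fraction: float  # share of the card's compute time this slice may use
    memory_gb: float         # hard cap on device memory for this slice

def fits_on_card(slices: list[VGpuSlice], total_memory_gb: float = 40.0) -> bool:
    """Check that the requested slices can coexist on one physical GPU."""
    compute_ok = sum(s.compute_fraction for s in slices) <= 1.0
    memory_ok = sum(s.memory_gb for s in slices) <= total_memory_gb
    return compute_ok and memory_ok

# The split from the example above: three jobs sharing one 40 GB card.
slices = [
    VGpuSlice("md-sim",      compute_fraction=0.30, memory_gb=4.0),
    VGpuSlice("ml-training", compute_fraction=0.60, memory_gb=10.0),
    VGpuSlice("llm-infer",   compute_fraction=0.05, memory_gb=6.0),
]

assert fits_on_card(slices)
```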
In life sciences environments, this kind of flexibility is golden.
Imagine a shared compute rack in a translational genomics lab. One team’s running PyTorch-based fine-tuning on transcriptomic embeddings. Another is pushing through a large-scale simulation of ligand-protein docking, maybe with PLUMED on top of GROMACS. A third is iterating on small model inferences using a distilled version of Baichuan or BioGPT.
With gPooling-style vGPU slicing, those jobs don't have to wait. They can coexist, each bounded, isolated, and resource-capped, but all running. That means shorter queues, fewer idle GPUs, and less performance falloff due to contention.
Of course, time-slicing isn't new.
The idea of giving each job a time window on the GPU dates back to the earliest multitasking OS models. What gPooling does differently is combine time-slicing with utilization feedback. It watches how each vGPU is using the card, how often it actually runs kernels, how much memory it touches—and dynamically adjusts future time slices based on that behavior.
If a job spikes past its assigned limit, its next slice gets shorter. If it idles, more time goes to its neighbors. It's not perfect fairness, but it's super efficient because it reacts in real time.
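As a toy sketch in Python, the feedback rule looks something like this. The thresholds and multipliers are invented for illustration, not lifted from the paper:

```python
# Toy sketch of utilization-feedback time-slicing. Assumes each vGPU reports
# what fraction of its last slice it actually kept the card busy.

def next_slices(current_ms: dict[str, float],
                busy_fraction: dict[str, float],
                target_share: dict[str, float],
                window_ms: float = 100.0) -> dict[str, float]:
    """Adjust each vGPU's next time slice based on observed behavior."""
    proposed = {}
    for vgpu, slice_ms in current_ms.items():
        used = busy_fraction[vgpu] * slice_ms        # time it actually ran kernels
        allowed = target_share[vgpu] * window_ms     # time its share entitles it to
        if used > allowed:
            proposed[vgpu] = slice_ms * 0.8          # spiked past its limit: shrink
        elif busy_fraction[vgpu] < 0.5:
            proposed[vgpu] = slice_ms * 0.9          # mostly idle: give time back
        else:
            proposed[vgpu] = slice_ms * 1.1          # busy and within bounds: grow
    # Renormalize so the slices fill exactly one scheduling window; whatever an
    # idle job gave up ends up with its busier neighbors.
    scale = window_ms / sum(proposed.values())
    return {vgpu: ms * scale for vgpu, ms in proposed.items()}
```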
And crucially for scientific work, this control happens without touching the applications themselves. No code changes, no Docker wrapping. Just a better backend.
Ah. But you're worried it might break your tooling.
That's the first question every lab manager or infrastructure lead will ask.
gPooling integrates with Slurm using its existing GRES (generic resource) framework. No source-code changes to Slurm, no rewriting of job scripts. You register vGPUs the same way you'd register nodes or memory tiers. Users submit jobs exactly as they always have, except now they're targeting fractional GPUs instead of whole ones, and if all goes well it's invisible until you want to tune it.
That means you can run your TensorFlow pipelines, your AlphaFold batch jobs, your JAX-based physics-informed models, your Torch drug classifiers—all without worrying whether they’ll break. And when you're ready to scale up? The same system works whether you're sharing one A100 or slicing dozens.
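In practice, a submission could look something like the sketch below. The GRES name vgpu, the script contents, and run_docking.py are placeholders I made up, not anything from gPooling's documentation; the point is that the workflow stays plain sbatch.

```python
import subprocess

# Hypothetical fractional-GPU submission through Slurm's standard GRES
# mechanism. The GRES name "vgpu" and everything inside the job script are
# assumptions for illustration; the actual name depends on how your pooling
# layer registers its resources.

job_script = """#!/bin/bash
#SBATCH --job-name=docking-batch
#SBATCH --mem=16G
#SBATCH --time=04:00:00
# One vGPU slice instead of a whole card; "vgpu" is a placeholder GRES name.
#SBATCH --gres=vgpu:1

python run_docking.py --config configs/ligand_screen.yaml
"""

with open("submit_docking.sh", "w") as f:
    f.write(job_script)

# Submission itself is unchanged: plain sbatch, no wrapper tooling.
subprocess.run(["sbatch", "submit_docking.sh"], check=True)
```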
But performance ain't ever free.
Every layer of virtualization adds some cost. Here it comes in two forms: the one you'll notice first is longer task runtimes (because each job is running on less than a full GPU); the other is occasional slowdowns due to context switching. But in real-world tests (say on workloads like ResNet training, GROMACS simulation, and even LLM inference) the tradeoffs tilted in favor of pooling.
Why? Because wait times collapsed.
In clusters where the queue is the bottleneck, not compute throughput, the ability to start more jobs sooner matters more than any small per-job slowdown. And in life sciences, where compute jobs often come in waves, with some long-ass dry spells followed by furious experimentation, this bursty elasticity is what keeps research flowing.
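A back-of-envelope calculation shows the shape of the tradeoff. The numbers below are invented, not measured; the arithmetic is the point:

```python
# Made-up but plausible numbers to show why a per-job slowdown can still win
# when the queue, not the card, is the bottleneck.

n_jobs = 24            # a burst of experiments arriving at once
n_gpus = 8             # physical cards
job_hours = 2.0        # runtime on a dedicated GPU
slowdown = 1.25        # assumed overhead: pooled jobs run ~25% slower
slices_per_gpu = 3     # vGPUs carved per card

# Exclusive GPUs: jobs run in waves of n_gpus, and the last wave waits.
waves = -(-n_jobs // n_gpus)                            # ceil: 3 waves
exclusive_makespan = waves * job_hours                  # 3 * 2 h = 6 h

# Pooled: all 24 jobs start at once across 8 * 3 slices, each a bit slower.
pooled_waves = -(-n_jobs // (n_gpus * slices_per_gpu))  # 1 wave
pooled_makespan = pooled_waves * job_hours * slowdown   # 1 * 2.5 h = 2.5 h

print(f"exclusive: {exclusive_makespan:.1f} h, pooled: {pooled_makespan:.1f} h")
```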
In fact, in some LLM use cases with tight memory limits, pooling even improved performance: the memory cap forced smarter kv-cache management and batch reuse, cutting total execution time. You're welcome.
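If you want intuition for why a cap can help rather than hurt, the kv-cache arithmetic is easy to sketch. The model shape and memory budget here are generic assumptions, not the configuration behind any particular benchmark:

```python
# Rough arithmetic for why a memory cap forces kv-cache discipline. The model
# shape is a generic 7B-class transformer and the budget is assumed; neither
# comes from a specific deployment.

layers, heads, head_dim = 32, 32, 128
seq_len = 2048
bytes_per_value = 2    # fp16

# Keys and values, for every layer, for every token in the context window.
kv_per_sequence = 2 * layers * heads * head_dim * seq_len * bytes_per_value
print(f"kv-cache per sequence: {kv_per_sequence / 2**30:.2f} GiB")   # ~1 GiB

kv_budget_gib = 8      # assumed headroom a capped slice leaves for kv-cache
max_batch = int(kv_budget_gib * 2**30 // kv_per_sequence)
print(f"concurrent sequences that fit under the cap: {max_batch}")   # 8
```

Once that ceiling is in sight, the serving stack has to evict, reuse, and batch more deliberately, and that discipline is what showed up as a speedup.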