MinT: A New Way to Manage LLM Versions

AI services rarely run a single monolithic base model. Even from the same foundation, organizations continuously spin up purpose-specific variants: a legal team might run a version fine-tuned on legal data, while customer A gets their own dedicated version, and customer B gets theirs. On top of that, you need yesterday’s experimental version, today’s, and rollback points for when things go wrong.

Until now, managing all these variants typically meant treating each one as a separate full checkpoint. A 30B-parameter model stored at 16-bit precision (2 bytes per parameter) weighs in at roughly 60 GB of weights alone, so with just 100 variants, you’re already looking at 6 TB. And storage is only part of the cost: moving, loading, and caching those checkpoints between training and serving machines also brings network transfer time, GPU memory pressure, and loading latency. The more versions you have, the slower deployments become, and the harder it gets to run concurrent experiments or roll back quickly.
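The arithmetic behind those figures is easy to check. The sketch below assumes 2 bytes per parameter (16-bit weights) and decimal units (1 GB = 10^9 bytes); the function name is illustrative and not taken from the paper.

```python
BYTES_PER_PARAM_16BIT = 2  # fp16 or bf16: 16 bits = 2 bytes


def checkpoint_bytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> int:
    """Size of the raw weights only (no optimizer state, no metadata)."""
    return num_params * bytes_per_param


one_checkpoint = checkpoint_bytes(30_000_000_000)  # 30B parameters
fleet = 100 * one_checkpoint                       # 100 full-checkpoint variants

print(one_checkpoint / 1e9)  # 60.0 (GB)
print(fleet / 1e12)          # 6.0 (TB)
```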

So the real question is: do you actually need to rebuild and ship the entire model every time?

What if you could keep the large base model in place, and just swap out the parts that change — like interchangeable components?

That’s exactly the idea behind MinT (Managed Infrastructure for Training and Serving Millions of LLMs).

Rethinking What LoRA Is For

LoRA (Low-Rank Adaptation) is a technique that trains only a small set of low-rank matrices instead of updating the entire model. In effect, it attaches a small add-on to the base model to produce the desired behavior, rather than rewriting the model itself.
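In the standard LoRA formulation, a frozen weight matrix W is left untouched and the layer output becomes W x + (alpha / r) * B (A x), where A and B are the small trainable matrices and r is the rank. Here is a minimal pure-Python sketch of that idea. The matrix sizes and values are toy numbers chosen for illustration, not anything specific to MinT.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B are trained."""
    base = matvec(W, x)                  # frozen base model path
    low_rank = matvec(B, matvec(A, x))   # adapter path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


W = [[1, 0, 0], [0, 1, 0]]   # 2 x 3 frozen weight
A = [[1, 1, 1]]              # r x d_in  with r = 1
B = [[0.5], [0.0]]           # d_out x r
y = lora_forward(W, A, B, x=[1, 2, 3], alpha=2, r=1)
print(y)  # [7.0, 2.0]

# Illustrative layer size: a 4096 x 4096 projection with rank 16.
print(4096 * 4096, "full weights vs", lora_param_count(4096, 4096, 16), "adapter weights")
```

With those illustrative dimensions the adapter holds 131,072 values against 16,777,216 in the full matrix, which is why shipping only the adapter is so much cheaper.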

What makes MinT different isn’t LoRA itself, but how it treats LoRA. Rather than viewing LoRA as simply a lightweight fine-tuning technique, MinT uses LoRA adapters as the fundamental unit of operations, spanning training, evaluation, rollout generation, serving, and rollback.

To see why this matters, consider the alternatives. Full fine-tuning creates a complete model checkpoint for every variant. Merge-based LoRA trains lightly with LoRA but then folds the adapter back into the base model before serving, so at inference time, you’re still dealing with a full-weight checkpoint. The serving-time burden doesn’t go away.

MinT takes a different path. The base model stays resident on the inference worker at all times. When training finishes, MinT doesn’t build a new full model. It exports only the LoRA adapter capturing what changed, and attaches that as an adapter revision to the already-resident serving engine. What crosses the training-serving boundary isn’t a full model; it’s a small, versioned adapter.
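As a rough mental model of that handoff, the sketch below exports only the adapter tensors from a training state and attaches them to an engine whose base model is loaded once. Everything here is hypothetical: the `.lora_` key convention, the `ResidentEngine` class, and its methods are invented for illustration and are not MinT's actual interface.

```python
def export_adapter(training_state: dict) -> dict:
    """Keep only the LoRA tensors; frozen base weights never leave the trainer.
    Assumes (hypothetically) that adapter tensors have '.lora_' in their names."""
    return {name: t for name, t in training_state.items() if ".lora_" in name}


class ResidentEngine:
    """Toy stand-in for an inference worker that keeps the base model loaded."""

    def __init__(self, base_name: str):
        self.base_name = base_name
        self.base_load_count = 1   # base weights are loaded once, at startup
        self.adapters = {}         # (policy, revision) -> adapter tensors

    def attach(self, policy: str, revision: int, tensors: dict) -> None:
        # Registering a revision does not touch or reload the base model.
        self.adapters[(policy, revision)] = tensors

    def generate(self, policy: str, revision: int, prompt: str) -> str:
        if (policy, revision) not in self.adapters:
            raise KeyError(f"revision {policy}@{revision} is not attached")
        return f"[{self.base_name} + {policy}@{revision}] {prompt}"


training_state = {
    "layers.0.q_proj.weight": b"<large frozen base tensor>",
    "layers.0.q_proj.lora_A": b"<small>",
    "layers.0.q_proj.lora_B": b"<small>",
}
adapter = export_adapter(training_state)

engine = ResidentEngine("base-30b")
engine.attach("legal", 1, adapter)
engine.attach("legal", 2, adapter)   # a later training step, same resident base
print(engine.generate("legal", 2, "hi"))
print(engine.base_load_count)        # still 1
```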

Figure: What moves between training and serving

The Two Core Concepts in MinT

To understand MinT, two concepts are central.

An adapter revision is a fixed, exported snapshot of a LoRA adapter, frozen at a specific training step and stored in serving tensor layout. It’s the unit selected for rollout, evaluation, online serving, and rollback: the concrete LoRA file that produces a given behavior.

But an adapter file alone isn’t enough to run a service. You also need to track which base model it’s compatible with, what configuration it was trained under, where the latest training state lives, what rollout records it carries, and which revisions are ready to serve. MinT stores all of this in a policy record, the service-owned lifecycle state that makes a behavior reproducible, resumable, and rollbackable.

The distinction is clean: an adapter revision is the executable unit carrying a behavior; a policy record is the management state that makes that behavior trainable, evaluable, servable, and recoverable. Together, they let MinT manage vast numbers of model versions not as full checkpoints, but as small adapters paired with metadata.
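One way to picture the split is as two small data structures: an immutable revision and a mutable record that tracks it. The field names below are assumptions derived from the description above, not the paper's schema, and the `publish` and `rollback` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AdapterRevision:
    """Immutable snapshot: the executable unit."""
    revision_id: str
    training_step: int
    uri: str  # where the exported adapter tensors live (hypothetical field)


@dataclass
class PolicyRecord:
    """Mutable lifecycle state: the management unit."""
    policy_id: str
    base_model: str                       # which base the adapter is compatible with
    training_config: dict
    latest_training_state_uri: Optional[str] = None
    rollout_records: List[str] = field(default_factory=list)
    servable_revisions: List[AdapterRevision] = field(default_factory=list)
    serving_revision_id: Optional[str] = None

    def publish(self, revision: AdapterRevision) -> None:
        """Mark a revision as ready to serve and make it the active one."""
        self.servable_revisions.append(revision)
        self.serving_revision_id = revision.revision_id

    def rollback(self) -> str:
        """Point serving at the previous servable revision."""
        ids = [rev.revision_id for rev in self.servable_revisions]
        if self.serving_revision_id not in ids:
            raise RuntimeError("nothing is currently being served")
        position = ids.index(self.serving_revision_id)
        if position == 0:
            raise RuntimeError("no earlier revision to roll back to")
        self.serving_revision_id = ids[position - 1]
        return self.serving_revision_id


record = PolicyRecord("customer-a", base_model="base-30b", training_config={"rank": 16})
record.publish(AdapterRevision("rev-1", training_step=100, uri="store://customer-a/rev-1"))
record.publish(AdapterRevision("rev-2", training_step=200, uri="store://customer-a/rev-2"))
print(record.serving_revision_id)  # rev-2
print(record.rollback())           # rev-1
```

Note that rollback in this sketch only changes a pointer in the record. No weights are rebuilt or moved, which is the operational point of keeping revisions small and immutable.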

Figure: MinT overview

As shown in the paper’s overview diagram, when a user specifies a base model, data and reward signals, a LoRA training recipe, and evaluation or serving targets, MinT accepts these through an API, queues the work, and manages everything through policy records and adapter revisions. Infrastructure concerns like scheduling, fault recovery, and cache management are all handled internally by the service.

Three Problems MinT Sets Out to Solve

1. Scale Down

The first axis is minimizing what moves between training and serving, replacing full model checkpoints with small adapter revisions. In the traditional workflow, you’d finish training, materialize a full checkpoint, ship it to the serving cluster, and reload it. MinT instead attaches a compact adapter revision to the serving engine that already holds the base model.

Figure: Handoff time, adapter-only vs. merge-and-load

According to the paper’s measurements, this reduces the training-to-serving handoff time by 18.3× on the Qwen3-4B model and 2.85× on the Qwen3-30B MoE model compared to the merge-based approach.

2. Scale Up

The researchers also validate that this architecture holds up beyond small models — including large dense models and Mixture-of-Experts (MoE) architectures. MoE models are particularly tricky: rather than using all parameters at every step, they route each token through a selected subset of “experts.” If the routing paths differ between rollout and training, the learning signal becomes unstable.

MinT addresses this by recording expert routing decisions with rollout data, and masking out tokens from the training gradient when those routing paths can’t be reliably replayed — keeping the training signal consistent even across sparse-routing models.
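The masking step can be sketched in a few lines. This example makes a simplifying assumption about what "can't be reliably replayed" means: a token is treated as replayable only if its rollout record contains exactly `top_k` expert IDs. The paper's actual criterion may differ, and all function names here are hypothetical.

```python
def replay_mask(recorded_routes, top_k):
    """1.0 for tokens whose recorded routing can be replayed, 0.0 otherwise.
    Assumption: a route is replayable if it lists exactly top_k experts."""
    return [
        1.0 if route is not None and len(route) == top_k else 0.0
        for route in recorded_routes
    ]


def masked_mean_loss(per_token_loss, mask):
    """Average the loss over kept tokens only, so masked tokens add no gradient."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


# Expert IDs recorded per token during rollout; None means the record is missing.
recorded = [(0, 3), (1, 2), None, (2,)]
losses = [0.5, 1.5, 9.0, 9.0]

mask = replay_mask(recorded, top_k=2)
print(mask)                            # [1.0, 1.0, 0.0, 0.0]
print(masked_mean_loss(losses, mask))  # 1.0
```

The two tokens with missing or incomplete routing carry large losses here, but they do not influence the result because they are excluded before averaging.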

3. Scale Out

Finally, the paper examines how to manage large populations of adapter policies in a live service. The key insight here is that “managing many adapters” doesn’t mean “loading all adapters onto GPUs simultaneously.” MinT maintains a large named catalog of adapter policies, but only promotes frequently-used or currently-requested adapters into CPU cache or GPU execution slots.
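A tiered lookup like this can be sketched with two small least-recently-used caches sitting in front of a large catalog. The LRU eviction policy, the class name, and the capacities are assumptions made for illustration; the paper's promotion and eviction rules may be different.

```python
from collections import OrderedDict


class AdapterTiers:
    """Large catalog, small CPU cache, even smaller set of GPU slots."""

    def __init__(self, catalog, cpu_capacity, gpu_slots):
        self.catalog = catalog        # name -> storage location, for every known policy
        self.cpu = OrderedDict()      # name -> weights, least recently used first
        self.gpu = OrderedDict()
        self.cpu_capacity = cpu_capacity
        self.gpu_slots = gpu_slots
        self.storage_fetches = 0

    def acquire(self, name):
        """Make an adapter GPU-resident and report where it was found."""
        if name in self.gpu:
            self.gpu.move_to_end(name)
            return "gpu_hit"
        if name in self.cpu:
            weights, outcome = self.cpu.pop(name), "cpu_hit"
        else:
            if name not in self.catalog:
                raise KeyError(name)
            weights = f"weights-from:{self.catalog[name]}"  # stand-in for a real fetch
            self.storage_fetches += 1
            outcome = "storage_miss"
        self._promote(name, weights)
        return outcome

    def _promote(self, name, weights):
        self.gpu[name] = weights
        if len(self.gpu) > self.gpu_slots:
            evicted, evicted_weights = self.gpu.popitem(last=False)
            self.cpu[evicted] = evicted_weights      # demote to the CPU cache
            if len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)         # drop; it stays in the catalog


catalog = {f"policy-{i}": f"store://adapters/{i}" for i in range(1000)}
tiers = AdapterTiers(catalog, cpu_capacity=2, gpu_slots=2)

outcomes = [tiers.acquire(n) for n in
            ["policy-0", "policy-1", "policy-0", "policy-2", "policy-1"]]
print(outcomes)
# ['storage_miss', 'storage_miss', 'gpu_hit', 'storage_miss', 'cpu_hit']
```

The catalog holds 1,000 names while at most two adapters occupy GPU slots at any moment, and an adapter that falls out of both caches is still addressable by name.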

This introduces a concrete challenge: if a requested adapter is already cached, it can be served immediately; but if it’s new or was evicted, it has to be fetched from storage and registered with the serving engine before inference can begin. For MoE LoRA adapters in particular, this process can involve an enormous number of small tensor fragments. The bottleneck is not the size of the files but the per-object overhead of reading and registering thousands of tiny tensors. The researchers addressed this by packing the adapter’s 37,248 tensor objects down to 672, achieving an 8.5–8.7× speedup in live engine loading.
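One generic way to cut per-object overhead is to concatenate many small tensors into a few larger buffers and keep an offset index for lookups. The sketch below shows that pattern on raw bytes. The grouping key (one pack per layer), the tensor names, and the toy sizes are assumptions for illustration, and the paper's actual packing layout may differ. For scale, going from 37,248 objects to 672 works out to roughly 55 original tensors per packed object.

```python
from typing import Callable, Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # tensor name -> (offset, length)


def pack(tensors: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small tensors into one buffer plus an offset index."""
    index, chunks, offset = {}, [], 0
    for name, data in tensors.items():
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), index


def unpack(blob: bytes, index: Index, name: str) -> bytes:
    """Recover one tensor from a packed buffer by slicing."""
    offset, length = index[name]
    return blob[offset:offset + length]


def pack_by_group(tensors: Dict[str, bytes], group_of: Callable[[str], str]):
    """Produce one packed object per group instead of one object per tensor."""
    groups: Dict[str, Dict[str, bytes]] = {}
    for name, data in tensors.items():
        groups.setdefault(group_of(name), {})[name] = data
    return {group: pack(members) for group, members in groups.items()}


# Toy MoE adapter: 4 layers x 8 experts x (A, B) = 64 tiny tensors.
tensors = {
    f"layer{layer}.expert{expert}.lora_{side}": bytes([layer, expert]) * 4
    for layer in range(4)
    for expert in range(8)
    for side in ("A", "B")
}
packed = pack_by_group(tensors, group_of=lambda name: name.split(".")[0])
print(len(tensors), "objects packed into", len(packed))  # 64 objects packed into 4

blob, index = packed["layer2"]
restored = unpack(blob, index, "layer2.expert5.lora_B")
```

The serving engine then registers 4 objects instead of 64, and individual tensors remain addressable through the index.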

Why This Research Matters

The results are impressive, but they deserve a careful reading. The phrase “Millions of LLMs” in the paper’s title doesn’t mean millions of full models running simultaneously on GPUs. It means maintaining a large named catalog of adapter policies and loading only the ones needed at any given moment. Similarly, the 8.5–8.7× speedup applies specifically to the engine loading step on cache misses, not to end-to-end user-facing latency.

That said, the paper’s contribution is genuinely interesting. MinT reframes LoRA not merely as a memory-efficient fine-tuning trick, but as an infrastructure primitive that spans the full operational lifecycle: training, evaluation, serving, and rollback.

The traditional workflow of improving a model, producing a new checkpoint, and redeploying made sense when model variants were few and static. But as models grow larger and services demand more versions across tenants, tasks, and experiments, that approach breaks down. MinT proposes a different organizing principle: share the base model, and manage changing behaviors as adapters. The system owns the question of where each adapter lives, what state it’s in, and when it’s ready to serve.

In this framing, LoRA stops being a training shortcut and becomes a service-level unit for large-scale post-training infrastructure. As AI services grow more personalized, more frequently updated, and more fragmented across organizations and use cases, adapter-centric operations seem likely to matter more and more. Competitive advantage may come less from having the biggest model than from how efficiently and reliably you can operate the many behavioral variants that grow on top of it.