Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
UC Berkeley

We develop a plug-and-play replacement for diffusion models that can generate samples in a single step.

Abstract

Diffusion models and flow-matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile schedules. We introduce shortcut models, a family of generative models that use a single network and a single training phase to produce high-quality samples in either a single step or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher-quality samples than previous approaches such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase, and they additionally allow varying the step budget at inference time.
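Concretely, because the network is conditioned on the step size d, the same trained model can be sampled with any step budget, including a single step. The JAX sketch below illustrates this interface; the shortcut_model function and its signature are illustrative assumptions, not the released API.

```python
# Minimal JAX sketch of sampling from a shortcut model with an arbitrary
# step budget. `shortcut_model` and its signature are assumed for illustration.
import jax
import jax.numpy as jnp

def sample(shortcut_model, params, key, shape, num_steps):
    """Generate samples with any step budget, including num_steps=1.

    The network is conditioned on both the current time t and the step
    size d, so a single call can jump d ahead along the learned flow.
    """
    x = jax.random.normal(key, shape)  # start from pure noise at t = 0
    d = 1.0 / num_steps                # step size the network is told to take
    for i in range(num_steps):
        t = i * d
        s = shortcut_model(params, x, t, d)  # predicted shortcut direction
        x = x + d * s                        # jump ahead by d
    return x  # approximate data sample at t = 1
```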


Naive diffusion and flow-matching models fail at few-step generation. Left: Training paths are created by randomly pairing data and noise. Note that the paths overlap; given only xt, there is inherent uncertainty about the direction vt to the data point. Right: While a flow-matching model learns a deterministic ODE, its paths are not straight and must be followed closely. The predicted directions vt point towards the average of plausible data points, so the fewer the inference steps, the more the generations are biased towards the dataset mean and drift off track. At the first sampling step, the model points towards the dataset mean and thus cannot generate multi-modal data in a single step (see red circles).
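To make the averaging effect precise: flow-matching trains on interpolated points xt = (1 − t) x0 + t x1 with regression target vt = x1 − x0, and because noise x0 and data x1 are paired at random, many different targets pass through the same xt. A minimal JAX sketch of this target construction, with assumed names and shapes, is below.

```python
# JAX sketch of the flow-matching training target (names and shapes assumed).
import jax
import jax.numpy as jnp

def flow_matching_pair(key, x1):
    """Build (x_t, t, v_t) from a data batch x1 by pairing it with noise.

    Because noise x0 is paired with data x1 at random, many different vt
    pass through the same xt; regressing on vt therefore learns the
    average direction E[vt | xt], which points toward the dataset mean
    when t is small.
    """
    key_x0, key_t = jax.random.split(key)
    x0 = jax.random.normal(key_x0, x1.shape)       # noise endpoint (t = 0)
    t = jax.random.uniform(key_t, (x1.shape[0],))  # one time per example
    t_b = t.reshape((-1,) + (1,) * (x1.ndim - 1))  # broadcast over pixels
    xt = (1.0 - t_b) * x0 + t_b * x1               # point on the straight path
    vt = x1 - x0                                   # direction of that path
    return xt, t, vt
```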


Overview of shortcut model training. At d ≈ 0, the shortcut objective is equivalent to the flow-matching objective and can be trained by regressing onto empirical samples of vt (whose conditional mean is E[vt|xt]). Targets for larger d shortcuts are constructed by concatenating two d/2 shortcuts. Both objectives can be trained jointly; shortcut models require neither a two-stage procedure nor a discretization schedule.
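A minimal sketch of this joint objective is shown below, assuming a network s(params, x, t, d) that predicts the shortcut direction; the equal weighting of the two branches and the batching are illustrative choices, not a statement of the released training code.

```python
# JAX sketch of the joint shortcut objective (network name and weighting assumed).
import jax
import jax.numpy as jnp

def shortcut_loss(s, params, xt, t, vt, d):
    """Train one network on both objectives in a single phase.

    For d = 0 the target is the empirical flow-matching direction vt.
    For larger d, the target is the concatenation of two d/2 shortcuts,
    evaluated under stop_gradient so the model bootstraps from itself.
    """
    # Flow-matching branch: the smallest shortcut regresses onto vt.
    fm_loss = jnp.mean((s(params, xt, t, 0.0) - vt) ** 2)

    # Self-consistency branch: one d-sized shortcut should match two d/2 steps.
    s1 = s(params, xt, t, d / 2)          # first half-shortcut
    x_mid = xt + (d / 2) * s1             # intermediate point after the jump
    s2 = s(params, x_mid, t + d / 2, d / 2)  # second half-shortcut
    target = jax.lax.stop_gradient((s1 + s2) / 2.0)
    sc_loss = jnp.mean((s(params, xt, t, d) - target) ** 2)

    return fm_loss + sc_loss
```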


Behavior of flow-matching and shortcut models over decreasing numbers of denoising steps. While naive flow-matching models suffer degradation and mode collapse, shortcut models maintain a similar sample distribution at few-step and one-step generation. This capability comes at no expense to generation quality under large inference budgets.


One-step generation quality continues to improve as model parameter count increases. While generative models tend to display continual improvement with model scale, bootstrap-based methods such as Q-learning have been shown to lose this property. We show that shortcut models, despite being a bootstrap-based method, retain the ability to scale accuracy with model size.


Shortcut models can represent multimodal policies in a similar manner to diffusion policies, while reducing the number of denoising steps to one. Shown on the left are trajectories from the Push-T (top) and Transport (bottom) tasks. In each case, a generative model is trained on human demonstrations and queried to produce actions given a set of past observations.

Code and model checkpoints are available on GitHub.