SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling

Yixian Zhang1*, Shu'ang Yu1,5*, Tonghe Zhang2, Mo Guang3, Haojia Hui3, Kaiwen Long3, Yu Wang1, Chao Yu1,4†, Wenbo Ding1†
1Tsinghua University 2Carnegie Mellon University 3Li Auto
4Zhongguancun Academy 5Shanghai AI Laboratory
*Equal contribution †Corresponding authors
SAC Flow Overview

Overview of SAC Flow. The multi-step sampling process of flow-based policies frequently causes exploding gradients during off-policy RL updates. Our key insight is to treat the flow-based policy as a sequential model, for which we first demonstrate an algebraic equivalence to an RNN. We then reparameterize the flow's velocity network using modern sequential architectures (e.g., GRU, Transformer).

Abstract

Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.

Method

Flow-based policies have shown strong potential on challenging continuous-control tasks, including robot manipulation, due to their ability to represent rich, multimodal action distributions. However, training these policies with off-policy reinforcement learning is notoriously unstable.

Our key insight is to treat the flow-based policy as a sequential model. We show that the Euler integration used to generate actions in the flow-based policy is algebraically identical to the recurrent computation of a residual RNN, shown in (a). This observation explains the instability observed with off-policy training: the same vanishing or exploding gradients known to affect RNNs also afflict the flow rollout.
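
Concretely (notation here is illustrative: $s$ the state, $a_k$ the partial action after $k$ of $K$ Euler steps, $v_\theta$ the velocity network; the paper's exact symbols may differ), the rollout reads

$$
a_{k+1} = a_k + \Delta t \, v_\theta(a_k, t_k, s), \qquad \Delta t = \tfrac{1}{K}, \qquad a_0 \sim \mathcal{N}(0, I),
$$

which is exactly a residual RNN update $h_{k+1} = h_k + f_\theta(h_k, x_k)$ with hidden state $h_k = a_k$ and input $x_k = (t_k, s)$. Backpropagating the critic's gradient through the rollout therefore multiplies the $K$ step Jacobians $\prod_k \left(I + \Delta t \, \partial v_\theta / \partial a_k\right)$, the same product that makes RNN gradients vanish or explode.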

Flow-based Policy Architectures

Velocity network parameterizations. (a) Standard flow as a residual RNN; (b) Flow-G with GRU-style gating; (c) Flow-T with a Transformer decoder.

Introducing a gate network yields Flow-G in (b) and improves gradient stability. Replacing the velocity with the normalized residual block in (c) yields Flow-T. This architecture provides well-conditioned depth and, crucially, aggregates context with the well-established Transformer decoder architecture. These parameterizations serve as drop-in replacements for the velocity network in a flow-based policy without altering the surrounding algorithm. As a result, they enable direct and stable off-policy training with methods such as SAC, remove the need for auxiliary distillation actors and surrogate objectives, and keep the flow rollout efficient at test time.
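
As an illustrative sketch of what a gated drop-in velocity can look like in PyTorch (the module name GatedVelocity, the layer sizes, and the exact gating form are our assumptions for illustration, not the released implementation):

import torch
import torch.nn as nn


class GatedVelocity(nn.Module):
    """GRU-style gated velocity (a Flow-G-like sketch): a sigmoid gate
    scales the proposed update, keeping the per-step Jacobian well-conditioned."""

    def __init__(self, action_dim, obs_dim, hidden_dim=256):
        super().__init__()
        self.action_dim = action_dim
        in_dim = action_dim + obs_dim + 1  # partial action, observation, scalar time
        self.gate = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Sigmoid(),
        )
        self.cand = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )

    def forward(self, a, obs, t):
        x = torch.cat([a, obs, t], dim=-1)
        g = self.gate(x)                  # gate in (0, 1)
        return g * (self.cand(x) - a)     # GRU-like update direction used as the velocity


def euler_rollout(vel, obs, K=8):
    """K-step Euler integration; `vel` is any drop-in velocity network."""
    a = torch.randn(obs.shape[0], vel.action_dim)      # a_0 ~ N(0, I)
    for k in range(K):
        t = torch.full((obs.shape[0], 1), k / K)
        a = a + (1.0 / K) * vel(a, obs, t)
    return a

A Flow-T-style velocity would keep the same forward(a, obs, t) interface but decode the velocity with a small Transformer, as in panel (c).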

Experimental Results

Experimental Domains

Benchmark environments: MuJoCo continuous-control tasks, OGBench, and Robomimic robotic manipulation tasks.

From-Scratch Training Performance

We evaluate our method on MuJoCo continuous-control tasks for from-scratch learning. Our SAC Flow-G and SAC Flow-T consistently outperform all baselines across all tasks, with markedly better sample efficiency and more stable convergence.

From-scratch training results

From-scratch training performance on MuJoCo tasks. Our methods achieve state-of-the-art performance across all environments. In the sparse-reward environment, however, every method trained from scratch fails, which is why offline-to-online training is essential there.

Offline-to-Online Training Performance

For challenging sparse-reward tasks, we evaluate on OGBench and Robomimic benchmarks using offline-to-online training. Our methods achieve rapid convergence and state-of-the-art success rates, particularly excelling in complex manipulation tasks.

Offline-to-online training results

Offline-to-online performance aggregated across OGBench and Robomimic tasks. Each curve shows mean success rate across multiple task instances.

Ablation Studies

Gradient norm ablation

Gradient Stability. Our methods substantially reduce exploding gradients compared to the naive SAC Flow baseline.

Key Contributions

🧠 Sequential Model Perspective

We formalize the K-step flow rollout as a residual RNN computation, providing a theoretical explanation for its gradient pathologies and enabling reparameterization of the velocity network with modern sequential architectures.

⚡ Practical SAC Framework

We develop SAC Flow, a robust off-policy algorithm with a noise-augmented rollout for tractable likelihood computation, supporting both from-scratch and offline-to-online training.
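
For intuition, here is a minimal sketch of one way a noise-augmented rollout makes the likelihood tractable: Gaussian noise is injected at every Euler step, so each transition is a conditional Gaussian and the path log-density can be accumulated in closed form. The function name, the noise scale sigma, and the inclusion of the base-distribution term are illustrative assumptions; vel is any velocity network with the forward(a, obs, t) interface sketched above.

import torch


def noisy_rollout_with_logprob(vel, obs, action_dim, K=8, sigma=0.1):
    """Perturb each Euler step with Gaussian noise so every transition is a
    conditional Gaussian, then accumulate the path log-density in closed form."""
    B = obs.shape[0]
    a = torch.randn(B, action_dim)                                   # a_0 ~ N(0, I)
    logp = torch.distributions.Normal(0.0, 1.0).log_prob(a).sum(-1)  # base-density term
    dt = 1.0 / K
    for k in range(K):
        t = torch.full((B, 1), k * dt)
        mean = a + dt * vel(a, obs, t)                               # deterministic Euler mean
        scale = sigma * dt ** 0.5
        a = mean + scale * torch.randn_like(a)                       # noise-augmented step
        logp = logp + torch.distributions.Normal(mean, scale).log_prob(a).sum(-1)
    return a, logp   # logp can stand in for the log-likelihood term in SAC-style updates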

🏆 State-of-the-Art Performance

Our methods achieve superior sample efficiency and performance across MuJoCo, OGBench, and Robomimic benchmarks, significantly outperforming recent flow- and diffusion-based baselines.

BibTeX

@article{sacflow,
      title={SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling}, 
      author={Yixian Zhang and Shu'ang Yu and Tonghe Zhang and Mo Guang and Haojia Hui and Kaiwen Long and Yu Wang and Chao Yu and Wenbo Ding},
      year={2025},
      eprint={2509.25756},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2509.25756}, 
}