Abstract
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
Method
Flow-based policies have shown strong potential on challenging continuous-control tasks, including robot manipulation, due to their ability to represent rich, multimodal action distributions. However, training these policies with off-policy reinforcement learning is notoriously unstable.
Our key insight is to treat the flow-based policy as a sequential model. We show that the Euler integration used to generate actions is algebraically identical to the recurrent computation of a residual RNN, shown in (a). This equivalence explains the instability of off-policy training: the same vanishing and exploding gradients that plague RNN training also afflict the flow rollout.
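For concreteness, here is a minimal PyTorch sketch of a K-step Euler rollout of a flow policy, written to make the residual-RNN structure explicit. The class and function names (VelocityMLP, euler_rollout), the hidden sizes, and the step count are our illustrative assumptions, not the paper's released code.

import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """Illustrative velocity network v_theta(a, s, t); not the paper's exact architecture."""
    def __init__(self, action_dim: int, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a, s, t):
        return self.net(torch.cat([a, s, t], dim=-1))

def euler_rollout(velocity_net, obs, action_dim, K=10):
    """K-step Euler integration of a flow policy.

    Each step a_{k+1} = a_k + (1/K) * v(a_k, s, t_k) has exactly the form of a
    residual RNN update h_{k+1} = h_k + f(h_k, x), so backpropagating through
    the rollout inherits the vanishing/exploding gradients familiar from RNNs.
    """
    batch = obs.shape[0]
    a = torch.randn(batch, action_dim)        # a_0 ~ N(0, I): the flow's source noise
    dt = 1.0 / K
    for k in range(K):
        t = torch.full((batch, 1), k * dt)
        a = a + dt * velocity_net(a, obs, t)  # residual (RNN-like) update
    return a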
Velocity network parameterizations. (a) Standard flow as RNN; (b) Flow-G with GRU-style gating; (c) Flow-T with a Transformer decoder.
Introducing a gate network leads to Flow-G in (b) and improves gradient stability. Replacing the velocity with the normalized residual block in (c) yields Flow-T; this architecture provides well-conditioned depth and, crucially, aggregates context with the well-established Transformer decoder. Both parameterizations serve as drop-in replacements for the velocity network in the flow policy without altering the surrounding algorithm. As a result, they enable direct and stable off-policy training with methods such as SAC, remove the need for auxiliary distillation actors and surrogate objectives, and keep the flow rollout efficient at test time.
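To make the gating concrete, the sketch below shows one plausible GRU-style gated update in PyTorch. The class and layer names, the tanh-bounded candidate, and the exact gating form are our illustrative assumptions rather than the released Flow-G implementation; Flow-T would analogously replace this cell with a Transformer-decoder block that attends over the integration context.

import torch
import torch.nn as nn

class GatedVelocity(nn.Module):
    """GRU-style gated update in the spirit of Flow-G (a sketch; the paper's exact
    parameterization may differ). The update gate z keeps every Euler step a convex
    interpolation between the current action and a bounded candidate, which keeps
    the per-step Jacobian well conditioned and tames gradient growth through the rollout."""

    def __init__(self, action_dim: int, obs_dim: int, hidden: int = 256):
        super().__init__()
        in_dim = action_dim + obs_dim + 1
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, action_dim))
        self.cand = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, action_dim))

    def forward(self, a, s, t, dt):
        x = torch.cat([a, s, t], dim=-1)
        z = torch.sigmoid(self.gate(x))      # update gate, elementwise in (0, 1)
        a_tilde = torch.tanh(self.cand(x))   # bounded candidate action
        # Equivalent to a velocity v = z * (a_tilde - a), i.e. a gated residual step.
        return (1.0 - dt * z) * a + dt * z * a_tilde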
Experimental Results
Experimental Domain
From-Scratch Training Performance
We evaluate our method on MuJoCo continuous-control tasks for from-scratch learning. Our SAC Flow-G and SAC Flow-T consistently outperform all baselines across all tasks, with markedly better sample efficiency and more stable convergence.
From-scratch training performance on MuJoCo tasks. Our methods achieve state-of-the-art performance across all environments. In the sparse-reward environment, however, all methods fail when trained from scratch, which is where offline-to-online training becomes essential.
Offline-to-Online Training Performance
For challenging sparse-reward tasks, we evaluate on the OGBench and Robomimic benchmarks using offline-to-online training. Our methods achieve rapid convergence and state-of-the-art success rates, particularly on complex manipulation tasks.
Offline-to-online performance aggregated across OGBench and Robomimic tasks. Each curve shows mean success rate across multiple task instances.
Ablation Studies
Gradient Stability. Our methods substantially reduce gradient explosion compared to naive SAC Flow.
Key Contributions
🧠 Sequential Model Perspective
We formalize the K-step flow rollout as a residual RNN computation, providing a theoretical explanation for its gradient pathologies and enabling reparameterization with modern sequential architectures.
⚡ Practical SAC Framework
We develop SAC Flow, a robust off-policy algorithm with a noise-augmented rollout for tractable likelihood computation, supporting both from-scratch and offline-to-online training; a minimal illustrative sketch of such a rollout follows the contributions below.
🏆 State-of-the-Art Performance
Our methods achieve superior sample efficiency and performance across MuJoCo, OGBench, and Robomimic benchmarks, significantly outperforming recent flow- and diffusion-based baselines.
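As referenced above, here is a minimal sketch of what a noise-augmented rollout with a tractable log-likelihood can look like, assuming Gaussian noise of fixed scale sigma injected at each Euler step and reusing the illustrative velocity-network interface from the Method section. The function name, the noise schedule, and the use of the path density as the policy log-likelihood are our assumptions, not the paper's exact formulation.

import torch

def noisy_rollout_with_logprob(velocity_net, obs, action_dim, K=10, sigma=0.1):
    """Noise-augmented flow rollout (illustrative sketch, not the paper's exact scheme).

    Injecting Gaussian noise at every Euler step makes each transition
    a_{k+1} | a_k ~ N(a_k + dt * v(a_k, s, t_k), sigma^2 I), so the log-density of
    the sampled path is a sum of per-step Gaussian terms plus the density of
    a_0 ~ N(0, I). That tractable quantity can stand in for the policy
    log-likelihood needed by a SAC-style entropy term, without a distillation
    actor or surrogate objective.
    """
    batch = obs.shape[0]
    dt = 1.0 / K
    base = torch.distributions.Normal(torch.zeros(action_dim), torch.ones(action_dim))
    a = base.rsample((batch,))
    log_prob = base.log_prob(a).sum(-1)
    for k in range(K):
        t = torch.full((batch, 1), k * dt)
        mean = a + dt * velocity_net(a, obs, t)
        step = torch.distributions.Normal(mean, sigma)
        a = step.rsample()                            # reparameterized: gradients flow end to end
        log_prob = log_prob + step.log_prob(a).sum(-1)
    return a, log_prob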
BibTeX
@article{sacflow,
title={SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling},
author={Yixian Zhang and Shu'ang Yu and Tonghe Zhang and Mo Guang and Haojia Hui and Kaiwen Long and Yu Wang and Chao Yu and Wenbo Ding},
year={2025},
eprint={2509.25756},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2509.25756},
}