FreeViS: Training-free Video Stylization with Inconsistent References

¹Johns Hopkins University   ²Adobe Research
FreeViS leverages multiple stylized references to achieve high-quality video stylization without any training.

Methodology

FreeViS resolves the propagation errors observed in previous methods that rely on a single reference input.

(1) Indirect High-frequency Compensation: The inverted noise alone is insufficient to faithfully reconstruct the original video content, often resulting in the loss of structural details in stylized outputs. To address this, we extract high-frequency components from the reconstruction residuals of PnP inversion to enhance and constrain the spatial layout and motion trajectories, ensuring better content preservation.
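
A minimal PyTorch sketch of this idea, assuming frame tensors of shape (B, C, H, W); the Gaussian low-pass filter, the injection strength, and all function names are illustrative assumptions rather than the paper's exact implementation:

import torch
import torch.nn.functional as F

def _gaussian_blur(x, ksize=5, sigma=1.0):
    """Separable Gaussian low-pass filter for (B, C, H, W) tensors."""
    coords = torch.arange(ksize, dtype=x.dtype, device=x.device) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    c = x.shape[1]
    x = F.conv2d(x, g.view(1, 1, 1, -1).repeat(c, 1, 1, 1),
                 padding=(0, ksize // 2), groups=c)      # horizontal pass
    x = F.conv2d(x, g.view(1, 1, -1, 1).repeat(c, 1, 1, 1),
                 padding=(ksize // 2, 0), groups=c)      # vertical pass
    return x

def high_frequency_compensation(stylized, reconstructed, original, strength=0.5):
    """Inject the high-frequency band of the reconstruction residual.

    residual = original - reconstructed   (detail lost by inversion alone)
    Only the high band of the residual is added back, so structural detail
    and motion cues are restored while the global style is left intact.
    """
    residual = original - reconstructed
    high = residual - _gaussian_blur(residual)   # high-pass = identity - low-pass
    return stylized + strength * high

Keeping only the high-frequency band is the point of the "indirect" compensation: the low-frequency part of the residual carries the original colors, which would fight the target style if injected directly.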

(2) Additional Inconsistent References: To alleviate propagation errors, we incorporate multiple stylized reference frames into the base model. However, the base I2V model supports only a single reference input, and directly concatenating additional reference frames with the noise latents often introduces flickering and stuttering artifacts. To mitigate these issues, we propose an isolated attention mechanism combined with a flow-based masking strategy to handle stylization inconsistencies across references. Furthermore, an appearance–dynamic decomposition module re-injects dynamic residuals into the static reference latents, improving temporal coherence and style stability.
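
One plausible rendering of the isolated attention with flow-based masking, as a single-head PyTorch sketch: frame queries attend to their own tokens plus the reference tokens, while reference tokens that the flow-based mask marks as unreliable receive a -inf bias so their inconsistencies cannot leak into the frame. The tensor layout, the source of the validity mask, and all names here are assumptions, not the paper's implementation:

import torch

def isolated_reference_attention(q, k_self, v_self, k_refs, v_refs,
                                 ref_valid, scale=None):
    """Single-head attention sketch with isolated, flow-masked references.

    q, k_self, v_self: (B, L, D)    tokens of the current frame
    k_refs, v_refs:    (B, R, L, D) tokens of R stylized references
    ref_valid:         (B, R, L)    True where flow marks the token reliable
    """
    B, R, L, D = k_refs.shape
    k = torch.cat([k_self, k_refs.reshape(B, R * L, D)], dim=1)
    v = torch.cat([v_self, v_refs.reshape(B, R * L, D)], dim=1)
    # Self tokens are always visible; reference tokens only where flow-valid.
    visible = torch.cat([torch.ones(B, L, dtype=torch.bool, device=q.device),
                         ref_valid.reshape(B, R * L)], dim=1)
    bias = torch.zeros(B, 1, k.shape[1], device=q.device, dtype=q.dtype)
    bias = bias.masked_fill(~visible.unsqueeze(1), float("-inf"))
    attn = (q @ k.transpose(1, 2)) * (scale or D ** -0.5) + bias
    return attn.softmax(dim=-1) @ v

In this sketch the references supply keys and values but never issue queries to one another, so each reference stays isolated while the video tokens can still pull style from all of them.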

Our framework builds upon a pretrained I2V diffusion model, which serves as the backbone. Given an arbitrary style image, stylized references are generated by applying an image style transfer model to several selected content video frames. We leverage inversion to recover the denoising trajectory and initial noise. The pipeline comprises two branches: reconstruction and stylization.
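
For concreteness, a minimal DDIM-inversion sketch showing how the initial noise and denoising trajectory can be recovered, assuming a diffusers-style unet/scheduler API; FreeViS-specific details such as PnP feature caching and the video-latent layout are omitted:

import torch

@torch.no_grad()
def ddim_invert(latents, unet, scheduler, cond, num_steps=50):
    """Walk the DDIM update backwards, from clean latents to pure noise,
    recording the latent at every step so the reconstruction branch can
    follow the same trajectory."""
    scheduler.set_timesteps(num_steps)
    step = scheduler.config.num_train_timesteps // num_steps
    trajectory = [latents]
    for t in scheduler.timesteps.flip(0):            # low noise -> high noise
        eps = unet(latents, t, encoder_hidden_states=cond).sample
        a_t = scheduler.alphas_cumprod[t]
        t_next = min(int(t) + step, scheduler.config.num_train_timesteps - 1)
        a_next = scheduler.alphas_cumprod[t_next]
        x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        latents = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # re-noise one step
        trajectory.append(latents)
    return latents, trajectory   # initial noise and full trajectory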

(3) Explicit Optical Flow Guidance: In plain regions with few salient features, style textures may vanish under large camera motion. To overcome this, we leverage optical flow to constrain the diffused attention areas in associated frames (especially temporally distant ones), so that each location keeps attending to its true correspondences.
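
A sketch of how such a flow constraint could be realized as an attention mask over a token grid; the rounding and the window radius are illustrative choices, not the paper's exact scheme:

import torch

def flow_guided_attention_mask(flow, h, w, radius=2):
    """Build a (h*w, h*w) boolean mask restricting each query token in the
    current frame to key tokens near its flow correspondence in a distant
    frame. `flow` is (2, h, w) in token units (dx, dy), current -> distant.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Flow-displaced target coordinates, clamped to the token grid.
    tx = (xs + flow[0]).round().clamp(0, w - 1)
    ty = (ys + flow[1]).round().clamp(0, h - 1)
    # Pairwise distance between every key position and each query's target.
    ky, kx = ys.flatten().float(), xs.flatten().float()      # (h*w,)
    dy = (ky[None, :] - ty.flatten()[:, None]).abs()          # (hw_q, hw_k)
    dx = (kx[None, :] - tx.flatten()[:, None]).abs()
    return (dx <= radius) & (dy <= radius)                    # True = may attend

The resulting mask can be turned into an additive bias (0 where True, -inf where False) on the cross-frame attention logits, which keeps texture queries in featureless regions anchored to their flow-tracked locations instead of diffusing across the frame.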

Citation

If you find our work useful for your research, please consider citing our paper:

@misc{xu2025freevistrainingfreevideostylization,
      title={FreeViS: Training-free Video Stylization with Inconsistent References}, 
      author={Jiacong Xu and Yiqun Mei and Ke Zhang and Vishal M. Patel},
      year={2025},
      eprint={2510.01686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.01686}, 
}

This website template is adapted from EditVerse.