FreeViS: Training-free Video Stylization with Inconsistent References

¹Johns Hopkins University   ²Adobe Research
FreeViS leverages multiple stylized references to achieve high-quality video stylization without any training.

Methodology

FreeViS resolves the propagation errors observed in previous methods that rely on a single reference input.

(1) Indirect High-frequency Compensation: The inverted noise alone is insufficient to faithfully reconstruct the original video content, often resulting in the loss of structural details in stylized outputs. To address this, we extract high-frequency components from the reconstruction residuals of PnP inversion to enhance and constrain the spatial layout and motion trajectories, ensuring better content preservation.
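
A minimal PyTorch sketch of this idea, assuming frame tensors of shape (B, C, H, W); the Gaussian low-pass filter, the injection strength, and all function names are illustrative assumptions rather than the paper's exact implementation:

import torch
import torch.nn.functional as F

def _gaussian_blur(x, ksize=5, sigma=1.0):
    """Separable Gaussian low-pass filter for (B, C, H, W) tensors."""
    coords = torch.arange(ksize, dtype=x.dtype, device=x.device) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    c = x.shape[1]
    x = F.conv2d(x, g.view(1, 1, 1, -1).repeat(c, 1, 1, 1),
                 padding=(0, ksize // 2), groups=c)      # horizontal pass
    x = F.conv2d(x, g.view(1, 1, -1, 1).repeat(c, 1, 1, 1),
                 padding=(ksize // 2, 0), groups=c)      # vertical pass
    return x

def high_frequency_compensation(stylized, reconstructed, original, strength=0.5):
    """Inject the high-frequency band of the reconstruction residual.

    residual = original - reconstructed   (detail lost by inversion alone)
    Only the high band of the residual is added back, so structural detail
    and motion cues are restored while the global style is left intact.
    """
    residual = original - reconstructed
    high = residual - _gaussian_blur(residual)   # high-pass = identity - low-pass
    return stylized + strength * high

Keeping only the high-frequency band is the point of the "indirect" compensation: the low-frequency part of the residual carries the original colors, which would fight the target style if injected directly.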

(2) Additional Inconsistent References: To alleviate propagation errors, we incorporate multiple stylized reference frames into the base model. However, the base I2V model supports only a single reference input, and directly concatenating additional reference frames with the noise latents often introduces flickering and stuttering artifacts. To mitigate these issues, we propose an isolated attention mechanism combined with a flow-based masking strategy to handle stylization inconsistencies across references. Furthermore, an appearance–dynamic decomposition module re-injects dynamic residuals into the static reference latents, improving temporal coherence and style stability.
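
One plausible rendering of the isolated attention with flow-based masking, as a single-head PyTorch sketch: frame queries attend to their own tokens plus the reference tokens, while reference tokens that the flow-based mask marks as unreliable receive a -inf bias so their inconsistencies cannot leak into the frame. The tensor layout, the source of the validity mask, and all names here are assumptions, not the paper's implementation:

import torch

def isolated_reference_attention(q, k_self, v_self, k_refs, v_refs,
                                 ref_valid, scale=None):
    """Single-head attention sketch with isolated, flow-masked references.

    q, k_self, v_self: (B, L, D)    tokens of the current frame
    k_refs, v_refs:    (B, R, L, D) tokens of R stylized references
    ref_valid:         (B, R, L)    True where flow marks the token reliable
    """
    B, R, L, D = k_refs.shape
    k = torch.cat([k_self, k_refs.reshape(B, R * L, D)], dim=1)
    v = torch.cat([v_self, v_refs.reshape(B, R * L, D)], dim=1)
    # Self tokens are always visible; reference tokens only where flow-valid.
    visible = torch.cat([torch.ones(B, L, dtype=torch.bool, device=q.device),
                         ref_valid.reshape(B, R * L)], dim=1)
    bias = torch.zeros(B, 1, k.shape[1], device=q.device, dtype=q.dtype)
    bias = bias.masked_fill(~visible.unsqueeze(1), float("-inf"))
    attn = (q @ k.transpose(1, 2)) * (scale or D ** -0.5) + bias
    return attn.softmax(dim=-1) @ v

In this sketch the references supply keys and values but never issue queries to one another, so each reference stays isolated while the video tokens can still pull style from all of them.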

Our framework builds upon a pretrained I2V diffusion model, which serves as the backbone. Given an arbitrary style image, stylized references are generated by applying an image style transfer model to several selected content video frames. We leverage inversion to recover the denoising trajectory and initial noise. The pipeline comprises two branches: reconstruction and stylization.
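
For concreteness, a minimal DDIM-inversion sketch showing how the initial noise and denoising trajectory can be recovered, assuming a diffusers-style unet/scheduler API; FreeViS-specific details such as PnP feature caching and the video-latent layout are omitted:

import torch

@torch.no_grad()
def ddim_invert(latents, unet, scheduler, cond, num_steps=50):
    """Walk the DDIM update backwards, from clean latents to pure noise,
    recording the latent at every step so the reconstruction branch can
    follow the same trajectory."""
    scheduler.set_timesteps(num_steps)
    step = scheduler.config.num_train_timesteps // num_steps
    trajectory = [latents]
    for t in scheduler.timesteps.flip(0):            # low noise -> high noise
        eps = unet(latents, t, encoder_hidden_states=cond).sample
        a_t = scheduler.alphas_cumprod[t]
        t_next = min(int(t) + step, scheduler.config.num_train_timesteps - 1)
        a_next = scheduler.alphas_cumprod[t_next]
        x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        latents = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # re-noise one step
        trajectory.append(latents)
    return latents, trajectory   # initial noise and full trajectory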

(3) Explicit Optical Flow Guidance: In plain regions with few salient features, style textures may vanish under large camera motion. To overcome this, we leverage optical flow to constrain the diffused attention areas in associated frames (especially temporally distant ones), so that each location keeps attending to its true correspondences.
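
A sketch of how such a flow constraint could be realized as an attention mask over a token grid; the rounding and the window radius are illustrative choices, not the paper's exact scheme:

import torch

def flow_guided_attention_mask(flow, h, w, radius=2):
    """Build a (h*w, h*w) boolean mask restricting each query token in the
    current frame to key tokens near its flow correspondence in a distant
    frame. `flow` is (2, h, w) in token units (dx, dy), current -> distant.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Flow-displaced target coordinates, clamped to the token grid.
    tx = (xs + flow[0]).round().clamp(0, w - 1)
    ty = (ys + flow[1]).round().clamp(0, h - 1)
    # Pairwise distance between every key position and each query's target.
    ky, kx = ys.flatten().float(), xs.flatten().float()      # (h*w,)
    dy = (ky[None, :] - ty.flatten()[:, None]).abs()          # (hw_q, hw_k)
    dx = (kx[None, :] - tx.flatten()[:, None]).abs()
    return (dx <= radius) & (dy <= radius)                    # True = may attend

The resulting mask can be turned into an additive bias (0 where True, -inf where False) on the cross-frame attention logits, which keeps texture queries in featureless regions anchored to their flow-tracked locations instead of diffusing across the frame.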

Citation

If you find our work useful for your research, please consider citing our paper:

@misc{xu2025freevistrainingfreevideostylization,
      title={FreeViS: Training-free Video Stylization with Inconsistent References}, 
      author={Jiacong Xu and Yiqun Mei and Ke Zhang and Vishal M. Patel},
      year={2025},
      eprint={2510.01686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.01686}, 
}

This website template is adapted from EditVerse.