Stability-AI · voletiv · May 20, 2025 · Apr 3, 2025 · Apr 3, 2025 · Apr 4, 2025
diff --git a/.gitignore b/.gitignore
@@ -12,4 +12,5 @@
 /outputs
 /build
 /src
-/.vscode
+/.vscode
+**/__pycache__/
diff --git a/README.md b/README.md
@@ -5,6 +5,46 @@
 ## News
 
 
+**April 4, 2025**
+- We are releasing **[Stable Video 4D 2.0 (SV4D 2.0)](https://huggingface.co/stabilityai/sv4d2.0)**, an enhanced video-to-4D diffusion model for high-fidelity novel-view video synthesis and 4D asset generation. For research purposes:
+    - **SV4D 2.0** was trained to generate 48 frames (12 video frames x 4 camera views) at 576x576 resolution, given a 12-frame input video of the same size, ideally consisting of white-background images of a moving object.
+    - Compared to our previous 4D model [SV4D](https://huggingface.co/stabilityai/sv4d), **SV4D 2.0** can generate videos with higher fidelity, sharper details during motion, and better spatio-temporal consistency. It also generalizes much better to real-world videos. Moreover, it does not rely on reference multi-views of the first frame generated by SV3D, making it more robust to self-occlusions.
+    - To generate longer novel-view videos, we autoregressively generate 12 frames at a time and use the previous generation as conditioning views for the remaining frames.
+    - Please check our [project page](https://sv4d20.github.io), [arXiv paper](https://arxiv.org/pdf/2503.16396) and [video summary](https://www.youtube.com/watch?v=dtqj-s50ynU) for more details.
+
+**QUICKSTART** :
+- `python scripts/sampling/simple_video_sample_4d2.py --input_path assets/sv4d_videos/camel.gif --output_folder outputs` (after downloading [sv4d2.safetensors](https://huggingface.co/stabilityai/sv4d2.0) from HuggingFace into `checkpoints/`)
+
+To run **SV4D 2.0** on a single input video of 21 frames:
+- Download the SV4D 2.0 model (`sv4d2.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d2.0) to `checkpoints/`: `huggingface-cli download stabilityai/sv4d2.0 sv4d2.safetensors --local-dir checkpoints`
+- Run inference: `python scripts/sampling/simple_video_sample_4d2.py --input_path <path/to/video>`
+    - `input_path` : The input video `<path/to/video>` can be
+      - a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/camel.gif`, or
+      - a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or
+      - a file name pattern matching images of video frames.
+    - `num_steps` : default is 50; decrease it to shorten sampling time.
+    - `elevations_deg` : specified elevations (relative to input view), default is 0.0 (same as input view).
+    - **Background removal** : For input videos with a plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove the background and crop video frames by setting `--remove_bg=True`. To obtain higher quality outputs on real-world input videos with a noisy background, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D.
+    - **Low VRAM environment** : To run on GPUs with low VRAM, try setting `--encoding_t=1` (number of frames encoded at a time) and `--decoding_t=1` (number of frames decoded at a time), or lower the video resolution, e.g. `--img_size=512`.
+
+Notes:
+- We also train an 8-view model that generates 5 frames x 8 views at a time (same as SV4D).
+  - Download the model from Hugging Face: `huggingface-cli download stabilityai/sv4d2.0 sv4d2_8views.safetensors --local-dir checkpoints`
+  - Run inference: `python scripts/sampling/simple_video_sample_4d2.py --model_path checkpoints/sv4d2_8views.safetensors --input_path assets/sv4d_videos/chest.gif --output_folder outputs`
+  - The 5x8 model takes 5 frames of input at a time. However, the inference script for both models takes a 21-frame video as input by default (same as SV3D and SV4D), so we run the model autoregressively until we generate 21 frames.
+- Install dependencies before running:
+```
+python3.10 -m venv .generativemodels
+source .generativemodels/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 # check CUDA version
+pip3 install -r requirements/pt2.txt
+pip3 install .
+pip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata
+```
+
+  ![tile](assets/sv4d2.gif)
+
+
 **July 24, 2024**
 - We are releasing **[Stable Video 4D (SV4D)](https://huggingface.co/stabilityai/sv4d)**, a video-to-4D diffusion model for novel-view video synthesis. For research purposes:
     - **SV4D** was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.
@@ -164,6 +204,7 @@ This is assuming you have navigated to the `generative-models` root after clonin
 # install required packages from pypi
 python3 -m venv .pt2
 source .pt2/bin/activate
+pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 pip3 install -r requirements/pt2.txt
 ```
 

diff --git a/assets/sv4d2.gif b/assets/sv4d2.gif
diff --git a/assets/sv4d_videos/bear.gif b/assets/sv4d_videos/bear.gif
diff --git a/assets/sv4d_videos/bee.gif b/assets/sv4d_videos/bee.gif
diff --git a/assets/sv4d_videos/bmx-bumps.gif b/assets/sv4d_videos/bmx-bumps.gif
diff --git a/assets/sv4d_videos/bunnyman.mp4 b/assets/sv4d_videos/bunnyman.mp4
diff --git a/assets/sv4d_videos/camel.gif b/assets/sv4d_videos/camel.gif
diff --git a/assets/sv4d_videos/chameleon.gif b/assets/sv4d_videos/chameleon.gif
diff --git a/assets/sv4d_videos/chest.gif b/assets/sv4d_videos/chest.gif
diff --git a/assets/sv4d_videos/cows.gif b/assets/sv4d_videos/cows.gif
diff --git a/assets/sv4d_videos/dance-twirl.gif b/assets/sv4d_videos/dance-twirl.gif
diff --git a/assets/sv4d_videos/dolphin.mp4 b/assets/sv4d_videos/dolphin.mp4
diff --git a/assets/sv4d_videos/flag.gif b/assets/sv4d_videos/flag.gif
diff --git a/assets/sv4d_videos/gear.gif b/assets/sv4d_videos/gear.gif
diff --git a/assets/sv4d_videos/green_robot.mp4 b/assets/sv4d_videos/green_robot.mp4
diff --git a/assets/sv4d_videos/guppie_v0.mp4 b/assets/sv4d_videos/guppie_v0.mp4
diff --git a/assets/sv4d_videos/hike.gif b/assets/sv4d_videos/hike.gif
diff --git a/assets/sv4d_videos/hiphop_parrot.mp4 b/assets/sv4d_videos/hiphop_parrot.mp4
diff --git a/assets/sv4d_videos/horsejump-low.gif b/assets/sv4d_videos/horsejump-low.gif
diff --git a/assets/sv4d_videos/human5.mp4 b/assets/sv4d_videos/human5.mp4
diff --git a/assets/sv4d_videos/human7.mp4 b/assets/sv4d_videos/human7.mp4
diff --git a/assets/sv4d_videos/lucia_v000.mp4 b/assets/sv4d_videos/lucia_v000.mp4
diff --git a/assets/sv4d_videos/monkey.mp4 b/assets/sv4d_videos/monkey.mp4
diff --git a/assets/sv4d_videos/pistol_v0.mp4 b/assets/sv4d_videos/pistol_v0.mp4
diff --git a/assets/sv4d_videos/robot.gif b/assets/sv4d_videos/robot.gif
diff --git a/assets/sv4d_videos/snowboard.gif b/assets/sv4d_videos/snowboard.gif
diff --git a/assets/sv4d_videos/snowboard_v000.mp4 b/assets/sv4d_videos/snowboard_v000.mp4
diff --git a/assets/sv4d_videos/stroller_v000.mp4 b/assets/sv4d_videos/stroller_v000.mp4
diff --git a/assets/sv4d_videos/test_video2.mp4 b/assets/sv4d_videos/test_video2.mp4
diff --git a/assets/sv4d_videos/train_v0.mp4 b/assets/sv4d_videos/train_v0.mp4
diff --git a/assets/sv4d_videos/wave_hello.mp4 b/assets/sv4d_videos/wave_hello.mp4
diff --git a/assets/sv4d_videos/windmill.gif b/assets/sv4d_videos/windmill.gif
diff --git a/requirements/pt2.txt b/requirements/pt2.txt
@@ -5,13 +5,16 @@ einops>=0.6.1
 fairscale>=0.4.13
 fire>=0.5.0
 fsspec>=2023.6.0
+imageio[ffmpeg]
+imageio[pyav]
 invisible-watermark>=0.2.0
 kornia==0.6.9
 matplotlib>=3.7.2
 natsort>=8.4.0
 ninja>=1.11.1
-numpy>=1.24.4
+numpy==2.1
 omegaconf>=2.3.0
+onnxruntime
 open-clip-torch>=2.20.0
 opencv-python==4.6.0.66
 pandas>=2.0.3