[CVPR 25 Highlight] Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
This repository contains the PyTorch code and associated datasets for our CVPR 2025 paper:
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman
Project website: https://vision.cs.utexas.edu/projects/lang-view
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose LangView, a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. Those pseudo-labels are then used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video (no language or camera poses) and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both on quantitative metrics and in human evaluation.
- Code ✅
- Env setup instructions ✅
- Train and test bash commands for Ego-Exo4D ✅
- Checkpoint release for Ego-Exo4D ✅
- Data release for Ego-Exo4D ✅
- Auto-metric eval scripts for Ego-Exo4D ✅
- train.py, test.py for LEMMA ✅
- Auto-metric eval scripts for LEMMA (coming soon)
- Train and test bash commands for LEMMA ✅
- Checkpoint release for LEMMA ✅
- Data release for LEMMA ✅
This code has been tested with Python 3.9.18, torch 2.2.2+cu121, and torchvision 0.17.2+cu121. Additional Python package requirements are listed in requirements.txt.
Install the remaining dependencies either by running
pip3 install -r requirements.txt
or by parsing requirements.txt for the names and versions of the individual dependencies and installing them one at a time.
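For instance, here is a minimal sketch of a clean setup, assuming Python 3.9 and the venv module are available on your system (the environment name lang-view-env is just a placeholder):
# create and activate a fresh virtual environment
python3.9 -m venv lang-view-env
source lang-view-env/bin/activate
# install the pinned dependencies from the repo root
pip3 install -r requirements.txt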
Download the data segments from this link, copy them to the repo root and run the following commands:
cat data_part_* > data.tar
tar -xvf data.tar
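Once the data has been extracted (the paths below assume it lives under data/ at the repo root), the intermediate archive pieces are no longer needed and can optionally be removed to free disk space:
# optional cleanup after verifying the extraction succeeded
rm data.tar data_part_*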
For the LEMMA frames, download the data from the dataset website and symlink data/lemma/datapoint_images to the data-002 directory inside the downloaded data directory.
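A minimal example of creating that link, where /path/to/lemma_download is a placeholder for wherever you downloaded the LEMMA data:
# point data/lemma/datapoint_images at the downloaded data-002 directory
ln -s /path/to/lemma_download/data-002 data/lemma/datapoint_images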
Download the EgoVLPv2 pretrained checkpoint from this link and put it at this path: pretrained_checkpoints/egovlpV2_model_best_egoExo30nov2024.pth.
Download the Lang-View checkpoint directory from this link and put it at this path: runs.
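As a quick sanity check (assuming the paths above and the released run directories referenced by the commands below), the following should list both artifacts without errors:
# verify the EgoVLPv2 checkpoint and the released run directories are in place
ls pretrained_checkpoints/egovlpV2_model_best_egoExo30nov2024.pth
ls runs/egoExo4d_release runs/lemma_release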
To train on Ego-Exo4D, run:
python3 train.py --run-dir runs/egoExo4d_release --log-tb --data-parallel --use-datapointVideoClips --randomize-trainViewOrder --unfreeze-videoEncoder --use-minMultiHotLoss --trainDatapoints-filePath data/labels/train/videoLlama_cider_all3Agree.pkl,data/labels/train/videoLlamaWvicuna_cider_all3Agree.pkl,data/labels/train/videoChat2_cider_all3Agree.pkl --valDatapoints-filePath data/labels/val/videoLlama_cider_all3Agree.pkl,data/labels/val/videoLlamaWvicuna_cider_all3Agree.pkl,data/labels/val/videoChat2_cider_all3Agree.pkl --multiBestViewAggregator-multiPseudoLabler --use-relativeCameraPoseLoss --maskOut-invalidRelativeCameraPoseLoss-inTraining --relativeCameraPoseLoss-rotationInAngles --relativeCameraPoseLoss-rotationAsClasses --relativeCameraPoseLoss-coordsInAngles --relativeCameraPoseLoss-coordsAsClasses
To test on Ego-Exo4D, run:
python3 test.py --run-dir runs/egoExo4d_release --data-parallel --use-datapointVideoClips --unfreeze-videoEncoder --use-relativeCameraPoseLoss --relativeCameraPoseLoss-rotationInAngles --relativeCameraPoseLoss-rotationAsClasses --relativeCameraPoseLoss-coordsInAngles --relativeCameraPoseLoss-coordsAsClasses
To compute auto-metrics, run the following scripts in order: scripts/ego_exo4d/format_predictedVIewScores.ipynb, scripts/ego_exo4d/run_captioningMetrics.py, and scripts/ego_exo4d/compute_captioningScores.ipynb.
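If you prefer to run these non-interactively, one possible way is to execute the notebooks with jupyter nbconvert (this assumes jupyter is installed in the environment; the metric script is shown without arguments, so pass any it expects):
# execute the score-formatting notebook in place
jupyter nbconvert --to notebook --execute --inplace scripts/ego_exo4d/format_predictedVIewScores.ipynb
# compute the captioning metrics
python3 scripts/ego_exo4d/run_captioningMetrics.py
# aggregate the captioning scores
jupyter nbconvert --to notebook --execute --inplace scripts/ego_exo4d/compute_captioningScores.ipynb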
To train on LEMMA, run:
python3 train_lemma.py --run-dir runs/lemma_release --data-parallel --isLemma-dataset --use-datapointVideoClips --randomize-trainViewOrder --unfreeze-videoEncoder --use-minMultiHotLoss --trainDatapoints-filePath data/lemma/labels/train/videoLlama_cider_all3Agree.pkl,data/lemma/labels/train/videoLlamaWvicuna_cider_all3Agree.pkl,data/lemma/labels/train/videoChat2_cider_all3Agree.pkl --valDatapoints-filePath data/lemma/labels/val/videoLlama_cider_all3Agree.pkl,data/lemma/labels/val/videoLlamaWvicuna_cider_all3Agree.pkl,data/lemma/labels/val/videoChat2_cider_all3Agree.pkl --multiBestViewAggregator-multiPseudoLabler --use-egovlpV2-patchLevelVisualFeats
To test on LEMMA, run:
python3 test_lemma.py --isLemma-dataset --data-parallel --run-dir runs/lemma_release --use-datapointVideoClips --unfreeze-videoEncoder --use-egovlpV2-patchLevelVisualFeats
Auto-metric computation scripts for LEMMA: COMING SOON!
@InProceedings{Majumder_2025_CVPR,
author = {Majumder, Sagnik and Nagarajan, Tushar and Al-Halah, Ziad and Pradhan, Reina and Grauman, Kristen},
title = {Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {29016-29028}
}
This project is released under the MIT license, as found in the LICENSE file.
