[Homepage] [Reference Paper (TBA)] [Code]
This repository provides baseline methods for the Third REACT Challenge
- Please send the signed EULA (https://github.com/reactmultimodalchallenge/baseline_react2025/blob/main/EULA_MARS%20dataset.pdf) to Dr Siyang Song at [email protected]
Given the spatio-temporal behaviours expressed by a speaker over a time period, the proposed REACT 2025 Challenge consists of the following two sub-challenges, whose theoretical underpinnings are defined and detailed in this paper.
Offline facial reaction generation: this task aims to develop a deep learning model that takes the entire speaker behaviour sequence as input and generates multiple appropriate and realistic/naturalistic spatio-temporal facial reactions, each consisting of AUs, facial expressions, and valence and arousal states representing the predicted facial reaction. Multiple facial reactions therefore need to be generated for each input speaker behaviour.
Online facial reaction generation: this task aims to develop a deep learning model that estimates each facial reaction frame in turn, rather than taking the whole sequence into consideration at once. The model is expected to gradually generate all facial reaction frames, forming multiple appropriate and realistic/naturalistic spatio-temporal facial reactions, each consisting of AUs, facial expressions, and valence and arousal states representing the predicted facial reaction. Multiple facial reactions therefore need to be generated for each input speaker behaviour.
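For illustration, a minimal I/O sketch of the two settings is given below. The function names, tensor shapes, and dummy outputs are assumptions rather than the official baseline interface; each reaction frame is assumed to be the 25-dimensional facial-attribute vector described in the Data section.

```python
# Illustrative only: names, shapes, and dummy outputs are assumptions, not the baseline API.
import torch

NUM_ATTRS = 25  # 15 AUs, valence, arousal, and 8 facial-expression probabilities


def offline_reaction(speaker_seq: torch.Tensor, num_samples: int = 10) -> torch.Tensor:
    """Offline task: the model observes the whole speaker sequence (T, feat_dim) at
    once and returns several appropriate reactions of shape (num_samples, T, 25)."""
    T = speaker_seq.shape[0]
    return torch.zeros(num_samples, T, NUM_ATTRS)  # placeholder for a real model


def online_reaction_step(speaker_so_far: torch.Tensor,
                         reaction_so_far: torch.Tensor) -> torch.Tensor:
    """Online task: at each step the model conditions only on the speaker frames
    observed so far (t, feat_dim) and the reaction frames it has already generated
    (t, 25), and emits the next 25-D reaction frame."""
    return torch.zeros(NUM_ATTRS)  # placeholder for a real model
```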
We provide detailed instructions for setting up the environment using conda. First, create and activate a new environment:
conda create -n react python=3.10
conda activate react

Next, check your CUDA version:
nvidia-smi

Visit the PyTorch official website to get the appropriate installation command. For example:
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Install the following dependencies:
conda install -c fvcore -c iopath -c conda-forge fvcore iopath

For CUDA versions older than 11.7, you will need to install the CUB library:
conda install -c bottler nvidiacub

Then, verify your CUDA version in Python:
import torch
torch.version.cuda

Download the appropriate PyTorch3D package from Anaconda based on your Python, CUDA, and PyTorch versions. For example, for Python 3.10, CUDA 11.6, and PyTorch 1.12.0:
# linux-64_pytorch3d-0.7.5-py310_cu116_pyt1120.tar.bz2
conda install linux-64_pytorch3d-0.7.5-py310_cu116_pyt1120.tar.bz2

Install all remaining dependencies specified in requirements.txt:
pip install -r requirements.txt
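Optionally, the quick sanity check below (a minimal sketch, not part of the official setup) confirms that PyTorch, CUDA, and PyTorch3D are visible from the new environment:

```python
# Run inside the activated "react" environment to confirm the installation above.
import torch
import pytorch3d  # fails here if the PyTorch3D package was not installed correctly

print("torch version:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("pytorch3d version:", pytorch3d.__version__)
```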
Data

Challenge Data Description (Homepage):
We divided the dataset into training, test, and validation sets following an approximate 60%/20%/20% splitting ratio. Specifically, we split the data with a subject-independent strategy (i.e., the same subject never appears in both the training and test sets).
- video-raw folder contains raw videos (at a resolution of 1920 × 1080)
- video-face-crop folder contains face-cropped videos (at a resolution of 384 × 384)
- facial-attributes folder contains sequences of frame-level 25-dimension facial attributes (15 AUs’ occurrences, valence and arousal intensities, and the probabilities of eight categorical facial expressions)
- coefficients folder contains sequences of 58-dimension (52-d expression, 3-d rotation, and 3-d translation) 3DMM coefficients extracted from corresponding videos
- audio folder contains wav files extracted from raw video files
Appropriate real facial reactions (Ground-Truths):
- During data recording, the semantic contexts are carefully controlled through 23 distinct sessions (session0, session1, …, session22), each of which is guided by a few pre-defined sentences posed by the speaker. This provides a consistent session-specific context across dyadic interactions between different speakers and listeners. More specifically, for the speaker behaviour expressed in a given session, we define all facial reactions expressed by different listeners in the same session as appropriate facial reactions (i.e., ground-truths) for responding to it (a minimal grouping sketch is shown below).
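The fragment below illustrates this appropriateness rule only; it is not the official dataloader. The directory layout follows the structure shown in the next section, and the split and session names are examples.

```python
# Minimal sketch, assuming the ./data layout shown below; not the official dataloader.
from pathlib import Path


def appropriate_ground_truths(data_dir: str, split: str, session: str) -> list[Path]:
    """All listener facial-attribute files recorded in the given session: per the rule
    above, each of them is an appropriate ground truth for any speaker behaviour
    expressed in that session."""
    session_dir = Path(data_dir) / split / "facial-attributes" / "listener" / session
    return sorted(session_dir.glob("*.npy"))


# Example: every appropriate reaction for speaker clips from session0 of the train split.
gts = appropriate_ground_truths("./data", "train", "session0")
```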
Data organization (./data) is listed below:
An example of the data structure:
├── val
├── test
├── train
    ├── coefficients (.npy)
    ├── video-face-crop (.mp4)
    ├── video-raw (.mp4)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.mp4
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.mp4
                ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.mp4
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.mp4
                ├── ...
    ├── facial-attributes (.npy)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.npy
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.npy
                ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.npy
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.npy
                ├── ...
    ├── audio (.wav)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.wav
                ├── ...
            ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.wav
                ├── ...
            ├── ...
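The snippet below is a small loading sketch based on the layout above and the dimensions stated earlier (25-D facial attributes, 58-D 3DMM coefficients); the exact array shapes and per-dimension ordering should be verified against the released files.

```python
# Loading sketch, assuming the layout above; verify shapes against the released data.
import numpy as np

clip = "Camera-2024-06-21-103121-103102.npy"
attrs = np.load(f"./data/train/facial-attributes/listener/session0/{clip}")
coeffs = np.load(f"./data/train/coefficients/listener/session0/{clip}")

print(attrs.shape)   # expected (num_frames, 25): 15 AUs, valence, arousal, 8 expressions
print(coeffs.shape)  # expected (num_frames, 58): 52 expression + 3 rotation + 3 translation

# Split the 58-D coefficients into the documented components (order as stated above).
expression, rotation, translation = np.split(coeffs, [52, 55], axis=-1)
print(expression.shape, rotation.shape, translation.shape)  # (T, 52), (T, 3), (T, 3)
```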
External Tool Preparation
We use 3DMM coefficients to represent a 3D listener or speaker, and for the subsequent 3D-to-2D frame rendering. The baselines leverage a 3DMM model to extract the 3DMM coefficients and to render 3D facial reactions.
- You should first download the 3DMM model (FaceVerse version 2) from this page and put it in the folder external/FaceVerse/data/.
- We provide our extracted 3DMM coefficients (which are used for our baseline visualisation) at OneDrive.
- We also provide mean_face.npy at this OneDrive link, std_face.npy at this OneDrive link, and reference_full.npy at this OneDrive link for 3DMM coefficient data normalization (a minimal normalization sketch follows this list). Please download them and put them in the folder external/FaceVerse/.
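Below is a minimal normalization sketch, assuming mean_face.npy and std_face.npy hold the per-dimension mean and standard deviation of the 58-D 3DMM coefficients; please check the baseline dataloaders for the exact convention actually used.

```python
# Assumption: mean_face.npy and std_face.npy store per-dimension statistics of the
# 58-D 3DMM coefficients; the baseline dataloaders are the reference for the real convention.
import numpy as np

mean_face = np.load("external/FaceVerse/mean_face.npy")
std_face = np.load("external/FaceVerse/std_face.npy")


def normalize_coeffs(coeffs: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-center and scale raw 3DMM coefficients of shape (num_frames, 58)."""
    return (coeffs - mean_face) / (std_face + eps)


def denormalize_coeffs(coeffs_norm: np.ndarray) -> np.ndarray:
    """Map normalized coefficients back to the raw space before rendering."""
    return coeffs_norm * std_face + mean_face
```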
Then, we use the 3D-to-2D tool PIRender to render the final 2D facial reaction frames.
- We re-trained PIRender, and the well-trained model is provided at the checkpoint. Please put it in the folder external/PIRender/.
Finally, please download the compressed folder named pretrained_models from this link, and extract it into the project root directory.
Training
Trans-VAE
- Run the following command to start training the Trans-VAE baseline for the offline task:
python main.py \
data=motion_transvae \
trainer=motion_transvae \
trainer.batch_size=4 \
trainer.max_seq_len=750 \
trainer.window_size=8 \
stage=fit \
task=offline \
data_dir=./data

or for the online task:
python main.py \
data=motion_transvae \
trainer=motion_transvae \
trainer.batch_size=2 \
trainer.max_seq_len=256 \
trainer.window_size=16 \
stage=fit \
task=online \
data_dir=./data

PerFRDiff
- Run the following command to start training the PerFRDiff baseline for the offline task:
python main.py \
data=motion_diffusion \
trainer=motion_diffusion \
trainer.batch_size=2 \
stage=fit \
task=offline \
data_dir=./data

or for the online task:
python main.py \
data=motion_diffusion \
trainer=motion_diffusion \
trainer.batch_size=8 \
stage=fit \
task=online \
data_dir=./data

REGNN
- Make sure you are in the folder regnn before running any commands related to REGNN.
- First, extract the image features using the pre-trained Swin Transformer (pretrained weights are already provided in pretrained_models):
python feature_extraction.py

- Then, train the REGNN by running the following command:
python train.py \
--logs-dir='Gmm-logs' \
--milestones=9 \
--batch-size=64 \
--layers=2 \
--norm \
--neighbor-pattern='all' \
--convert-type='direct' \
--loss-mid \
--data-dir=../data

Pretrained weights
- to be released
Evaluation
For evaluation, please refer to the test function in ./trainer/motion_diffusion.py (PerFRDiff baseline) or ./trainer/motion_transvae.py (Trans-VAE baseline). The metric computations are implemented in ./framework/utils/compute_metrics.py. The validation set can be treated as the test set by loading it via the provided dataloader file. As in the baseline paper, all facial reactions from different participants within the same session are defined as ground-truths (a conceptual sketch of this matching is given below).
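To make the ground-truth definition concrete, the fragment below is a conceptual illustration only; the actual challenge metrics are the ones implemented in ./framework/utils/compute_metrics.py. It simply compares a predicted reaction against the whole set of same-session listener reactions and keeps the best match.

```python
# Conceptual illustration only; the official metrics live in
# ./framework/utils/compute_metrics.py and are not this simple MSE matching.
import numpy as np


def distance_to_gt_set(pred: np.ndarray, gt_set: list[np.ndarray]) -> float:
    """Mean-squared distance from a predicted reaction (T, 25) to the closest
    appropriate ground truth among all same-session listener reactions."""
    return min(float(np.mean((pred - gt) ** 2)) for gt in gt_set)
```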
The pretrained model weights will be released soon.
Trans-VAE
- Run the following command to evaluate a trained Trans-VAE baseline for the offline task:
python main.py \
data=motion_transvae \
trainer=motion_transvae \
trainer.batch_size=1 \
trainer.max_seq_len=750 \
trainer.window_size=8 \
trainer.data_transform=zero_center \
stage=test \
task=offline \
data_dir=/home/x/xk18/REACT2025 \
resume_id=<train-experiment-id>

or for the online task:
python main.py \
data=motion_transvae \
trainer=motion_transvae \
trainer.batch_size=1 \
trainer.max_seq_len=256 \
trainer.window_size=16 \
trainer.data_transform=zero_center \
stage=test \
task=online \
data_dir=/home/x/xk18/REACT2025 \
resume_id=<train-experiment-id>

PerFRDiff
- Run the following command to evaluate a trained PerFRDiff baseline for the offline task:
python main.py \
data=motion_diffusion \
trainer=motion_diffusion \
trainer.batch_size=1 \
stage=test \
task=offline \
data_dir=/home/x/xk18/REACT2025 \
resume_id=<train-experiment-id>

or for the online task:
python main.py \
data=motion_diffusion \
trainer=motion_diffusion \
trainer.batch_size=1 \
stage=test \
task=online \
data_dir=/home/x/xk18/REACT2025 \
resume_id=<train-experiment-id>

[1] Song, Siyang, Micol Spitale, Yiming Luo, Batuhan Bal, and Hatice Gunes. "Multiple Appropriate Facial Reaction Generation in Dyadic Interaction Settings: What, Why and How?." arXiv preprint arXiv:2302.06514 (2023).
[2] Song, Siyang, Micol Spitale, Cheng Luo, Cristina Palmero, German Barquero, Hengde Zhu, Sergio Escalera et al. "REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge." arXiv preprint arXiv:2401.05166 (2024).
[3] Song, Siyang, Micol Spitale, Cheng Luo, Germán Barquero, Cristina Palmero, Sergio Escalera, Michel Valstar et al. "REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9620-9624. 2023.
[6] Song, Siyang, Yuxin Song, Cheng Luo, Zhiyuan Song, Selim Kuzucu, Xi Jia, Zhijiang Guo, Weicheng Xie, Linlin Shen, and Hatice Gunes. "GRATIS: Deep Learning Graph Representation with Task-specific Topology and Multi-dimensional Edge Features." arXiv preprint arXiv:2211.12482 (2022).
[7] Luo, Cheng, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. (2022, July) "Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition." Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (pp. 1239-1246).
[8] Toisoul, Antoine, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. "Estimation of continuous valence and arousal levels from faces in naturalistic conditions." Nature Machine Intelligence 3, no. 1 (2021): 42-50.
[9] Eyben, Florian, Martin Wöllmer, and Björn Schuller. "Opensmile: the munich versatile and fast open-source audio feature extractor." In Proceedings of the 18th ACM international conference on Multimedia, pp. 1459-1462. 2010.
[1] Huang, Yuchi, and Saad M. Khan. "Dyadgan: Generating facial expressions in dyadic interactions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11-18. 2017.
[2] Huang, Yuchi, and Saad Khan. "A generative approach for dynamically varying photorealistic facial expressions in human-agent interactions." In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 437-445. 2018.
[3] Shao, Zilong, Siyang Song, Shashank Jaiswal, Linlin Shen, Michel Valstar, and Hatice Gunes. "Personality recognition by modelling person-specific cognitive processes using graph representation." In proceedings of the 29th ACM international conference on multimedia, pp. 357-366. 2021.
[4] Song, Siyang, Zilong Shao, Shashank Jaiswal, Linlin Shen, Michel Valstar, and Hatice Gunes. "Learning Person-specific Cognition from Facial Reactions for Automatic Personality Recognition." IEEE Transactions on Affective Computing (2022).
[5] Barquero, German, Sergio Escalera, and Cristina Palmero. "Belfusion: Latent diffusion for behavior-driven human motion prediction." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2317-2327. 2023.
[6] Zhou, Mohan, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, and Tao Mei. "Responsive listening head generation: a benchmark dataset and baseline." In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVIII, pp. 124-142. Cham: Springer Nature Switzerland, 2022.
[7] Luo, Cheng, Siyang Song, Weicheng Xie, Micol Spitale, Linlin Shen, and Hatice Gunes. "ReactFace: Multiple Appropriate Facial Reaction Generation in Dyadic Interactions." arXiv preprint arXiv:2305.15748 (2023).
[8] Xu, Tong, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, and Siyang Song. "Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation." arXiv preprint arXiv:2305.15270 (2023).
[9] Liang, Cong, Jiahe Wang, Haofan Zhang, Bing Tang, Junshan Huang, Shangfei Wang, and Xiaoping Chen. "Unifarn: Unified transformer for facial reaction generation." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9506-9510. 2023.
[10] Yu, Jun, Ji Zhao, Guochen Xie, Fengxin Chen, Ye Yu, Liang Peng, Minglei Li, and Zonghong Dai. "Leveraging the latent diffusion models for offline facial multiple appropriate reactions generation." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9561-9565. 2023.
[11] Hoque, Ximi, Adamay Mann, Gulshan Sharma, and Abhinav Dhall. "BEAMER: Behavioral Encoder to Generate Multiple Appropriate Facial Reactions." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9536-9540. 2023.
[12] Zhu, Hengde, Xiangyu Kong, Weicheng Xie, Xin Huang, Linlin Shen, Lu Liu, Hatice Gunes, and Siyang Song. "Perfrdiff: Personalised weight editing for multiple appropriate facial reaction generation." In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9495-9504. 2024.
[13] Zhu, Hengde, Xiangyu Kong, Weicheng Xie, Xin Huang, Xilin He, Lu Liu, Linlin Shen, Wei Zhang, Hatice Gunes, and Siyang Song. "PerReactor: Offline Personalised Multiple Appropriate Facial Reaction Generation." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 2, pp. 1665-1673. 2025.
Thanks to the following open-source projects: