
Official baseline code for the Third REACT Challenge (REACT2025)

[Homepage] [Reference Paper (TBA)] [Code]

This repository provides baseline methods for the Third REACT Challenge.

Baseline paper:

MARS dataset:

Challenge Description

Given the spatio-temporal behaviours expressed by a speaker during a time period, the REACT 2025 Challenge consists of the following two sub-challenges, whose theoretical underpinnings are defined and detailed in this paper.

Task 1 - Offline Appropriate Facial Reaction Generation

This task aims to develop a deep learning model that takes the entire speaker behaviour sequence as input and generates multiple appropriate and realistic/naturalistic spatio-temporal facial reactions, each consisting of AUs, facial expressions, and valence and arousal states representing the predicted facial reaction. In other words, multiple facial reactions must be generated for each input speaker behaviour.

Task 2 - Online Appropriate Facial Reaction Generation

This task aims to develop a deep learning model that generates the facial reaction frame by frame, rather than taking all speaker frames into consideration at once. The model is expected to gradually generate all facial reaction frames, forming multiple appropriate and realistic/naturalistic spatio-temporal facial reactions, each consisting of AUs, facial expressions, and valence and arousal states representing the predicted facial reaction. As with the offline task, multiple facial reactions must be generated for each input speaker behaviour.

🛠️ Dependency Installation

We provide detailed instructions for setting up the environment using conda. First, create and activate a new environment:

conda create -n react python=3.10
conda activate react

1. Install PyTorch

First, check your CUDA version:

nvidia-smi

Visit the PyTorch official website to get the appropriate installation command. For example:

conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

2. Install PyTorch3D Dependencies

Install the following dependencies:

conda install -c fvcore -c iopath -c conda-forge fvcore iopath

For CUDA versions older than 11.7, you will need to install the CUB library:

conda install -c bottler nvidiacub

3. Install PyTorch3D

First, verify your CUDA version in Python:

import torch
torch.version.cuda

Download the appropriate PyTorch3D package from Anaconda based on your Python, CUDA, and PyTorch versions (it must match the versions you installed above). For example, for Python 3.10, CUDA 11.6, and PyTorch 1.12.0:

# linux-64_pytorch3d-0.7.5-py310_cu116_pyt1120.tar.bz2
conda install linux-64_pytorch3d-0.7.5-py310_cu116_pyt1120.tar.bz2

4. Install Additional Dependencies

Install all remaining dependencies specified in requirements.txt:

pip install -r requirements.txt
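
After installation, you can optionally run a quick sanity check to confirm that PyTorch and PyTorch3D import correctly and see your GPU. This is only a minimal sketch; it uses nothing beyond the packages installed above:

import torch
import pytorch3d

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)
print("PyTorch3D:", pytorch3d.__version__)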

👨‍🏫 Get Started

Data

Challenge Data Description (Homepage):

We divided the dataset into training, validation, and test sets following an approximate 60%/20%/20% split. Specifically, we split the data with a subject-independent strategy (i.e., the same subject never appears in both the training and test sets).

  • video-raw folder contains raw videos (at a resolution of 1920 × 1080)
  • video-face-crop folder contains face-cropped videos (at a resolution of 384 × 384)
  • facial-attributes folder contains sequences of frame-level 25-dimensional facial attributes (15 AU occurrences, valence and arousal intensities, and the probabilities of eight categorical facial expressions); see the loading sketch after this list
  • coefficients folder contains sequences of 58-dimensional 3DMM coefficients (52-d expression, 3-d rotation, and 3-d translation) extracted from the corresponding videos
  • audio folder contains wav files extracted from the raw video files
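
The frame-level .npy files can be inspected directly with NumPy. The following is only a minimal loading sketch: it assumes the ./data layout shown further below, reuses a file name from the example tree for illustration, assumes coefficients follow the same speaker/listener/session layout, and treats the ordering of the 25 attribute dimensions as an assumption based on the description above rather than a specification:

import numpy as np

# Frame-level facial attributes, shape (num_frames, 25):
# per the description above, these cover 15 AU occurrences, valence/arousal
# intensities, and 8 expression probabilities (exact ordering assumed here).
attrs = np.load("./data/train/facial-attributes/listener/session0/"
                "Camera-2024-06-21-103121-103102.npy")

# 3DMM coefficients, shape (num_frames, 58):
# 52-d expression + 3-d rotation + 3-d translation.
coeffs = np.load("./data/train/coefficients/listener/session0/"
                 "Camera-2024-06-21-103121-103102.npy")

print(attrs.shape, coeffs.shape)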

Appropriate real facial reactions (Ground-Truths):

  • During data recording, the semantic contexts are carefully controlled through the 23 distinct sessions (session0, session1, …, session22), each of which is guided by a few pre-defined sentences posted by the speaker. This provides a consistent session-specific context across dyadic interactions between different speakers and listeners. More specifically, for the speaker behaviour expressed in a specific session, we define all facial reactions expressed by different listeners under the same session to be appropriate facial reactions (i.e., ground-truth) for responding to it.
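
As an illustration of this definition, the appropriate ground-truth set for a speaker clip can be obtained by grouping listener reactions by session. The following is only a minimal sketch assuming the ./data layout below; the helper name is hypothetical and not part of the released code:

from collections import defaultdict
from pathlib import Path

def appropriate_reactions_by_session(split_dir):
    # Map each session to all listener facial-attribute files in that session.
    # For a speaker behaviour recorded in session k, every listener reaction
    # from session k is treated as an appropriate (ground-truth) reaction.
    by_session = defaultdict(list)
    listener_root = Path(split_dir) / "facial-attributes" / "listener"
    for npy_file in sorted(listener_root.glob("session*/*.npy")):
        by_session[npy_file.parent.name].append(npy_file)
    return by_session

# e.g. all ground-truth reactions for any speaker clip from session0 of the training split
gts = appropriate_reactions_by_session("./data/train")
print(len(gts["session0"]))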

Data organization (./data) is listed below (an example of the data structure):


├── val
├── test
├── train
    ├── coefficients (.npy)
    ├── video-face-crop (.mp4)
    ├── video-raw (.mp4)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.mp4
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.mp4
                ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.mp4
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.mp4
                ├── ...
    ├── facial-attributes (.npy)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.npy
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.npy
                ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.npy
                ├── ...
            ├── ...
            ├── session22
                ├── Camera-2024-07-17-104338-104241.npy
                ├── ...
    ├── audio (.wav)
        ├── speaker
            ├── session0
                ├── Camera-2024-06-21-103121-103102.wav
                ├── ...
            ├── ...
        ├── listener
            ├── session0
                ├── Camera-2024-06-21-103121-103102.wav
                ├── ...
            ├── ...

External Tool Preparation

We use 3DMM coefficients to represent a 3D listener or speaker and for the subsequent 3D-to-2D frame rendering. The baselines leverage the 3DMM model to extract 3DMM coefficients and to render 3D facial reactions.

  • You should first download the 3DMM model (FaceVerse version 2) at this page

    and then put it in the folder (external/FaceVerse/data/).

    We provide our extracted 3DMM coefficients (which are used for our baseline visualisation) at OneDrive.

    We also provide mean_face.npy at this OneDrive link, std_face.npy at this OneDrive link, and reference_full.npy at this OneDrive link for 3DMM coefficient normalization. Please download them and put them in the folder (external/FaceVerse/). A minimal normalization sketch is shown below.
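
This is only a sketch of the assumed zero-mean/unit-variance normalization using the downloaded statistics; refer to the baseline code for the exact transform that is applied:

import numpy as np

# Assumed z-score normalization of 3DMM coefficients using the provided
# statistics (see the baseline code for the transform actually used).
mean_face = np.load("external/FaceVerse/mean_face.npy")
std_face = np.load("external/FaceVerse/std_face.npy")

def normalize_coeffs(coeffs):
    # coeffs: (num_frames, 58) 3DMM coefficients
    return (coeffs - mean_face) / (std_face + 1e-8)

def denormalize_coeffs(coeffs_norm):
    # invert the normalization before rendering with PIRender
    return coeffs_norm * (std_face + 1e-8) + mean_face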

Then, we use the 3D-to-2D rendering tool PIRender to render the final 2D facial reaction frames.

  • We re-trained PIRender, and the trained model is provided at the checkpoint link. Please put it in the folder (external/PIRender/).

Finally, please download the compressed folder named pretrained_models from this link and extract it into the project root directory.

Training

Trans-VAE

  • Run the following command to start training the Trans-VAE baseline for the offline task:
python main.py \
   data=motion_transvae \
   trainer=motion_transvae \
   trainer.batch_size=4 \
   trainer.max_seq_len=750 \
   trainer.window_size=8 \
   stage=fit \
   task=offline \
   data_dir=./data

    or for the online task:

python main.py \
  data=motion_transvae \
  trainer=motion_transvae \
  trainer.batch_size=2 \
  trainer.max_seq_len=256 \
  trainer.window_size=16 \
  stage=fit \
  task=online \
  data_dir=./data

PerFRDiff

  • Run the following command to start training the PerFRDiff baseline for the offline task:
python main.py \
    data=motion_diffusion \
    trainer=motion_diffusion \
    trainer.batch_size=2 \
    stage=fit \
    task=offline \
    data_dir=./data

    or for the online task:

python main.py \
    data=motion_diffusion \
    trainer=motion_diffusion \
    trainer.batch_size=8 \
    stage=fit \
    task=online \
    data_dir=./data

REGNN

  • Make sure you are in the folder regnn before running any commands related to REGNN.
  • First, extract the image features using the pre-trained swin_transformer (the pretrained weights are already provided in pretrained_models):
python feature_extraction.py
  • Then, train the REGNN by running the following command:
python train.py \
    --logs-dir='Gmm-logs' \
    --milestones=9 \
    --batch-size=64 \
    --layers=2 \
    --norm \
    --neighbor-pattern='all' \
    --convert-type='direct' \
    --loss-mid \
    --data-dir=../data


Pretrained weights

  • To be released

Evaluation

For evaluation, please refer to the test function in ./trainer/motion_diffusion.py (PerFRDiff baseline) or ./trainer/motion_transvae.py (Trans-VAE baseline). The metric computations are implemented in ./framework/utils/compute_metrics.py. The validation set can be treated as the test set by loading it via the provided dataloader file. As in the baseline paper, all facial reactions from different participants within the same session are defined as ground truths. The pretrained model weights will be released soon.

Trans-VAE

  • Run the following command to evaluate a trained Trans-VAE baseline for the offline task:
python main.py \
   data=motion_transvae \
   trainer=motion_transvae \
   trainer.batch_size=1 \
   trainer.max_seq_len=750 \
   trainer.window_size=8 \
   trainer.data_transform=zero_center \
   stage=test \
   task=offline \
   data_dir=./data \
   resume_id=<train-experiment-id>

    or for the online task:

python main.py \
  data=motion_transvae \
  trainer=motion_transvae \
  trainer.batch_size=1 \
  trainer.max_seq_len=256 \
  trainer.window_size=16 \
  trainer.data_transform=zero_center \
  stage=test \
  task=online \
  data_dir=./data \
  resume_id=<train-experiment-id>

PerFRDiff

  • Run the following command to evaluate a trained PerFRDiff baseline for the offline task:
 python main.py \
    data=motion_diffusion \
    trainer=motion_diffusion \
    trainer.batch_size=1 \
    stage=test \
    task=offline \
    data_dir=./data \
    resume_id=<train-experiment-id>

    or for the online task:

 python main.py \
    data=motion_diffusion \
    trainer=motion_diffusion \
    trainer.batch_size=1 \
    stage=test \
    task=online \
    data_dir=./data \
    resume_id=<train-experiment-id>

🖊️ Citation

Submissions should cite the following papers:

Theory paper and baseline paper:

[1] Song, Siyang, Micol Spitale, Yiming Luo, Batuhan Bal, and Hatice Gunes. "Multiple Appropriate Facial Reaction Generation in Dyadic Interaction Settings: What, Why and How?." arXiv preprint arXiv:2302.06514 (2023).

[2] Song, Siyang, Micol Spitale, Cheng Luo, Cristina Palmero, German Barquero, Hengde Zhu, Sergio Escalera et al. "REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge." arXiv preprint arXiv:2401.05166 (2024).

[3] Song, Siyang, Micol Spitale, Cheng Luo, Germán Barquero, Cristina Palmero, Sergio Escalera, Michel Valstar et al. "REACT2023: The First Multiple Appropriate Facial Reaction Generation Challenge." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9620-9624. 2023.

Annotation, basic feature extraction tools and baselines:

[6] Song, Siyang, Yuxin Song, Cheng Luo, Zhiyuan Song, Selim Kuzucu, Xi Jia, Zhijiang Guo, Weicheng Xie, Linlin Shen, and Hatice Gunes. "GRATIS: Deep Learning Graph Representation with Task-specific Topology and Multi-dimensional Edge Features." arXiv preprint arXiv:2211.12482 (2022).

[7] Luo, Cheng, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. "Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition." In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pp. 1239-1246. 2022.

[8] Toisoul, Antoine, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. "Estimation of continuous valence and arousal levels from faces in naturalistic conditions." Nature Machine Intelligence 3, no. 1 (2021): 42-50.

[9] Eyben, Florian, Martin Wöllmer, and Björn Schuller. "Opensmile: the munich versatile and fast open-source audio feature extractor." In Proceedings of the 18th ACM international conference on Multimedia, pp. 1459-1462. 2010.

Submissions are encouraged to cite previous facial reaction generation papers:

[1] Huang, Yuchi, and Saad M. Khan. "Dyadgan: Generating facial expressions in dyadic interactions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11-18. 2017.

[2] Huang, Yuchi, and Saad Khan. "A generative approach for dynamically varying photorealistic facial expressions in human-agent interactions." In Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 437-445. 2018.

[3] Shao, Zilong, Siyang Song, Shashank Jaiswal, Linlin Shen, Michel Valstar, and Hatice Gunes. "Personality recognition by modelling person-specific cognitive processes using graph representation." In proceedings of the 29th ACM international conference on multimedia, pp. 357-366. 2021.

[4] Song, Siyang, Zilong Shao, Shashank Jaiswal, Linlin Shen, Michel Valstar, and Hatice Gunes. "Learning Person-specific Cognition from Facial Reactions for Automatic Personality Recognition." IEEE Transactions on Affective Computing (2022).

[5] Barquero, German, Sergio Escalera, and Cristina Palmero. "Belfusion: Latent diffusion for behavior-driven human motion prediction." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2317-2327. 2023.

[6] Zhou, Mohan, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, and Tao Mei. "Responsive listening head generation: a benchmark dataset and baseline." In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVIII, pp. 124-142. Cham: Springer Nature Switzerland, 2022.

[7] Luo, Cheng, Siyang Song, Weicheng Xie, Micol Spitale, Linlin Shen, and Hatice Gunes. "ReactFace: Multiple Appropriate Facial Reaction Generation in Dyadic Interactions." arXiv preprint arXiv:2305.15748 (2023).

[8] Xu, Tong, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, and Siyang Song. "Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation." arXiv preprint arXiv:2305.15270 (2023).

[9] Liang, Cong, Jiahe Wang, Haofan Zhang, Bing Tang, Junshan Huang, Shangfei Wang, and Xiaoping Chen. "Unifarn: Unified transformer for facial reaction generation." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9506-9510. 2023.

[10] Yu, Jun, Ji Zhao, Guochen Xie, Fengxin Chen, Ye Yu, Liang Peng, Minglei Li, and Zonghong Dai. "Leveraging the latent diffusion models for offline facial multiple appropriate reactions generation." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9561-9565. 2023.

[11] Hoque, Ximi, Adamay Mann, Gulshan Sharma, and Abhinav Dhall. "BEAMER: Behavioral Encoder to Generate Multiple Appropriate Facial Reactions." In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9536-9540. 2023.

[12] Zhu, Hengde, Xiangyu Kong, Weicheng Xie, Xin Huang, Linlin Shen, Lu Liu, Hatice Gunes, and Siyang Song. "Perfrdiff: Personalised weight editing for multiple appropriate facial reaction generation." In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 9495-9504. 2024.

[13] Zhu, Hengde, Xiangyu Kong, Weicheng Xie, Xin Huang, Xilin He, Lu Liu, Linlin Shen, Wei Zhang, Hatice Gunes, and Siyang Song. "PerReactor: Offline Personalised Multiple Appropriate Facial Reaction Generation." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 2, pp. 1665-1673. 2025.

🤝 Acknowledgement

Thanks to the following open-source projects:
