This repository provides a pipeline for preprocessing American Sign Language (ASL) datasets, following the method from "YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus" (Uthus et al., 2023). It is designed to handle both the YouTube-ASL and How2Sign datasets, streamlining the workflow from video acquisition to landmark extraction and preparing the data for ASL translation tasks.
All project settings are managed through `conf.py`, offering a single configuration point for the preprocessing pipeline. Key elements include:
- `ID`: Text file containing the YouTube video IDs to process
- `VIDEO_DIR`: Directory for downloaded videos
- `TRANSCRIPT_DIR`: Storage for JSON transcripts
- `OUTPUT_DIR`: Location for extracted features
- `CSV_FILE`: Path for processed segment data
- `YT_CONFIG`: YouTube download settings (video quality, format, rate limits)
- `LANGUAGE`: Supported language options for transcript retrieval
- `FRAME_SKIP`: Controls the frame sampling rate for efficient processing
- `MAX_WORKERS`: Number of parallel workers, to optimize performance
- `POSE_IDX`, `FACE_IDX`, `HAND_IDX`: Landmark indices selected as the relevant points for sign language analysis
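A minimal sketch of what `conf.py` might contain, purely for orientation; every path and value below is an illustrative placeholder, not the repository's actual defaults:

```python
# conf.py -- illustrative values only; adjust paths and values to your own setup.
ID = "video_ids.txt"                  # text file with one YouTube video ID per line
VIDEO_DIR = "data/videos"             # downloaded videos
TRANSCRIPT_DIR = "data/transcripts"   # JSON transcripts
OUTPUT_DIR = "data/features"          # extracted landmark features (.npy)
CSV_FILE = "data/segments.csv"        # processed segment data (tab-separated)

YT_CONFIG = {                         # YouTube download settings
    "format": "bestvideo[height<=720]",
    "ratelimit": 5_000_000,           # bytes/sec
}
LANGUAGE = ["en", "en-US"]            # transcript languages to try, in order of preference

FRAME_SKIP = 2                        # keep every 2nd frame
MAX_WORKERS = 4                       # parallel workers for feature extraction

# Landmark indices kept from the MediaPipe Holistic output (illustrative subsets)
POSE_IDX = [11, 12, 13, 14, 15, 16, 23, 24]
FACE_IDX = list(range(0, 468, 12))
HAND_IDX = list(range(21))
```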
- Ensure the constants in `conf.py` are correct.
- Run the following steps in order:
- **Step 1: Data Acquisition** (`s1_data_downloader.py`)
  - Necessary Constants: `ID`, `VIDEO_DIR`, `TRANSCRIPT_DIR`, `YT_CONFIG`, `LANGUAGE`
  - The script skips already downloaded content and implements rate limiting to prevent API throttling. A hedged sketch of this step follows below.
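As a rough illustration of what this step does, the sketch below downloads each video and its transcript, skipping files that already exist and pausing between requests. It assumes `yt-dlp` and `youtube-transcript-api` as the underlying libraries, which the actual script may or may not use:

```python
# Hypothetical sketch of the download step; the real s1_data_downloader.py may differ.
import json
import os
import time

from yt_dlp import YoutubeDL
from youtube_transcript_api import YouTubeTranscriptApi

from conf import ID, VIDEO_DIR, TRANSCRIPT_DIR, YT_CONFIG, LANGUAGE

with open(ID) as f:
    video_ids = [line.strip() for line in f if line.strip()]

os.makedirs(VIDEO_DIR, exist_ok=True)
os.makedirs(TRANSCRIPT_DIR, exist_ok=True)

for vid in video_ids:
    video_path = os.path.join(VIDEO_DIR, f"{vid}.mp4")
    transcript_path = os.path.join(TRANSCRIPT_DIR, f"{vid}.json")

    # Skip content that has already been downloaded.
    if os.path.exists(video_path) and os.path.exists(transcript_path):
        continue

    # Download the video using the configured quality/format/rate-limit settings.
    opts = dict(YT_CONFIG, outtmpl=video_path)
    with YoutubeDL(opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={vid}"])

    # Fetch the transcript in one of the supported languages and store it as JSON.
    transcript = YouTubeTranscriptApi.get_transcript(vid, languages=LANGUAGE)
    with open(transcript_path, "w") as f:
        json.dump(transcript, f)

    time.sleep(1)  # simple rate limiting between requests
```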
- **Step 2: Transcript Processing** (`s2_transcript_preprocess.py`)
  - Necessary Constants: `ID`, `TRANSCRIPT_DIR`, `CSV_FILE`
  - This step cleans the text (converting Unicode characters, removing brackets), filters segments based on length and duration, and saves them with precise timestamps as tab-separated values. A hedged sketch of this step follows below.
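A sketch of the cleaning and filtering logic, assuming the JSON transcripts use the `text`/`start`/`duration` fields returned by `youtube-transcript-api`; the regex, length bounds, and duration bounds are illustrative placeholders, not the script's real thresholds:

```python
# Hypothetical sketch of the transcript-cleaning step; thresholds are illustrative.
import csv
import json
import os
import re
import unicodedata

from conf import ID, TRANSCRIPT_DIR, CSV_FILE

MIN_CHARS, MAX_CHARS = 3, 300     # illustrative length filter (characters)
MIN_DUR, MAX_DUR = 0.2, 60.0      # illustrative duration filter (seconds)

def clean(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # convert Unicode variants
    text = re.sub(r"\[[^\]]*\]", "", text)       # remove bracketed content
    return re.sub(r"\s+", " ", text).strip()

with open(ID) as f:
    video_ids = [line.strip() for line in f if line.strip()]

with open(CSV_FILE, "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")     # tab-separated output
    writer.writerow(["video_id", "start", "end", "text"])
    for vid in video_ids:
        path = os.path.join(TRANSCRIPT_DIR, f"{vid}.json")
        if not os.path.exists(path):
            continue
        with open(path) as f:
            segments = json.load(f)
        for seg in segments:
            text = clean(seg["text"])
            duration = seg["duration"]
            if MIN_CHARS <= len(text) <= MAX_CHARS and MIN_DUR <= duration <= MAX_DUR:
                writer.writerow([vid, seg["start"], seg["start"] + duration, text])
```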
- **Step 3: Feature Extraction** (`s3_mediapipe_labelling.py`)
  - Necessary Constants: `CSV_FILE`, `VIDEO_DIR`, `OUTPUT_DIR`, `MAX_WORKERS`, `FRAME_SKIP`, `POSE_IDX`, `FACE_IDX`, `HAND_IDX`
  - The script processes each video segment according to its timestamps, extracting only the most relevant body keypoints for sign language analysis. It uses parallel processing to handle multiple videos efficiently. Results are saved as NumPy arrays. A hedged sketch of this step follows below.
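The sketch below illustrates the general idea of this step: seek to each segment, run MediaPipe Holistic on sampled frames, keep only the configured landmark indices, and save the result as a NumPy array. It is an assumption-laden approximation (the output naming, zero-filling of missing landmarks, and (x, y)-only features are illustrative choices), not the repository's exact implementation:

```python
# Hypothetical sketch of per-segment landmark extraction with MediaPipe Holistic.
import csv
import os
from concurrent.futures import ProcessPoolExecutor

import cv2
import mediapipe as mp
import numpy as np

from conf import (CSV_FILE, VIDEO_DIR, OUTPUT_DIR, MAX_WORKERS,
                  FRAME_SKIP, POSE_IDX, FACE_IDX, HAND_IDX)

def landmarks_to_array(lms, idx):
    """Keep only the selected landmarks' (x, y) coordinates; zero-fill if undetected."""
    if lms is None:
        return np.zeros((len(idx), 2), dtype=np.float32)
    pts = np.array([[p.x, p.y] for p in lms.landmark], dtype=np.float32)
    return pts[idx]

def process_segment(row):
    video_id, start, end, _text = row
    cap = cv2.VideoCapture(os.path.join(VIDEO_DIR, f"{video_id}.mp4"))
    cap.set(cv2.CAP_PROP_POS_MSEC, float(start) * 1000)   # seek to segment start
    frames, frame_idx = [], 0
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while cap.get(cv2.CAP_PROP_POS_MSEC) <= float(end) * 1000:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % FRAME_SKIP == 0:                # frame sampling
                res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                feat = np.concatenate([
                    landmarks_to_array(res.pose_landmarks, POSE_IDX),
                    landmarks_to_array(res.face_landmarks, FACE_IDX),
                    landmarks_to_array(res.left_hand_landmarks, HAND_IDX),
                    landmarks_to_array(res.right_hand_landmarks, HAND_IDX),
                ])
                frames.append(feat)
            frame_idx += 1
    cap.release()
    out_path = os.path.join(OUTPUT_DIR, f"{video_id}_{start}.npy")
    np.save(out_path, np.stack(frames) if frames else np.empty((0,)))

if __name__ == "__main__":
    with open(CSV_FILE) as f:
        rows = list(csv.reader(f, delimiter="\t"))[1:]     # skip header row
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(process_segment, rows))
```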
- Download the Green Screen RGB videos and the English Translation (manually re-aligned) from the How2Sign website.
- Place the video directory and `.csv` file at the expected paths, or amend the paths in `conf.py`.
- Run Step 3: Feature Extraction (`s3_mediapipe_labelling.py`) only.
- Video List: GitHub Repository
- Paper: "YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus" (Uthus et al., 2023)
If you use YouTube-ASL, please cite their associated paper:
```bibtex
@misc{uthus2023youtubeasl,
  author = {Uthus, David and Tanzer, Garrett and Georg, Manfred},
  title = {YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus},
  year = {2023},
  eprint = {2306.15162},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2306.15162},
}
```
- Dataset: How2Sign Website
- Paper: How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language
If you use How2Sign, please cite their associated paper:
```bibtex
@inproceedings{Duarte_CVPR2021,
  title = {{How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language}},
  author = {Duarte, Amanda and Palaskar, Shruti and Ventura, Lucas and Ghadiyaram, Deepti and DeHaan, Kenneth and Metze, Florian and Torres, Jordi and Giro-i-Nieto, Xavier},
  booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2021}
}
```