# ASL Translation Data Preprocessing

This repository follows the methodology described in the YouTube-ASL dataset paper and provides a comprehensive solution for preprocessing American Sign Language (ASL) datasets, specifically designed to handle both the **How2Sign** and **YouTube-ASL** datasets. Our preprocessing pipeline streamlines the workflow from video acquisition to landmark extraction, making the data ready for ASL translation tasks.

## Project Configuration

All project settings are centrally managed through `conf.py`, providing a single point of configuration for the entire preprocessing pipeline. Key configuration elements include:

- `ID`: Text file containing the YouTube video IDs to process
- `VIDEO_DIR`: Directory for downloaded videos
- `TRANSCRIPT_DIR`: Storage for JSON transcripts
- `OUTPUT_DIR`: Location for extracted features
- `CSV_FILE`: Path for processed segment data

- `YT_CONFIG`: YouTube download settings (video quality, format, rate limits)
- `LANGUAGE`: Supported language options for transcript retrieval
- `FRAME_SKIP`: Controls the frame sampling rate for efficient processing
- `MAX_WORKERS`: Manages parallel processing to optimize performance

- `POSE_IDX`, `FACE_IDX`, `HAND_IDX`: Landmark indices selecting the points most relevant for sign language analysis

This centralized approach allows easy adaptation to different hardware capabilities or dataset requirements without modifying the core processing code.
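
To make the layout concrete, here is a minimal sketch of what `conf.py` might look like. Every value below is an illustrative assumption, not the project's actual setting:

```python
# conf.py -- illustrative sketch only; substitute your own paths and limits.

ID = "video_ids.txt"                  # one YouTube video ID per line
VIDEO_DIR = "data/videos"             # downloaded videos
TRANSCRIPT_DIR = "data/transcripts"   # raw JSON transcripts
OUTPUT_DIR = "data/features"          # extracted landmark arrays (.npy)
CSV_FILE = "data/segments.csv"        # processed segment data

YT_CONFIG = {
    "format": "mp4",                  # yt-dlp format selection
    "ratelimit": 5_000_000,           # download cap (bytes/sec) to ease throttling
}
LANGUAGE = ["en", "en-US", "en-GB"]   # English transcript variants to accept

FRAME_SKIP = 2                        # process every 2nd frame
MAX_WORKERS = 4                       # parallel worker processes

# Hypothetical landmark subsets kept from the MediaPipe Holistic output.
POSE_IDX = [0, 11, 12, 13, 14, 15, 16, 23, 24]  # upper-body joints
FACE_IDX = [0, 4, 13, 14, 61, 291]              # a few expressive face points
HAND_IDX = list(range(21))                      # all 21 landmarks per hand
```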

## How To Use?

- **YouTube-ASL**: make sure the constants in `conf.py` are correct, then run Steps 1 through 3.
- **How2Sign**: download the **Green Screen RGB videos** and the **English Translation (manually re-aligned)** from the How2Sign website. Place the video directory and the .csv file at the configured paths (or amend the paths in `conf.py`), then run Step 3 only.

### Step 1: Data Acquisition (s1_data_downloader.py)

**Necessary Constants:** `ID`, `VIDEO_DIR`, `TRANSCRIPT_DIR`, `YT_CONFIG`, `LANGUAGE`

This step retrieves the listed YouTube videos with yt-dlp and collects their transcripts through the YouTube Transcript API, saving them as structured JSON. The script intelligently skips already downloaded content and implements rate limiting to prevent API throttling.
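
As a rough illustration of the download half of this step, here is a sketch; the helper name, the `<id>.mp4` naming convention, and the skip check are our assumptions, not necessarily the script's:

```python
# Hypothetical download helper for Step 1 (sketch, not the repository's code).
import os
from yt_dlp import YoutubeDL

def download_videos(id_file: str, video_dir: str, yt_config: dict) -> None:
    """Download every listed video, skipping IDs already present on disk."""
    os.makedirs(video_dir, exist_ok=True)
    with open(id_file) as f:
        video_ids = [line.strip() for line in f if line.strip()]

    opts = dict(yt_config)
    opts["outtmpl"] = os.path.join(video_dir, "%(id)s.%(ext)s")  # save as <id>.mp4

    with YoutubeDL(opts) as ydl:
        for vid in video_ids:
            if os.path.exists(os.path.join(video_dir, f"{vid}.mp4")):
                continue  # already downloaded -- avoid redundant work
            ydl.download([f"https://www.youtube.com/watch?v={vid}"])
```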

### Step 2: Transcript Processing (s2_transcript_preprocess.py)

**Necessary Constants:** `ID`, `TRANSCRIPT_DIR`, `CSV_FILE`

This step cleans the transcript text (normalizing Unicode to ASCII, removing bracketed annotations), filters segments by length and duration, and saves the surviving segments with precise timestamps as tab-separated values.
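
A minimal sketch of the kind of cleaning and filtering involved; the regex, the normalization choice, and the thresholds are illustrative assumptions rather than the script's exact rules:

```python
# Transcript cleaning/filtering sketch; thresholds are assumed, not the project's.
import re
import unicodedata

def clean_text(text: str) -> str:
    """Convert Unicode to plain ASCII and strip bracketed annotations like [Music]."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\[.*?\]|\(.*?\)", "", text)  # drop [...] and (...) spans
    return re.sub(r"\s+", " ", text).strip()

def keep_segment(text: str, duration: float) -> bool:
    """Keep only segments whose length and duration look usable for training."""
    return 0 < len(text) <= 300 and 0.2 <= duration <= 60.0
```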

### Step 3: Feature Extraction (s3_mediapipe_labelling.py)

**Necessary Constants:** `CSV_FILE`, `VIDEO_DIR`, `OUTPUT_DIR`, `MAX_WORKERS`, `FRAME_SKIP`, `POSE_IDX`, `FACE_IDX`, `HAND_IDX`

Using the MediaPipe Holistic model, this step processes each video segment according to its timestamps, extracting only the most relevant pose, face, and hand keypoints for sign language analysis. Segments are processed in parallel, and the results are saved as NumPy arrays.
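
The core per-segment extraction might look roughly like the following sketch, shown for pose landmarks only (the function name and seeking logic are our simplifications; face and hand landmarks follow the same pattern):

```python
# Per-segment landmark extraction sketch using MediaPipe Holistic (pose only).
import cv2
import mediapipe as mp
import numpy as np

def extract_segment(video_path: str, start: float, end: float,
                    pose_idx: list, frame_skip: int = 1) -> np.ndarray:
    """Return selected pose landmarks (x, y, z) for sampled frames in [start, end] seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    first, last = int(start * fps), int(end * fps)
    cap.set(cv2.CAP_PROP_POS_FRAMES, first)  # seek to the segment start

    frames = []
    with mp.solutions.holistic.Holistic() as holistic:
        for frame_no in range(first, last + 1):
            ok, frame = cap.read()
            if not ok:
                break
            if (frame_no - first) % frame_skip:
                continue  # honor FRAME_SKIP to reduce processing cost
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                lms = results.pose_landmarks.landmark
                frames.append([(lms[i].x, lms[i].y, lms[i].z) for i in pose_idx])
    cap.release()
    return np.asarray(frames, dtype=np.float32)  # later written out via np.save
```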

## Data Organization

The pipeline organizes processed data into clearly defined formats:

- Video content is stored as MP4 files for optimal quality and compatibility
- Transcripts are maintained in JSON format for easy parsing and manipulation
- Segment information is organized in CSV files for straightforward analysis
- Extracted landmarks are preserved as NumPy arrays (.npy files) for efficient processing
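
For example, a downstream training script can read one segment's features back directly (the file name here is hypothetical):

```python
import numpy as np

# Load one segment's extracted landmarks; in the sketch above the
# shape would be (num_sampled_frames, num_selected_points, 3).
features = np.load("data/features/VIDEOID_0001.npy")
print(features.shape)
```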

## Technical Requirements

The system relies on several key Python libraries:

- OpenCV (cv2) for video processing
- MediaPipe for pose and gesture recognition
- NumPy for efficient numerical operations
- Pandas for data manipulation
- yt-dlp for video downloading
- youtube-transcript-api for transcript retrieval

This preprocessing pipeline creates a robust foundation for ASL translation tasks, ensuring high-quality data preparation while maintaining processing efficiency.

## Dataset Introduction

### YouTube-ASL Dataset

Video List: [https://github.com/google-research/google-research/blob/master/youtube_asl/README.md](https://github.com/google-research/google-research/blob/master/youtube_asl/README.md)

Paper: ["YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus" (Uthus et al., 2023)](https://arxiv.org/abs/2306.15162)

If you use YouTube-ASL, please cite their associated paper:

```
@misc{uthus2023youtubeasl,
  author = {Uthus, David and Tanzer, Garrett and Georg, Manfred},
  title = {YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus},
  year = {2023},
  eprint = {2306.15162},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2306.15162},
}
```

### How2Sign Dataset

Dataset: [https://how2sign.github.io/](https://how2sign.github.io/)

Paper: ["How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language" (Duarte et al., CVPR 2021)](https://openaccess.thecvf.com/content/CVPR2021/html/Duarte_How2Sign_A_Large-Scale_Multimodal_Dataset_for_Continuous_American_Sign_Language_CVPR_2021_paper.html)

If you use How2Sign, please cite their associated paper:

```
@inproceedings{Duarte_CVPR2021,
  title={{How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language}},
  author={Duarte, Amanda and Palaskar, Shruti and Ventura, Lucas and Ghadiyaram, Deepti and DeHaan, Kenneth and Metze, Florian and Torres, Jordi and Giro-i-Nieto, Xavier},
  booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}
```