
Commit e2d5ac3

2 parents 500e637 + 4f8bec2 commit e2d5ac3

File tree

1 file changed: +54 -39 lines changed

README.md

Lines changed: 54 additions & 39 deletions
@@ -1,61 +1,76 @@
# ASL Translation Data Preprocessing

- This repository provides a comprehensive solution for preprocessing American Sign Language (ASL) datasets, specifically designed to handle both How2Sign and YouTube ASL datasets. Our preprocessing pipeline streamlines the workflow from video acquisition to landmark extraction, making the data ready for ASL translation tasks.
+ This repository follows the methodology described in the YouTube-ASL Dataset paper and provides a comprehensive solution for preprocessing American Sign Language (ASL) datasets, specifically designed to handle both **How2Sign** and **YouTube-ASL** datasets. Our preprocessing pipeline streamlines the workflow from video acquisition to landmark extraction, making the data ready for ASL translation tasks.

## Project Configuration

- All project settings are centrally managed through `conf.py`, offering flexible configuration options for video processing, dataset management, and feature extraction. The configuration file controls several key aspects:
+ All project settings are centrally managed through `conf.py`, providing a single point of configuration for the entire preprocessing pipeline. Key configuration elements include:

- The system allows customization of video processing parameters, including frame skip rates and maximum frame limits, to optimize processing efficiency while maintaining data quality. It manages dataset paths and directories, ensuring organized data storage and retrieval. The configuration also specifies MediaPipe landmark indices for detailed capture of pose, face, and hand movements, essential for ASL translation. Additionally, it includes language preference settings for YouTube transcript collection, supporting various English language variants.
+ - `ID`: Text file containing YouTube video IDs to process
+ - `VIDEO_DIR`: Directory for downloaded videos
+ - `TRANSCRIPT_DIR`: Storage for JSON transcripts
+ - `OUTPUT_DIR`: Location for extracted features
+ - `CSV_FILE`: Path for processed segment data

- ## YouTube ASL Dataset Processing
+ - `YT_CONFIG`: YouTube download settings (video quality, format, rate limits)
+ - `LANGUAGE`: Supported language options for transcript retrieval
+ - `FRAME_SKIP`: Controls frame sampling rate for efficient processing
+ - `MAX_WORKERS`: Manages parallel processing to optimize performance

- The processing of YouTube ASL dataset follows a systematic three-step approach, ensuring comprehensive data preparation:
+ - `POSE_IDX`, `FACE_IDX`, `HAND_IDX`: Selected landmark indices for extracting the most relevant points for sign language analysis

- ### Step 1: Data Acquisition
+ This centralized approach allows easy adaptation to different hardware capabilities or dataset requirements without modifying the core processing code.
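
To make these settings concrete, here is a minimal, hypothetical sketch of what the constants in `conf.py` could look like. The constant names follow the list above, but every value (paths, limits, index lists) is illustrative rather than copied from the repository.

```python
# conf.py -- illustrative example only; the actual values in this repository may differ.

ID = "video_ids.txt"              # text file with one YouTube video ID per line
VIDEO_DIR = "videos/"             # downloaded .mp4 files
TRANSCRIPT_DIR = "transcripts/"   # raw transcripts stored as JSON
OUTPUT_DIR = "features/"          # extracted landmark arrays (.npy)
CSV_FILE = "segments.csv"         # processed segment metadata

# yt-dlp settings: format/quality cap, output template, and rate limiting.
YT_CONFIG = {
    "format": "mp4[height<=720]",
    "outtmpl": VIDEO_DIR + "%(id)s.%(ext)s",
    "ratelimit": 5_000_000,       # bytes per second
    "quiet": True,
}

LANGUAGE = ["en", "en-US", "en-GB"]   # accepted transcript language codes

FRAME_SKIP = 2      # sample every 2nd frame during landmark extraction
MAX_WORKERS = 4     # parallel workers for downloading / feature extraction

# MediaPipe landmark indices to keep (illustrative subsets).
POSE_IDX = [0, 11, 12, 13, 14, 15, 16, 23, 24]
FACE_IDX = [0, 4, 13, 14, 61, 291]
HAND_IDX = list(range(21))            # all 21 hand landmarks
```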

+ ## How To Use?
+ - **YouTube-ASL**: make sure the constants in `conf.py` are correct, then run Step 1 through Step 3.
+ - **How2Sign**: download the **Green Screen RGB videos** and the **English Translation (manually re-aligned)** annotations from the How2Sign website, place the video directory and the .csv file at the expected paths (or amend the paths in `conf.py`), then run Step 3 only.
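
For the YouTube-ASL path, running the whole pipeline amounts to executing the three step scripts in order. A minimal, hypothetical driver is sketched here; the script names follow the step headings below, and how you actually invoke them is up to your environment:

```python
# run_pipeline.py -- illustrative driver for the YouTube-ASL workflow (Steps 1-3).
import subprocess

STEPS = [
    "s1_data_downloader.py",        # download videos and transcripts
    "s2_transcript_preprocess.py",  # clean and segment the transcripts
    "s3_mediapipe_labelling.py",    # extract MediaPipe landmarks
]

for script in STEPS:
    subprocess.run(["python", script], check=True)
```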

+ ### Step 1: Data Acquisition (s1_data_downloader.py)
+ **Necessary Constants:** `ID`, `VIDEO_DIR`, `TRANSCRIPT_DIR`, `YT_CONFIG`, `LANGUAGE`
+ The script intelligently skips already downloaded content and implements rate limiting to prevent API throttling.
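
A rough sketch of what this acquisition step might look like, assuming the yt-dlp Python API and the classic `YouTubeTranscriptApi.get_transcript` interface; the control flow and file naming are assumptions, not the repository's exact code:

```python
# Illustrative sketch of Step 1: download each listed video and its transcript.
import json
import os
import time

from yt_dlp import YoutubeDL
from youtube_transcript_api import YouTubeTranscriptApi

from conf import ID, VIDEO_DIR, TRANSCRIPT_DIR, YT_CONFIG, LANGUAGE

with open(ID) as f:
    video_ids = [line.strip() for line in f if line.strip()]

os.makedirs(VIDEO_DIR, exist_ok=True)
os.makedirs(TRANSCRIPT_DIR, exist_ok=True)

for vid in video_ids:
    video_path = os.path.join(VIDEO_DIR, f"{vid}.mp4")
    transcript_path = os.path.join(TRANSCRIPT_DIR, f"{vid}.json")

    # Skip content that was already downloaded on a previous run.
    if not os.path.exists(video_path):
        with YoutubeDL(YT_CONFIG) as ydl:
            ydl.download([f"https://www.youtube.com/watch?v={vid}"])

    if not os.path.exists(transcript_path):
        try:
            transcript = YouTubeTranscriptApi.get_transcript(vid, languages=LANGUAGE)
            with open(transcript_path, "w") as out:
                json.dump(transcript, out)
        except Exception as err:  # missing transcript, disabled captions, etc.
            print(f"Transcript failed for {vid}: {err}")

    time.sleep(1)  # simple rate limiting between videos
```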

- This initial phase combines two parallel processes:

- The video downloader (`s1_video_download.py`) efficiently retrieves YouTube videos using yt-dlp, implementing smart rate limiting and quality control measures. It includes features for parallel fragment downloads and automatically skips previously downloaded content to prevent redundant processing.

- Simultaneously, the transcript collector (`s1_transcript_downloader.py`) obtains video transcripts through the YouTube Transcript API. This component handles multiple English language variants and saves the transcripts in a structured JSON format, while maintaining appropriate rate limits to ensure reliable data collection.
+ ### Step 2: Transcript Processing (s2_transcript_preprocess.py)
+ **Necessary Constants:** `ID`, `TRANSCRIPT_DIR`, `CSV_FILE`
+ This step cleans text (converts Unicode characters, removes brackets), filters segments based on length and duration, and saves them with precise timestamps as tab-separated values.
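
A minimal sketch of the kind of cleaning and filtering this step performs; the thresholds, output columns, and helper names are illustrative assumptions rather than the repository's exact choices:

```python
# Illustrative sketch of Step 2: clean transcript segments and write them as TSV rows.
import csv
import glob
import json
import os
import re
import unicodedata

from conf import TRANSCRIPT_DIR, CSV_FILE

def clean_text(text: str) -> str:
    # Fold Unicode to ASCII and drop bracketed annotations such as [Music].
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\[.*?\]|\(.*?\)", "", text)
    return re.sub(r"\s+", " ", text).strip()

MIN_CHARS, MAX_CHARS = 3, 300    # illustrative length filter (characters)
MIN_DUR, MAX_DUR = 0.5, 60.0     # illustrative duration filter (seconds)

with open(CSV_FILE, "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["video_id", "start", "end", "text"])
    for path in glob.glob(os.path.join(TRANSCRIPT_DIR, "*.json")):
        video_id = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            segments = json.load(f)  # [{"text", "start", "duration"}, ...]
        for seg in segments:
            text = clean_text(seg["text"])
            duration = seg["duration"]
            if MIN_CHARS <= len(text) <= MAX_CHARS and MIN_DUR <= duration <= MAX_DUR:
                writer.writerow([video_id, seg["start"], seg["start"] + duration, text])
```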

- ### Step 2: Transcript Enhancement

- The transcript processor (`s2_transcript_preprocess.py`) refines the raw transcript data into a format suitable for ASL translation. It performs sophisticated text normalization, including Unicode handling and ASCII conversion, while preserving semantic meaning. The system segments videos into overlapping chunks with precise timing information, generating well-structured CSV files containing the processed segments.

- ### Step 3: Feature Extraction
+ ### Step 3: Feature Extraction (s3_mediapipe_labelling.py)
+ **Necessary Constants:** `CSV_FILE`, `VIDEO_DIR`, `OUTPUT_DIR`, `MAX_WORKERS`, `FRAME_SKIP`, `POSE_IDX`, `FACE_IDX`, `HAND_IDX`
+ The script processes each video segment according to its timestamp, extracting only the most relevant body keypoints for sign language analysis. Results are saved as NumPy arrays.
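
A condensed sketch of the per-segment landmark extraction, assuming the MediaPipe `solutions.holistic` API and OpenCV. The output layout and helper structure are illustrative, and the parallelization across `MAX_WORKERS` processes is omitted for brevity:

```python
# Illustrative sketch of Step 3: extract selected holistic landmarks per segment.
import csv
import os

import cv2
import mediapipe as mp
import numpy as np

from conf import CSV_FILE, VIDEO_DIR, OUTPUT_DIR, FRAME_SKIP, POSE_IDX, FACE_IDX, HAND_IDX

mp_holistic = mp.solutions.holistic

def pick(landmark_list, indices):
    """Return a (len(indices), 3) array of x/y/z values, zeros if nothing was detected."""
    if landmark_list is None:
        return np.zeros((len(indices), 3), dtype=np.float32)
    pts = landmark_list.landmark
    return np.array([[pts[i].x, pts[i].y, pts[i].z] for i in indices], dtype=np.float32)

def extract_segment(video_id, start, end, out_name):
    cap = cv2.VideoCapture(os.path.join(VIDEO_DIR, f"{video_id}.mp4"))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(start * fps))
    frames = []
    with mp_holistic.Holistic(static_image_mode=False) as holistic:
        for frame_no in range(int((end - start) * fps)):
            ok, frame = cap.read()
            if not ok:
                break
            if frame_no % FRAME_SKIP:  # frame sampling for efficiency
                continue
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(np.concatenate([
                pick(results.pose_landmarks, POSE_IDX),
                pick(results.face_landmarks, FACE_IDX),
                pick(results.left_hand_landmarks, HAND_IDX),
                pick(results.right_hand_landmarks, HAND_IDX),
            ]))
    cap.release()
    if frames:
        np.save(os.path.join(OUTPUT_DIR, out_name), np.stack(frames))

# Drive extraction from the tab-separated segment table produced in Step 2.
os.makedirs(OUTPUT_DIR, exist_ok=True)
with open(CSV_FILE) as f:
    for i, row in enumerate(csv.DictReader(f, delimiter="\t")):
        extract_segment(row["video_id"], float(row["start"]), float(row["end"]), f"segment_{i}.npy")
```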

- The landmark extraction system (`s3_mediapipe_labelling.py`) utilizes the MediaPipe Holistic model to capture detailed movement data. It processes video segments to extract comprehensive pose, face, and hand landmarks, leveraging parallel processing capabilities for efficient computation. The extracted features are stored as numpy arrays for subsequent analysis and translation tasks.
+ ## Dataset Introduction

- ## How2Sign Dataset Processing
+ ### YouTube-ASL Dataset
+ Video List: [https://github.com/google-research/google-research/blob/master/youtube_asl/README.md](https://github.com/google-research/google-research/blob/master/youtube_asl/README.md)
+ Paper: ["YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus" (Uthus et al., 2023)](https://arxiv.org/abs/2306.15162).

- For the How2Sign dataset, our system offers two specialized approaches for MediaPipe landmark extraction:
+ If you use YouTube-ASL, please cite their associated paper:

- ### Clip-Based Processing
+ ```
+ @misc{uthus2023youtubeasl,
+   author = {Uthus, David and Tanzer, Garrett and Georg, Manfred},
+   title = {YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus},
+   year = {2023},
+   eprint = {2306.15162},
+   archivePrefix = {arXiv},
+   url = {https://arxiv.org/abs/2306.15162},
+ }
+ ```

- The clip processor (`H2S_clip_mediapipe.py`) handles complete video clips in a single pass. It employs adaptive frame skipping to optimize processing speed while maintaining data quality. The system leverages parallel processing capabilities to handle multiple clips simultaneously, ensuring efficient resource utilization.
+ ### How2Sign Dataset
+ Dataset: [https://how2sign.github.io/](https://how2sign.github.io/)
+ Paper: [How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language](https://openaccess.thecvf.com/content/CVPR2021/html/Duarte_How2Sign_A_Large-Scale_Multimodal_Dataset_for_Continuous_American_Sign_Language_CVPR_2021_paper.html)

- ### Raw Video Processing
+ If you use How2Sign, please cite their associated paper:

- The raw video processor (`H2S_raw_mediapipe.py`) takes a more granular approach, working with precise realigned timestamps from a CSV file. This method extracts landmarks for specific video segments, maintaining temporal accuracy while utilizing parallel processing for optimal performance.

- ## Data Organization

- The system organizes processed data into clearly defined formats:
- - Video content is stored as MP4 files for optimal quality and compatibility
- - Transcripts are maintained in JSON format for easy parsing and manipulation
- - Segment information is organized in CSV files for straightforward analysis
- - Extracted landmarks are preserved as NumPy arrays (.npy files) for efficient processing

- ## Technical Requirements

- The system relies on several key Python libraries:
- - OpenCV (cv2) for video processing
- - MediaPipe for pose and gesture recognition
- - NumPy for efficient numerical operations
- - Pandas for data manipulation
- - yt-dlp for video downloading
- - youtube-transcript-api for transcript retrieval

- This preprocessing pipeline creates a robust foundation for ASL translation tasks, ensuring high-quality data preparation while maintaining processing efficiency.
+ ```
+ @inproceedings{Duarte_CVPR2021,
+   title = {{How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language}},
+   author = {Duarte, Amanda and Palaskar, Shruti and Ventura, Lucas and Ghadiyaram, Deepti and DeHaan, Kenneth and Metze, Florian and Torres, Jordi and Giro-i-Nieto, Xavier},
+   booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
+   year = {2021}
+ }
+ ```
