A subtitle generator for Japanese Adult Videos.
Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.
JAV audio is defined by "acoustic hell" and a low Signal-to-Noise Ratio (SNR), characterized by:
- Non-Verbal Vocalisations (NVVs): A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
- Spectral Mimicry: These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., fu), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
- Extreme Dynamics: Volatile shifts in audio intensity, ranging from faint whispers (sasayaki) to high-decibel screams, which confuse standard gain control and attention mechanisms.
- Linguistic Variance: The prevalence of theatrical onomatopoeia and Role Language (Yakuwarigo) containing exaggerated intonations and slang absent from standard corpora.
While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes contextual drift and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive hallucination loops where the model generates unrelated text to fill the acoustic void.
Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific log-Mel spectrogram features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering.
Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of overfitting. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent "hit or miss" quality outputs.
WhisperJAV is an attempt to address the failure points above. Its inference pipelines apply:
- Acoustic Filtering: Deploys scene-based segmentation and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
- Linguistic Adaptation: Normalizes domain-specific terminology and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in Kansai-ben) that standard BPE tokenizers fail to parse [4, 5].
- Defensive Decoding: Tunes log-probability thresholding and `no_speech_threshold` to systematically discard low-confidence outputs (hallucinations), while using regex filters to clean non-lexical markers (e.g., `(moans)`) from the final subtitle track [6, 7].
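The defensive decoding idea can be illustrated as a post-filter over transcription segments. This is a minimal sketch, not WhisperJAV's implementation: it assumes segment dictionaries shaped like the ones openai-whisper returns (`text`, `avg_logprob`, `no_speech_prob`), and the threshold values and the marker regex are placeholders chosen for illustration.

```python
import re

# Placeholder thresholds for illustration; not WhisperJAV's tuned values.
LOGPROB_THRESHOLD = -1.0
NO_SPEECH_THRESHOLD = 0.6

# Parenthesised non-lexical markers such as "(moans)", in ASCII or
# full-width parentheses. The exact pattern is an assumption.
MARKER_RE = re.compile(r"[((][^()()]*[))]")


def clean_text(text):
    """Strip parenthesised markers and tidy the leftover whitespace."""
    return re.sub(r"\s+", " ", MARKER_RE.sub("", text)).strip()


def filter_segments(segments,
                    logprob_threshold=LOGPROB_THRESHOLD,
                    no_speech_threshold=NO_SPEECH_THRESHOLD):
    """Drop low-confidence segments, then clean the text of the rest."""
    kept = []
    for seg in segments:
        if seg["avg_logprob"] < logprob_threshold:
            continue  # the decoder itself was unsure about this text
        if seg["no_speech_prob"] > no_speech_threshold:
            continue  # the model thinks this window is probably not speech
        text = clean_text(seg["text"])
        if text:  # skip segments that were nothing but markers
            kept.append({**seg, "text": text})
    return kept
```

Raising `logprob_threshold` or lowering `no_speech_threshold` makes the filter stricter, which trades missed quiet dialogue for fewer hallucinated lines.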
Run `whisperjav-gui`. A window opens. Add your files, pick a mode, click Start.
```bash
# Basic usage
whisperjav video.mp4

# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive

# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
```

| Mode | Backend | Scene Detection | VAD | Best For |
|---|---|---|---|---|
| faster | stable-ts (turbo) | No | No | Speed priority, clean audio |
| fast | stable-ts | Yes | No | General use, mixed quality |
| balanced | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy |
| fidelity | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower |
| transformers | HuggingFace | Optional | Internal | Japanese-optimized model, customizable |
- Conservative: Higher thresholds, fewer hallucinations. Good for noisy content.
- Balanced: Default. Works for most content.
- Aggressive: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.
Transformers mode uses HuggingFace's `kotoba-tech/kotoba-whisper-v2.2` model, which is optimized for Japanese conversational speech:

```bash
whisperjav video.mp4 --mode transformers

# Customize parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
```

Transformers-specific options:
- `--hf-model-id`: Model (default: `kotoba-tech/kotoba-whisper-v2.2`)
- `--hf-chunk-length`: Seconds per chunk (default: 15)
- `--hf-beam-size`: Beam search width (default: 5)
- `--hf-temperature`: Sampling temperature (default: 0.0)
- `--hf-scene`: Scene detection method (`none`, `auditok`, `silero`)
Two-pass ensemble mode runs your video through two different pipelines and merges the results. Different models catch different things.

```bash
# Pass 1 with transformers, Pass 2 with balanced
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced

# Custom sensitivity per pass
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass1-sensitivity aggressive --pass2-pipeline fidelity
```

Merge strategies:
- `smart_merge` (default): Intelligent overlap detection
- `pass1_primary` / `pass2_primary`: Prioritize one pass, fill gaps from the other
- `full_merge`: Combine everything from both passes
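To make the "prioritize one pass, fill gaps from the other" idea concrete, here is a minimal sketch of a primary-pass merge. It is an illustration under assumptions, not WhisperJAV's merge code: the `Sub` type, the `merge_primary` function, and the rule that any time overlap counts as a conflict are all hypothetical.

```python
from typing import List, NamedTuple


class Sub(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


def overlaps(a: Sub, b: Sub) -> bool:
    """True if the two subtitles share any stretch of time."""
    return a.start < b.end and b.start < a.end


def merge_primary(primary: List[Sub], secondary: List[Sub]) -> List[Sub]:
    """Keep every primary subtitle; add secondary ones only in the gaps."""
    merged = list(primary)
    for candidate in secondary:
        # A secondary line is used only where the primary pass was silent.
        if not any(overlaps(candidate, kept) for kept in primary):
            merged.append(candidate)
    return sorted(merged, key=lambda s: s.start)
```

A smarter strategy would also compare the text of overlapping lines rather than always trusting one pass, which is presumably where `smart_merge` differs.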
The GUI has three tabs:
- Transcription Mode: Select pipeline, sensitivity, language
- Advanced Options: Model override, scene detection method, debug settings
- Two-Pass Ensemble: Configure both passes with full parameter customization via JSON editor
The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files.
Generate subtitles and translate them in one step:
```bash
# Generate and translate
whisperjav video.mp4 --translate

# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek
```

Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, and OpenRouter.
Scene detection splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting sentences off mid-word.
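As a rough illustration of splitting at natural breaks, the sketch below groups speech intervals into scenes and cuts only where the pause between them is long enough. The function name `group_into_scenes` and the 1.5-second `min_gap` are assumptions for the example; the real pipeline's detection method and parameters may differ.

```python
def group_into_scenes(speech, min_gap=1.5):
    """Group (start, end) speech intervals, in seconds, into scenes.

    A new scene starts only at a pause of at least min_gap seconds, so a
    scene boundary never falls in the middle of continuous speech.
    """
    scenes = []
    for start, end in sorted(speech):
        if scenes and start - scenes[-1][1] < min_gap:
            # Pause too short to be a natural break: extend the current scene.
            scenes[-1][1] = max(scenes[-1][1], end)
        else:
            scenes.append([start, end])
    return [tuple(scene) for scene in scenes]
```

With fixed-length chunking, a boundary lands wherever the clock says; here it can only land inside a pause.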
Voice activity detection (VAD) identifies when someone is actually speaking, as opposed to background noise or music. This reduces false transcriptions during quiet moments.
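The simplest form of voice activity detection is an energy threshold over short frames, sketched below. This is a toy stand-in: an energy threshold cannot tell speech from music, which is why model-based detectors such as Silero (listed for the fidelity mode) are used in practice. The frame length and threshold are assumed values.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_frames(samples, frame_len=160, threshold=0.02):
    """Mark each frame True when its RMS energy exceeds the threshold.

    samples are floats in [-1.0, 1.0]; 160 samples is 10 ms at 16 kHz.
    """
    return [rms > threshold for rms in frame_rms(samples, frame_len)]
```

Frames marked `False` can be skipped before transcription, so the model never sees the quiet stretches that tend to trigger hallucinated text.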
- Handles sentence-ending particles (ね, よ, わ, の)
- Preserves aizuchi (うん, はい, ええ)
- Recognizes dialect patterns (Kansai-ben, feminine/masculine speech)
- Filters out common Whisper hallucinations
Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns.
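A minimal sketch of this kind of repetition cleanup is shown below. The two heuristics (collapsing a phrase repeated three or more times inside one line, and capping runs of identical consecutive lines) and their limits are assumptions for illustration, not the patterns WhisperJAV actually ships.

```python
import re

# A phrase of two or more characters repeated at least three times in a row.
LOOP_RE = re.compile(r"(.{2,}?)\1{2,}")


def collapse_phrase_loops(text):
    """Reduce 'xyzxyzxyz...' inside a single line to one 'xyz'."""
    return LOOP_RE.sub(r"\1", text)


def drop_repeated_lines(lines, max_run=2):
    """Keep at most max_run consecutive copies of the same subtitle text."""
    kept = []
    run_length = 0
    for text in lines:
        run_length = run_length + 1 if kept and text == kept[-1] else 1
        if run_length <= max_run:
            kept.append(text)
    return kept
```

Note that the phrase rule would also shorten a genuinely spoken "はいはいはい" to "はい", so a real filter needs exceptions for natural repetition.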
| Content Type | Mode | Sensitivity | Notes |
|---|---|---|---|
| Drama / Dialogue Heavy | balanced | aggressive | Or try transformers mode |
| Group Scenes | faster | conservative | Speed matters, less precision needed |
| Amateur / Homemade | fast | conservative | Variable audio quality |
| ASMR / VR / Whisper | fidelity | aggressive | Maximum accuracy for quiet speech |
| Heavy Background Music | balanced | conservative | VAD helps filter music |
| Maximum Accuracy | ensemble | varies | Two-pass with different pipelines |
Download and run: WhisperJAV-1.7.1-Windows-x86_64.exe
This installs everything you need including Python and dependencies.
If you installed v1.5.x or v1.6.x via the Windows installer:
- Download upgrade_whisperjav.bat
- Double-click to run
- Wait 1-2 minutes
This updates WhisperJAV without re-downloading PyTorch (~2.5GB) or your AI models (~3GB).
Requires Python 3.9-3.12, FFmpeg, and Git.
```bash
# Install PyTorch with GPU support first (NVIDIA example)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# Then install WhisperJAV
pip install git+https://github.com/meizhong986/whisperjav.git@main
```

Platform Notes:
- Apple Silicon (M1/M2/M3/M4): Just `pip install torch torchvision torchaudio`. MPS acceleration works automatically.
- AMD GPU (ROCm): Experimental. Use `--mode balanced` for best compatibility.
- CPU only: Works but slow. Use `--accept-cpu-mode` to skip the GPU warning.
- Python 3.9-3.12 (3.13+ not compatible with openai-whisper)
- FFmpeg in your system PATH
- GPU recommended: NVIDIA CUDA, Apple MPS, or AMD ROCm
- 8GB+ disk space for installation
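Before installing, the requirements above can be checked with a few lines of Python. This is a convenience sketch rather than part of WhisperJAV; the function names are made up for the example, and the PyTorch check is skipped if `torch` is not installed yet.

```python
import shutil
import sys


def python_supported(version=sys.version_info):
    """True for Python 3.9 through 3.12, the range listed above."""
    return (3, 9) <= tuple(version[:2]) <= (3, 12)


def ffmpeg_on_path():
    """True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def torch_device():
    """Best available PyTorch device name, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():  # ROCm builds of PyTorch report here too
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("Python version OK:", python_supported())
    print("FFmpeg on PATH:", ffmpeg_on_path())
    print("PyTorch device:", torch_device())
```

A result of `cpu` from `torch_device()` on a machine with a GPU usually means a CPU-only PyTorch build is installed.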
Detailed Windows Prerequisites
- Install latest NVIDIA drivers
- Install CUDA Toolkit matching your driver version
- Install cuDNN matching your CUDA version
- Download from gyan.dev/ffmpeg/builds
- Extract to `C:\ffmpeg`
- Add `C:\ffmpeg\bin` to your PATH
Download from python.org. Check "Add Python to PATH" during installation.
```bash
# Basic usage
whisperjav video.mp4
whisperjav video.mp4 --mode balanced --sensitivity aggressive

# All modes: faster, fast, balanced, fidelity, transformers
whisperjav video.mp4 --mode fidelity

# Transformers mode with custom parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20

# Two-pass ensemble
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass2-pipeline fidelity --merge-strategy smart_merge

# Output options
whisperjav video.mp4 --output-dir ./subtitles
whisperjav video.mp4 --subs-language english-direct

# Debugging
whisperjav video.mp4 --debug --keep-temp

# Translation
whisperjav video.mp4 --translate --translate-provider deepseek
whisperjav-translate -i subtitles.srt --provider gemini
```

Run `whisperjav --help` for all options.
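The CLI already accepts a folder, but when each file needs its own settings or one failure should not stop the rest, the command can be driven from Python. The sketch below only uses flags shown above (`--mode`, `--sensitivity`, `--output-dir`); the helper names, the extension list, and `balanced` as a sensitivity value are assumptions.

```python
import subprocess
from pathlib import Path

# Assumed set of media extensions; adjust to your library.
MEDIA_EXTENSIONS = {".mp4", ".mkv", ".mp3", ".wav"}


def build_command(media, mode="balanced", sensitivity="balanced",
                  output_dir="./subtitles"):
    """Build the whisperjav argument list for one media file."""
    return [
        "whisperjav", str(media),
        "--mode", mode,
        "--sensitivity", sensitivity,
        "--output-dir", str(output_dir),
    ]


def run_folder(folder, **options):
    """Run whisperjav on each media file; return the files that failed."""
    failed = []
    for media in sorted(Path(folder).iterdir()):
        if media.suffix.lower() not in MEDIA_EXTENSIONS:
            continue
        result = subprocess.run(build_command(media, **options))
        if result.returncode != 0:
            failed.append(media)  # keep going; report failures at the end
    return failed
```

For example, `run_folder("/path/to/media_folder", mode="fidelity", sensitivity="aggressive")` processes every matching file and returns the ones that need a second look.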
FFmpeg not found: Install FFmpeg and add it to your PATH.
Slow processing / GPU warning: Your PyTorch might be CPU-only. Reinstall with GPU support:

```bash
pip uninstall torch torchvision torchaudio
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```

model.bin error in faster mode: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder:

```powershell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Systran--faster-whisper-large-v2"
```

Rough estimates for processing time per hour of video:
| Platform | Time |
|---|---|
| NVIDIA GPU (CUDA) | 5-10 minutes |
| Apple Silicon (MPS) | 8-15 minutes |
| AMD GPU (ROCm) | 10-20 minutes |
| CPU only | 30-60 minutes |
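Since the table is per hour of video, estimates for a full title scale linearly with its length. The small sketch below just does that arithmetic on the table's figures; the platform keys and the function name are made up for the example, and real times depend on mode, model, and hardware.

```python
# Minutes of processing per hour of video, copied from the table above.
MINUTES_PER_HOUR = {
    "cuda": (5, 10),
    "mps": (8, 15),
    "rocm": (10, 20),
    "cpu": (30, 60),
}


def estimate_minutes(platform, video_minutes):
    """Return the (low, high) processing estimate in minutes."""
    low, high = MINUTES_PER_HOUR[platform]
    hours = video_minutes / 60
    return low * hours, high * hours
```

A 120-minute video on an NVIDIA GPU therefore comes out at roughly 10 to 20 minutes, and on CPU at 60 to 120 minutes.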
Contributions welcome. See CONTRIBUTING.md for guidelines.
```bash
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
pip install -e .[dev]
python -m pytest tests/
```

MIT License. See LICENSE file.
- OpenAI Whisper - The underlying ASR model
- stable-ts - Timestamp refinement
- faster-whisper - Optimized CTranslate2 inference
- HuggingFace Transformers - Transformers pipeline backend
- Kotoba-Whisper - Japanese-optimized Whisper model
- The testing community for feedback and bug reports
This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.