One command to extract all your Claude Code conversations for training datasets.
uvx claude-collector

That's it! The tool will:
- ✅ Auto-find your Claude Code data (~/.claude/projects)
- ✅ Extract all conversations
- ✅ Sanitize PII (emails, API keys, paths)
- ✅ Count total tokens
- ✅ Save as training-ready JSONL
uv tool install claude-collector
claude-collector

Scans your Claude Code session files and:
- Finds all conversation data in ~/.claude/projects
- Extracts user/assistant message pairs
- Sanitizes sensitive information:
  - Emails → [EMAIL]
  - API keys → [API_KEY]
  - File paths → /Users/[USER]/...
  - IP addresses → [IP]
  - OAuth tokens → [REDACTED]
- Counts actual token usage
- Saves as clean JSONL dataset
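The placeholder substitutions listed above can be pictured as a series of regex replacements. The sketch below is illustrative only: the patterns and the `sanitize` helper are assumptions made for this example, not the rules claude-collector actually ships, and they cover far fewer cases.

```python
import re

# Hypothetical patterns for illustration; the tool's real rules are not shown here.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"), "[API_KEY]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"/Users/[^/\s]+"), "/Users/[USER]"),
]


def sanitize(text: str) -> str:
    """Apply each pattern in order, replacing matches with its placeholder."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

For example, `sanitize("see /Users/bob/code")` leaves the rest of the path intact and replaces only the username segment.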
🤖 Claude Collector v0.1.0
Extract & sanitize Claude Code conversations
✓ Found Claude data: /Users/z/.claude/projects
📂 Processing 1394 files...
╭─────────────────────┬──────────────╮
│ Metric │ Value │
├─────────────────────┼──────────────┤
│ Files scanned │ 1,394 │
│ Files with data │ 1,273 │
│ Total messages │ 46,029 │
│ Training examples │ 3,653 │
│ Total tokens │ 4.04B │
╰─────────────────────┴──────────────╯
✅ Dataset saved!
File: claude_dataset_20251113.jsonl
Size: 19.13 MB
Examples: 3,653
🎉 Ready for training!
# Dry run (see stats without saving)
uvx claude-collector --dry-run
# Custom output location
uvx claude-collector --output ~/my-dataset.jsonl
# Specify input directory
uvx claude-collector --input ~/.config/claude/projects
# Filter by minimum tokens
uvx claude-collector --min-tokens 1000
# Skip sanitization (NOT recommended for sharing!)
uvx claude-collector --no-sanitize

uvx claude-collector --output training-data.jsonl

uvx claude-collector --dry-run
Shows total tokens without saving.
On each computer:
# Machine 1
uvx claude-collector --output machine1-data.jsonl
# Machine 2
uvx claude-collector --output machine2-data.jsonl
# Combine later
cat machine1-data.jsonl machine2-data.jsonl > combined-dataset.jsonl

uvx claude-collector --output new-sessions.jsonl
cat existing-dataset.jsonl new-sessions.jsonl > updated-dataset.jsonl

Each line is a JSON object:
{
"messages": [
{"role": "user", "content": "How do I..."},
{"role": "assistant", "content": "You can..."}
],
"metadata": {
"timestamp": "2025-11-13T...",
"tokens": {
"input_tokens": 100,
"output_tokens": 200,
"cache_creation_input_tokens": 5000,
"cache_read_input_tokens": 1000
}
}
}

Perfect for:
- Fine-tuning LLMs
- Training coding assistants
- Building instruction datasets
- Analysis and research
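Because every line is a standalone JSON object, exports from several machines or runs can be merged line by line. Plain `cat` keeps any record that appears in more than one export, so a small script may be preferable when extractions overlap. The following is a minimal sketch; `merge_jsonl` is a hypothetical helper, not part of claude-collector.

```python
import json


def merge_jsonl(inputs, output):
    """Concatenate JSONL files, skipping blank lines and exact duplicate records.

    Returns the number of records written.
    """
    seen = set()
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    # Canonicalize so key order does not defeat the duplicate check.
                    key = json.dumps(json.loads(line), sort_keys=True)
                    if key in seen:
                        continue
                    seen.add(key)
                    out.write(line + "\n")
                    written += 1
    return written
```

A call such as `merge_jsonl(["machine1-data.jsonl", "machine2-data.jsonl"], "combined-dataset.jsonl")` would produce the combined file with repeated records written once.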
~/.claude/projects/ # Primary
~/.config/claude/projects/ # Alternative

ls -la /Users/*/.claude/projects # macOS
ls -la /home/*/.claude/projects # Linux

find ~ -name "*.jsonl" -path "*/.claude/*" 2>/dev/null

The tool comprehensively sanitizes:
🔐 API Keys & Tokens:
- ✅ OpenAI API keys (sk-*)
- ✅ Anthropic API keys (sk-ant-*)
- ✅ GitHub tokens (ghp_*, gho_*, ghs_*, 40-char hex)
- ✅ HuggingFace tokens (hf_*)
- ✅ Slack tokens (xoxb-*, xoxp-*, xoxe-*)
- ✅ AWS credentials (AKIA*, secrets)
- ✅ JWT tokens
- ✅ Google API keys & OAuth
- ✅ SendGrid API keys
- ✅ Stripe keys (live, test, restricted)
- ✅ Square tokens
- ✅ Facebook/Twitter tokens
💰 Financial & Crypto:
- ✅ Credit card numbers
- ✅ Crypto seed phrases (BIP39)
- ✅ Ethereum/Bitcoin addresses
- ✅ Crypto private keys (64 hex)
🔑 Authentication:
- ✅ Private keys (PEM format)
- ✅ Database URLs with passwords
- ✅ Generic password patterns
- ✅ OAuth credentials
👤 Personal Information:
- ✅ Email addresses
- ✅ Social Security Numbers (US)
- ✅ Phone numbers
- ✅ IP addresses (last octet redacted)
- ✅ File paths (username redacted)
Still check before sharing:
- Project names (if sensitive)
- Company-specific terminology
- Proprietary code patterns
For maximum privacy, review the output file before uploading anywhere.
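Part of that review can be automated by scanning the saved dataset for strings that still look like secrets. The sketch below is one possible spot check, assuming the `messages` layout shown earlier; the patterns and the `find_leftovers` helper are hypothetical and far narrower than the list above, so an empty result does not prove the file is clean.

```python
import json
import re

# Hypothetical checklist: a few token prefixes from the list above plus emails.
SUSPICIOUS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "sk- key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "GitHub token": re.compile(r"\bgh[pos]_[A-Za-z0-9]{20,}"),
    "AWS access key": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
}


def find_leftovers(lines):
    """Yield (line_number, label) for each message that still matches a pattern."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        for message in record.get("messages", []):
            content = str(message.get("content", ""))
            for label, pattern in SUSPICIOUS.items():
                if pattern.search(content):
                    yield number, label
```

Passing an open file handle works because the function only iterates over lines, for example `list(find_leftovers(open("claude_dataset.jsonl", encoding="utf-8")))`.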
- Python 3.8+
- Claude Code installed (for data to exist)
uvx claude-collector

uv tool install claude-collector
claude-collector

pip install claude-collector
claude-collector

git clone https://github.com/hanzoai/claude-collector
cd claude-collector
uv pip install -e .
claude-collector

"No Claude Code data found"
- Make sure Claude Code is installed
- Check you've had at least one session
- Try specifying the path: --input ~/.claude/projects
"Only found a few conversations"
- This is normal if you're new to Claude Code
- Each session creates one file
- More usage = more data
"Tokens show 0"
- Some messages don't have usage tracking
- This is normal for system messages
- Real conversations will have token counts
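To see how many examples in a saved dataset lack usage data, the token fields can be summed directly from the JSONL. This is a rough sketch that assumes the `metadata.tokens` layout shown earlier; `summarize_tokens` is a hypothetical helper, and its totals may not match the number the tool itself reports.

```python
import json

TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def summarize_tokens(lines):
    """Return (per-field totals, number of examples with no token usage)."""
    totals = dict.fromkeys(TOKEN_FIELDS, 0)
    without_usage = 0
    for line in lines:
        if not line.strip():
            continue
        # Treat a missing or null "tokens" entry as an empty mapping.
        tokens = json.loads(line).get("metadata", {}).get("tokens") or {}
        if not any(tokens.get(field) for field in TOKEN_FIELDS):
            without_usage += 1
        for field in TOKEN_FIELDS:
            totals[field] += tokens.get(field) or 0
    return totals, without_usage
```

The function accepts any iterable of lines, so an open file handle for the dataset can be passed in as is.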
import json

# Read the dataset, one JSON object per line
with open('claude_dataset.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        example = json.loads(line)
        # Access messages
        user_msg = example['messages'][0]['content']
        assistant_msg = example['messages'][1]['content']
        # Access metadata
        tokens = example['metadata']['tokens']
        timestamp = example['metadata']['timestamp']
        # Your custom processing here

MIT - Free to use for any purpose
Built by Hanzo AI for the AI development community.
Found a bug? Open an issue: https://github.com/hanzoai/claude-collector/issues
Want to contribute? PRs welcome!