🤖 Claude Collector

One command to extract all your Claude Code conversations for training datasets.

Quick Start

Install and run with uvx (recommended)

uvx claude-collector

That's it! The tool will:

✅ Auto-find your Claude Code data (~/.claude/projects)
✅ Extract all conversations
✅ Sanitize PII (emails, API keys, paths)
✅ Count total tokens
✅ Save as training-ready JSONL

Or install globally

uv tool install claude-collector
claude-collector

What It Does

Scans your Claude Code session files and:

- Finds all conversation data in ~/.claude/projects
- Extracts user/assistant message pairs
- Sanitizes sensitive information:
  - Emails → [EMAIL]
  - API keys → [API_KEY]
  - File paths → /Users/[USER]/...
  - IP addresses → [IP]
  - OAuth tokens → [REDACTED]
- Counts actual token usage
- Saves as clean JSONL dataset
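
To make the sanitization step concrete, here is a minimal sketch of placeholder substitution with regular expressions. The patterns and the `sanitize` helper are illustrative assumptions, not the tool's actual rules, which cover many more cases.

```python
import re

# Illustrative patterns only (assumptions); claude-collector's real rules may differ.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"), "[API_KEY]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    # Keep the /Users or /home prefix and replace only the username segment.
    (re.compile(r"/(Users|home)/[^/\s]+"), r"/\1/[USER]"),
]


def sanitize(text: str) -> str:
    """Apply each pattern in order and return the redacted text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(sanitize("Mail bob@example.com from 10.0.0.7, see /Users/bob/app/main.py"))
# Mail [EMAIL] from [IP], see /Users/[USER]/app/main.py
```

Pattern-based redaction only catches what the patterns describe, so a manual review of the output is still worthwhile.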

Example Output

🤖 Claude Collector v0.1.0
Extract & sanitize Claude Code conversations

✓ Found Claude data: /Users/z/.claude/projects

📂 Processing 1394 files...

╭─────────────────────┬──────────────╮
│ Metric              │ Value        │
├─────────────────────┼──────────────┤
│ Files scanned       │ 1,394        │
│ Files with data     │ 1,273        │
│ Total messages      │ 46,029       │
│ Training examples   │ 3,653        │
│ Total tokens        │ 4.04B        │
╰─────────────────────┴──────────────╯

✅ Dataset saved!
   File: claude_dataset_20251113.jsonl
   Size: 19.13 MB
   Examples: 3,653

🎉 Ready for training!

Options

# Dry run (see stats without saving)
uvx claude-collector --dry-run

# Custom output location
uvx claude-collector --output ~/my-dataset.jsonl

# Specify input directory
uvx claude-collector --input ~/.config/claude/projects

# Filter by minimum tokens
uvx claude-collector --min-tokens 1000

# Skip sanitization (NOT recommended for sharing!)
uvx claude-collector --no-sanitize

Use Cases

1. Create Training Dataset

uvx claude-collector --output training-data.jsonl

2. Audit Your Usage

uvx claude-collector --dry-run

Prints the summary statistics, including total tokens, without writing a dataset file.

3. Collect from Multiple Machines

On each computer:

# Machine 1
uvx claude-collector --output machine1-data.jsonl

# Machine 2  
uvx claude-collector --output machine2-data.jsonl

# Combine later
cat machine1-data.jsonl machine2-data.jsonl > combined-dataset.jsonl

4. Add to Existing Dataset

uvx claude-collector --output new-sessions.jsonl
cat existing-dataset.jsonl new-sessions.jsonl > updated-dataset.jsonl
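
Plain `cat` keeps every line, so an example that appears in both files ends up in the result twice. The following is a minimal sketch of a merge that drops exact duplicates; `merge_jsonl` is a hypothetical helper, not part of claude-collector.

```python
import hashlib
import json


def merge_jsonl(inputs, output):
    """Concatenate JSONL files, keeping the first copy of each distinct example."""
    seen = set()
    kept = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    # Canonicalize so key order does not affect the hash.
                    canonical = json.dumps(json.loads(line), sort_keys=True)
                    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                    if digest in seen:
                        continue
                    seen.add(digest)
                    out.write(line + "\n")
                    kept += 1
    return kept


# Example:
# merge_jsonl(["existing-dataset.jsonl", "new-sessions.jsonl"], "updated-dataset.jsonl")
```

Only byte-for-byte identical examples (after key sorting) are treated as duplicates; near-duplicates are kept.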

Output Format

Each line is a JSON object:

{
  "messages": [
    {"role": "user", "content": "How do I..."},
    {"role": "assistant", "content": "You can..."}
  ],
  "metadata": {
    "timestamp": "2025-11-13T...",
    "tokens": {
      "input_tokens": 100,
      "output_tokens": 200,
      "cache_creation_input_tokens": 5000,
      "cache_read_input_tokens": 1000
    }
  }
}

Perfect for:

- Fine-tuning LLMs
- Training coding assistants
- Building instruction datasets
- Analysis and research
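
Before feeding a dataset to a training pipeline, it can help to check that each line matches the shape shown above. This is a minimal sketch; `validate_example` is a hypothetical helper, and it assumes every message has a string `content` and a role of `user` or `assistant`, as in the sample record.

```python
import json


def validate_example(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(example, dict):
        return ["top-level value is not an object"]

    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("missing or empty 'messages' list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in ("user", "assistant"):
                problems.append(f"message {i}: unexpected role")
            elif not isinstance(msg.get("content"), str):
                problems.append(f"message {i}: 'content' is not a string")

    metadata = example.get("metadata")
    if not isinstance(metadata, dict) or "tokens" not in metadata:
        problems.append("missing 'metadata.tokens'")
    return problems
```

An empty return value means the line matches the documented structure; anything else lists what to inspect.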

Finding Claude Data

Default Locations

~/.claude/projects/              # Primary
~/.config/claude/projects/       # Alternative

Check All Users

ls -la /Users/*/.claude/projects     # macOS
ls -la /home/*/.claude/projects      # Linux

Find Anywhere

find ~ -name "*.jsonl" -path "*/.claude/*" 2>/dev/null
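
The same search can be sketched in Python, which is handy on systems without `find`. The helper names here are hypothetical, and the sketch only assumes what the commands above already imply: session files are `.jsonl` files under the default locations.

```python
from pathlib import Path


def default_candidates():
    """Return the two default locations listed above for the current user."""
    home = Path.home()
    return [home / ".claude" / "projects", home / ".config" / "claude" / "projects"]


def find_session_files(candidates=None):
    """Return every .jsonl file under the candidate directories, sorted by path."""
    roots = default_candidates() if candidates is None else candidates
    files = []
    for root in roots:
        root = Path(root)
        if root.is_dir():  # silently skip locations that do not exist
            files.extend(root.rglob("*.jsonl"))
    return sorted(files)


# Example: print(len(find_session_files()), "session files found")
```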

Privacy & Security

⚠️ Important: Claude Code logs contain sensitive data!

The tool sanitizes the following patterns:

🔐 API Keys & Tokens:

✅ OpenAI API keys (sk-*)
✅ Anthropic API keys (sk-ant-*)
✅ GitHub tokens (ghp_*, gho_*, ghs_*, 40-char hex)
✅ HuggingFace tokens (hf_*)
✅ Slack tokens (xoxb-*, xoxp-*, xoxe-*)
✅ AWS credentials (AKIA*, secrets)
✅ JWT tokens
✅ Google API keys & OAuth
✅ SendGrid API keys
✅ Stripe keys (live, test, restricted)
✅ Square tokens
✅ Facebook/Twitter tokens

💰 Financial & Crypto:

✅ Credit card numbers
✅ Crypto seed phrases (BIP39)
✅ Ethereum/Bitcoin addresses
✅ Crypto private keys (64 hex)

🔑 Authentication:

✅ Private keys (PEM format)
✅ Database URLs with passwords
✅ Generic password patterns
✅ OAuth credentials

👤 Personal Information:

✅ Email addresses
✅ Social Security Numbers (US)
✅ Phone numbers
✅ IP addresses (last octet redacted)
✅ File paths (username redacted)

Still check before sharing:

- Project names (if sensitive)
- Company-specific terminology
- Proprietary code patterns

For maximum privacy, review the output file before uploading anywhere.
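
A manual review can be backed by a quick scan of the output file. The following is a minimal sketch: the spot-check patterns and the `audit_lines` helper are assumptions for illustration, and `extra_terms` is where project names or company-specific terms would go.

```python
import re

# Hypothetical spot-check patterns; extend them for your own environment.
SUSPICIOUS = {
    "email-like": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "sk- key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "github token": re.compile(r"\bgh[pos]_[A-Za-z0-9]{20,}"),
    "PEM header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def audit_lines(lines, extra_terms=()):
    """Yield (line_number, label) for each suspicious match in an iterable of lines."""
    for number, line in enumerate(lines, start=1):
        for label, pattern in SUSPICIOUS.items():
            if pattern.search(line):
                yield number, label
        lowered = line.lower()
        for term in extra_terms:
            if term.lower() in lowered:
                yield number, f"term: {term}"


# Example:
# with open("claude_dataset_20251113.jsonl", encoding="utf-8") as f:
#     for hit in audit_lines(f, extra_terms=["my-project-name"]):
#         print(hit)
```

A clean scan does not prove the file is safe to share; it only rules out the patterns and terms that were checked.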

Requirements

- Python 3.8+
- Claude Code installed (for data to exist)

Installation Methods

1. uvx (easiest, no install)

uvx claude-collector

2. uv tool (global install)

uv tool install claude-collector
claude-collector

3. pip

pip install claude-collector
claude-collector

4. From source

git clone https://github.com/hanzoai/claude-collector
cd claude-collector
uv pip install -e .
claude-collector

Troubleshooting

"No Claude Code data found"

- Make sure Claude Code is installed
- Check that you've had at least one session
- Try specifying the path explicitly: --input ~/.claude/projects

"Only found a few conversations"

- This is normal if you're new to Claude Code
- Each session creates one file
- More usage = more data

"Tokens show 0"

- Some messages don't have usage tracking
- This is normal for system messages
- Real conversations will have token counts
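
To see how many examples in a saved dataset carry no usage data, the token fields from the output format can be tallied. This is a minimal sketch; `token_summary` is a hypothetical helper that assumes the `metadata.tokens` layout shown in the Output Format section.

```python
import json
from collections import Counter


def token_summary(lines):
    """Sum each token field across examples and count examples with no usage."""
    totals = Counter()
    zero_usage = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tokens = json.loads(line).get("metadata", {}).get("tokens") or {}
        counts = {k: v for k, v in tokens.items() if isinstance(v, int)}
        if sum(counts.values()) == 0:
            zero_usage += 1
        totals.update(counts)
    return dict(totals), zero_usage


# Example:
# with open("claude_dataset_20251113.jsonl", encoding="utf-8") as f:
#     totals, zero_usage = token_summary(f)
```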

Advanced: Custom Processing

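import json

# Read the dataset: one JSON object (one training example) per line.
# Replace the filename with your actual output file,
# e.g. claude_dataset_20251113.jsonl.
with open('claude_dataset.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        example = json.loads(line)

        # Access messages (user message first, assistant reply second)
        user_msg = example['messages'][0]['content']
        assistant_msg = example['messages'][1]['content']

        # Access metadata
        tokens = example['metadata']['tokens']
        timestamp = example['metadata']['timestamp']

        # Your custom processing here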

License

MIT - Free to use for any purpose

Credits

Built by Hanzo AI for the AI development community.

Found a bug? Open an issue: https://github.com/hanzoai/claude-collector/issues

Want to contribute? PRs welcome!
