Universal Scraper

The Python package for scraping data from any website


A Python module for AI-powered web scraping with customizable field extraction using multiple AI providers (Gemini, OpenAI, Anthropic, and more via LiteLLM).

Motivation for This Module

  • Traditionally, developers have to write web scrapers manually using technologies such as requests/cloudscraper/Selenium (in Python) or Axios/Cheerio/Puppeteer/Selenium (in JS)
  • They need to write BeautifulSoup4 selectors using XPath/class/id etc. by analysing the HTML
  • Even a slight change in HTML structure breaks these scrapers; this fragility means scrapers require constant, time-consuming maintenance and frequent rewrites to remain functional
  • Writing end-to-end web scrapers, from fetching the HTML to parsing it and exporting the data as JSON or CSV, is time-consuming
  • How about a module that can write BeautifulSoup4 code on the fly by analysing an HTML structure reduced in size by 98%+, then reuse that extraction code for subsequent pages with the same HTML structure?
  • A module that regenerates the BeautifulSoup4 code only if the HTML structure has changed
  • A module that can do a couple of hours of web scraping work in about 5 seconds, while costing less than a cent (~$0.00786) in LLM API calls (for generating the extraction code only)

How Universal Scraper Works

graph TB
    A[🌐 Input URL] --> B[📥 HTML Fetcher]
    B --> B1[CloudScraper Anti-Bot Protection]
    B1 --> C[🧹 Smart HTML Cleaner]
    
    C --> C1[Remove Scripts & Styles]
    C1 --> C2[Remove Ads & Analytics]
    C2 --> C2a[Remove Inline SVG Images]
    C2a --> C2aa[Replace URLs with Placeholders]
    C2aa --> C2b[Remove Non-Essential HTML Attributes]
    C2b --> C3[Remove Navigation Elements]
    C3 --> C4[Detect Repeating Structures]
    C4 --> C5[Keep 2 Samples, Remove Others]
    C5 --> C6[Remove Empty Divs]
    C6 --> D[📊 98% Size Reduction]
    
    D --> D1[🔗 Generate Structural Hash]
    D1 --> E{🔍 Check Code Cache}
    E -->|Cache Hit & Hash Match| F[♻️ Use Cached Code]
    E -->|Cache Miss or Hash Changed| E1[🗑️ Discard Old Cache]
    E1 --> G[🤖 AI Code Generation]
    
    G --> G1[🧠 Choose AI Provider]
    G1 --> G2[Gemini 2.5-Flash Default]
    G1 --> G3[OpenAI GPT-4/GPT-4o]
    G1 --> G4[Claude 3 Opus/Sonnet/Haiku]
    G1 --> G5[100+ Other Models via LiteLLM]
    
    G2 --> H[📝 Generate BeautifulSoup Code]
    G3 --> H
    G4 --> H
    G5 --> H
    
    H --> I[💾 Cache Generated Code + Hash]
    F --> J[⚡ Execute Code on Original HTML]
    I --> J
    
    J --> K[📋 Extract Structured Data]
    K --> L{📁 Output Format}
    L -->|JSON| M[💾 Save as JSON]
    L -->|CSV| N[📊 Save as CSV]
    
    M --> O[✅ Complete with Metadata]
    N --> O
    
    style A fill:#e1f5fe
    style D fill:#4caf50,color:#fff
    style D1 fill:#ff5722,color:#fff
    style E fill:#ff9800,color:#fff
    style E1 fill:#f44336,color:#fff
    style F fill:#4caf50,color:#fff
    style G1 fill:#9c27b0,color:#fff
    style O fill:#2196f3,color:#fff

Key Performance Benefits:

  • 98% HTML Size Reduction → Massive token savings
  • Smart Caching → 90%+ API cost reduction on repeat scraping
  • Multi-Provider Support → Choose the best AI for your use case, 100+ LLMs supported
  • Dual HTML Processing → Cleaned HTML, reduced in size by up to 98.3%, for AI analysis; original HTML for complete data extraction
  • Generates BeautifulSoup4 code on the fly → Computes a structural hash of the HTML page so that extraction code is reused on repeat scraping

Token Count Comparison (Claude Sonnet 4):

  • 2,619 tokens: ~$0.00786 (0.8 cents)
  • 150,742 tokens: ~$0.45 (45 cents)
  • Token ratio: 150,742 ÷ 2,619 ≈ 57.6x
  • Saving: the larger request costs roughly 57.6x as much as the smaller one (see the cost sketch below)
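
The figures above can be reproduced with a few lines of arithmetic. A minimal sketch, assuming roughly $3.00 per 1M input tokens for Claude Sonnet 4 (verify against current Anthropic pricing before relying on it):

# Cost arithmetic for the comparison above (assumed pricing of ~$3.00 per 1M input tokens).
PRICE_PER_MILLION_INPUT_TOKENS = 3.00

def input_cost(tokens: int) -> float:
    """Approximate input-token cost in USD."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"cleaned HTML: ${input_cost(2_619):.5f}")    # ~$0.00786
print(f"raw HTML:     ${input_cost(150_742):.2f}")  # ~$0.45
print(f"token ratio:  {150_742 / 2_619:.1f}x")      # ~57.6x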

Live Working Example

Here's a real working example showing Universal Scraper in action with Gemini 2.5 Pro:

>>> from universal_scraper import UniversalScraper
>>> scraper = UniversalScraper(api_key="AIzxxxxxxxxxxxxxxxxxxxxx", model_name="gemini-2.5-pro")
2025-09-17 01:22:30 - code_cache - INFO - CodeCache initialized with database: temp/extraction_cache.db
2025-09-17 01:22:30 - data_extractor - INFO - Code caching enabled
2025-09-17 01:22:30 - data_extractor - INFO - Using Google Gemini API with model: gemini-2.5-pro
2025-09-17 01:22:30 - data_extractor - INFO - Initialized DataExtractor with model: gemini-2.5-pro

>>> # Set fields for e-commerce laptop scraping
>>> scraper.set_fields(["product_name", "product_price", "product_rating", "product_description", "availability"])
2025-09-17 01:22:31 - universal_scraper - INFO - Extraction fields updated: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']

>>> result = scraper.scrape_url("https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops", save_to_file=True, format='csv')
2025-09-17 01:22:33 - universal_scraper.scraper - INFO - Starting scraping for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Starting to fetch HTML for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Fetching https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops with cloudscraper...
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Successfully fetched content with cloudscraper. Length: 163496
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Successfully fetched HTML with cloudscraper
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Starting HTML cleaning process...
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed noise. Length: 142614
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed SVG/images. Length: 142614
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Replaced 252 URL sources with placeholders.
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Replaced URL sources. Length: 133928
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed iframes. Length: 133928
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed headers/footers. Length: 127879
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Focused on main content. Length: 127642
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Limited select options. Length: 127642
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 3 empty div elements in 1 iterations
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed empty divs. Length: 127553
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Collapsed 117 long text nodes
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Collapsed long text nodes. Length: 123068
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 0 non-essential attributes (35603560)
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed non-essential attributes. Length: 123068
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed whitespace between tags. Length: 123068118089 (4.0% reduction)
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 115 repeating structure elements
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed repeating structures. Length: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 0 empty div elements in 0 iterations
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed empty divs (post-compression). Length: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - HTML cleaning completed. Original: 150742, Final: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Reduction: 98.4%
2025-09-17 01:22:33 - data_extractor - INFO - Using HTML separation: cleaned for code generation, original for execution
2025-09-17 01:22:33 - code_cache - INFO - Cache MISS for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - data_extractor - INFO - Generating BeautifulSoup code with gemini-2.5-pro for fields: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']
2025-09-17 01:22:37 - code_cache - INFO - Code cached for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops (hash: bd0ed6e62683fcfb...)
2025-09-17 01:22:37 - data_extractor - INFO - Successfully generated BeautifulSoup code
2025-09-17 01:22:37 - data_extractor - INFO - Executing generated extraction code...
2025-09-17 01:22:37 - data_extractor - INFO - Successfully extracted data with 117 items
2025-09-17 01:22:37 - universal_scraper - INFO - Successfully extracted data from https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
>>>

# ✨ Results: 117 laptop products extracted from 163KB HTML in ~5 seconds!
# 🎯 98.4% HTML size reduction (163KB → 2.3KB for AI processing to generate BeautifulSoup4 code)  
# 💾 Data automatically saved as CSV with product_name, product_price, product_rating, etc.

What Just Happened:

  1. Fields Configured for e-commerce: product_name, product_price, product_rating, etc.
  2. HTML Fetched with anti-bot protection (163KB)
  3. Smart Cleaning reduced size by 98.4% (163KB → 2.3KB)
  4. AI Generated custom extraction code using Gemini 2.5 Pro for the specified fields
  5. Code Cached for future use (90% cost savings on re-runs)
  6. 117 Laptop Products Extracted from original HTML with complete data
  7. Saved as CSV ready for analysis with all specified product fields

How It Works

  1. HTML Fetching: Uses cloudscraper or selenium to fetch HTML content, handling anti-bot measures
  2. Smart HTML Cleaning: Removes 98%+ of noise (scripts, ads, navigation, repeated structures, empty divs) while preserving data structure
  3. Structure-Based Caching: Creates a structural hash and checks the cache for existing extraction code (sketched in code after this list)
  4. AI Code Generation: Uses your chosen AI provider (Gemini, OpenAI, Claude, etc.) to generate custom BeautifulSoup code on cleaned HTML (only when not cached)
  5. Code Execution: Runs the cached/generated code on original HTML to extract ALL data items
  6. JSON/CSV Export: Returns complete, consistent, structured data with metadata and performance stats
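
To make steps 3–5 concrete, here is a minimal, hypothetical sketch of structure-based caching; the hashing scheme, cache layout, and helper names are illustrative assumptions, not the library's internals:

# Hypothetical sketch of structure-based code caching (not the library's actual internals).
import hashlib
from bs4 import BeautifulSoup

def structural_hash(html: str) -> str:
    """Hash only the tag/class skeleton, so text changes don't invalidate the cache."""
    soup = BeautifulSoup(html, "html.parser")
    skeleton = [(tag.name, tuple(sorted(tag.get("class", [])))) for tag in soup.find_all(True)]
    return hashlib.sha256(repr(skeleton).encode()).hexdigest()

code_cache = {}  # structural hash -> previously generated extraction code

def get_extraction_code(cleaned_html, generate_with_llm):
    key = structural_hash(cleaned_html)
    if key not in code_cache:                 # cache miss: pay for one LLM call
        code_cache[key] = generate_with_llm(cleaned_html)
    return code_cache[key]                    # cache hit: reuse the code for free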

Smart HTML Cleaner

What Gets Removed

  • Scripts & Styles: JavaScript, CSS, and style blocks
  • Ads & Analytics: Advertisement content and tracking scripts
  • Navigation: Headers, footers, sidebars, and menu elements
  • Metadata: Meta tags, SEO tags, and hidden elements
  • Empty Elements: Recursively removes empty div elements that don't contain meaningful content
  • Noise: Comments, unnecessary attributes, and whitespace
  • Inline SVG Images: Removes inline SVG markup, which bloats the page size
  • URL Placeholders: Replaces long URLs (src, href, action) with short placeholders like [IMG_URL], [LINK_URL] to reduce token count
  • Non-Essential Attributes: Distinguishes between essential attributes (id, class, href, data-price) and non-essential ones (style, onclick, data-analytics), and removes the latter
  • Whitespace & Blank Lines: Compresses the final HTML before sending it to the LLM for analysis (a simplified cleaning sketch follows below)
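
As an illustration of the ideas above, here is a simplified BeautifulSoup pass; the attribute whitelist and placeholder handling are assumptions for illustration, not the module's exact rules:

# Illustrative-only cleaning pass; the real cleaner is considerably more thorough.
from bs4 import BeautifulSoup

ESSENTIAL_ATTRS = {"id", "class", "href", "src", "name", "data-price"}  # assumed whitelist

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "svg", "noscript", "iframe"]):
        tag.decompose()                          # drop scripts, styles, inline SVGs, iframes
    for tag in soup.find_all(True):
        if tag.get("src"):
            tag["src"] = "[IMG_URL]"             # replace long URLs with short placeholders
        if tag.get("href"):
            tag["href"] = "[LINK_URL]"
        for attr in list(tag.attrs):
            if attr not in ESSENTIAL_ATTRS:
                del tag[attr]                    # strip non-essential attributes
    return str(soup)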

Repeating Structure Reduction

The cleaner intelligently detects and reduces repeated HTML structures:

  • Pattern Detection: Uses structural hashing + similarity algorithms to find repeated elements
  • Smart Sampling: Keeps 2 samples from groups of 3+ similar structures (e.g., 20 job cards → 2 samples)
  • Structure Preservation: Maintains document flow and parent-child relationships
  • AI Optimization: Provides enough samples for pattern recognition without overwhelming the AI (a simplified sketch follows below)
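
A simplified version of this sampling idea, assuming sibling elements are grouped by their tag and class list (the real module uses structural hashing plus similarity scoring):

# Illustrative repeated-structure reduction (assumed grouping rule, not the real algorithm).
from collections import defaultdict
from bs4 import BeautifulSoup

def reduce_repeats(html: str, keep: int = 2, min_group: int = 3) -> str:
    soup = BeautifulSoup(html, "html.parser")
    to_remove = []
    for parent in soup.find_all(True):
        groups = defaultdict(list)
        for child in parent.find_all(True, recursive=False):
            signature = (child.name, tuple(sorted(child.get("class", []))))
            groups[signature].append(child)
        for siblings in groups.values():
            if len(siblings) >= min_group:       # e.g. 20 job cards -> keep 2 samples
                to_remove.extend(siblings[keep:])
    for tag in to_remove:
        tag.extract()                            # detach the redundant repeats
    return str(soup)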

Empty Element Removal

The cleaner intelligently removes empty div elements:

  • Recursive Processing: Starts from innermost divs and works outward
  • Content Detection: Preserves divs with text, images, inputs, or interactive elements
  • Structure Preservation: Maintains parent-child relationships and avoids breaking important structural elements
  • Smart Analysis: Removes placeholder/skeleton divs while keeping functional containers

Example: Removes empty animation placeholders like <div class="animate-pulse"></div> while preserving divs containing actual content.
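
A minimal sketch of the recursive idea, under the simplifying assumption that "empty" means no text and no embedded media or form controls (the real heuristics are richer):

# Illustrative recursive empty-div removal (simplified notion of "empty").
from bs4 import BeautifulSoup

def remove_empty_divs(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    removed = True
    while removed:                               # repeat until no more divs become empty
        removed = False
        for div in soup.find_all("div"):
            has_text = bool(div.get_text(strip=True))
            has_content = div.find(["img", "input", "button", "select", "iframe", "a"]) is not None
            if not has_text and not has_content:
                div.extract()                    # e.g. drops <div class="animate-pulse"></div>
                removed = True
    return str(soup)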

Installation (Recommended)

pip install universal-scraper

Installation (Global install on macOS)

brew install pipx
sudo pipx install "universal-scraper[mcp]" --global

Installation (from Source)

  1. Clone the repository:

    git clone <repository-url>
    cd Universal_Scrapper
  2. Install dependencies:

    pip install -r requirements.txt

    Or install manually:

    pip install google-generativeai beautifulsoup4 requests selenium lxml fake-useragent
  3. Install the module:

    pip install -e .

Quick Start

1. Set up your API key

Option A: Use Gemini (default, recommended). Get a Gemini API key from Google AI Studio:

export GEMINI_API_KEY="your_gemini_api_key_here"

Option B: Use OpenAI

export OPENAI_API_KEY="your_openai_api_key_here"

Option C: Use Anthropic Claude

export ANTHROPIC_API_KEY="your_anthropic_api_key_here"

Option D: Pass API key directly

# For any provider - just pass the API key directly
scraper = UniversalScraper(api_key="your_api_key")

2. Basic Usage

from universal_scraper import UniversalScraper

# Option 1: Auto-detect provider (uses Gemini by default)
scraper = UniversalScraper(api_key="your_gemini_api_key")

# Option 2: Specify Gemini model explicitly
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-2.5-flash")

# Option 3: Use OpenAI
scraper = UniversalScraper(api_key="your_openai_api_key", model_name="gpt-4")

# Option 4: Use Anthropic Claude
scraper = UniversalScraper(api_key="your_anthropic_api_key", model_name="claude-3-sonnet-20240229")

# Option 5: Use any other provider supported by LiteLLM
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")

# Set the fields you want to extract
scraper.set_fields([
    "company_name", 
    "job_title", 
    "apply_link", 
    "salary_range",
    "location"
])

# Check current model
print(f"Using model: {scraper.get_model_name()}")

# Scrape a URL (default JSON format)
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)

print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")

# Scrape and save as CSV
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
print(f"CSV data saved to: {result.get('saved_to')}")

3. Convenience Function

For quick one-off scraping:

from universal_scraper import scrape

# Quick scraping with default JSON format
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"]
)

# Quick scraping with CSV format
data = scrape(
    url="https://example.com/jobs",
    api_key="your_gemini_api_key",
    fields=["company_name", "job_title", "apply_link"],
    format="csv"
)

# Quick scraping with OpenAI
data = scrape(
    url="https://example.com/jobs",
    api_key="your_openai_api_key",
    fields=["company_name", "job_title", "apply_link"],
    model_name="gpt-4"
)

# Quick scraping with Anthropic Claude
data = scrape(
    url="https://example.com/jobs",
    api_key="your_anthropic_api_key",
    fields=["company_name", "job_title", "apply_link"],
    model_name="claude-3-haiku-20240307"
)

print(data['data'])  # The extracted data

Export Formats

Universal Scraper supports multiple output formats to suit your data processing needs:

JSON Export (Default)

# JSON is the default format
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
# or explicitly specify
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='json')

JSON Output Structure:

{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}

CSV Export

# Export as CSV for spreadsheet analysis
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')

CSV Output:

  • Clean tabular format with headers
  • All fields as columns, missing values filled with empty strings
  • Perfect for Excel, Google Sheets, or pandas processing (see the example below)
  • Automatically handles varying field structures across items
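
For example, an exported CSV can be loaded straight into pandas; the filename below is hypothetical, the real path is returned in result.get('saved_to'):

# Hypothetical follow-up analysis of an exported CSV (the filename is illustrative).
import pandas as pd

df = pd.read_csv("output/jobs_scrape.csv")   # use the path from result.get('saved_to')
print(df.columns.tolist())                   # one column per extracted field
print(df.head())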

Multiple URLs with Format Choice

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

# Save all as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)

# Save all as CSV
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')

CLI Usage

# Gemini (default) - auto-detects from environment
universal-scraper https://example.com/jobs --output jobs.json

# OpenAI GPT models
universal-scraper https://example.com/products --api-key YOUR_OPENAI_KEY --model gpt-4 --format csv

# Anthropic Claude models  
universal-scraper https://example.com/data --api-key YOUR_ANTHROPIC_KEY --model claude-3-haiku-20240307

# Custom fields extraction
universal-scraper https://example.com/listings --fields product_name product_price product_rating

# Batch processing multiple URLs
universal-scraper --urls urls.txt --output-dir results --format csv --model gpt-4o-mini

# Verbose logging with any provider
universal-scraper https://example.com --api-key YOUR_KEY --model gpt-4 --verbose

🔧 Advanced CLI Options:

# Set custom extraction fields
universal-scraper URL --fields title price description availability

# Use environment variables (auto-detected)
export OPENAI_API_KEY="your_key"
universal-scraper URL --model gpt-4

# Multiple output formats
universal-scraper URL --format json    # Default
universal-scraper URL --format csv     # Spreadsheet-ready

# Batch processing
echo -e "https://site1.com\nhttps://site2.com" > urls.txt
universal-scraper --urls urls.txt --output-dir batch_results

🔗 Provider Support: All 100+ models supported by LiteLLM work in the CLI! See LiteLLM Providers for the complete list.

Development Usage (from cloned repo):

python main.py https://example.com/jobs --api-key YOUR_KEY --model gpt-4

MCP Server Usage

Universal Scraper works as an MCP (Model Context Protocol) server, allowing AI assistants to scrape websites directly.

Quick Setup

  1. Install with MCP support:
pip install "universal-scraper[mcp]"
  2. Set your AI API key:
export GEMINI_API_KEY="your_key"  # or OPENAI_API_KEY, ANTHROPIC_API_KEY

Claude Code Setup

Add this to your Claude Code MCP settings:

{
  "mcpServers": {
    "universal-scraper": {
      "command": "universal-scraper-mcp"
    }
  }
}

Or run this command in your terminal:

claude mcp add universal-scraper universal-scraper-mcp

Cursor Setup

Add this to your Cursor MCP configuration:

{
  "mcpServers": {
    "universal-scraper": {
      "command": "universal-scraper-mcp"
    }
  }
}

Available Tools

  • scrape_url: Scrape a single URL
  • scrape_multiple_urls: Scrape multiple URLs
  • configure_scraper: Set API keys and models
  • get_scraper_info: Check current settings
  • clear_cache: Clear cached data

Example Usage

Once configured, just ask your AI assistant:

"Scrape https://news.ycombinator.com and extract the top story titles and links"

"Scrape this product page and get the price, name, and reviews"

Cache Management

scraper = UniversalScraper(api_key="your_key")

# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")

# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")

# Clear entire cache
scraper.clear_cache()

# Disable/enable caching
scraper.disable_cache()  # For testing
scraper.enable_cache()   # Re-enable

Advanced Usage

Multiple URLs

scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])

urls = [
    "https://site1.com/products",
    "https://site2.com/items", 
    "https://site3.com/listings"
]

# Scrape all URLs and save as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)

# Scrape all URLs and save as CSV for analysis
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')

for result in results:
    if result.get('error'):
        print(f"Failed {result['url']}: {result['error']}")
    else:
        print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")

Custom Configuration

scraper = UniversalScraper(
    api_key="your_api_key",
    temp_dir="custom_temp",      # Custom temporary directory
    output_dir="custom_output",  # Custom output directory  
    log_level=logging.DEBUG,     # Enable debug logging
    model_name="gpt-4"           # Custom model (OpenAI, Gemini, Claude, etc.)
)

# Configure for e-commerce scraping
scraper.set_fields([
    "product_name",
    "product_price", 
    "product_rating",
    "product_reviews_count",
    "product_availability",
    "product_description"
])

# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gpt-4")  # Switch to OpenAI
print(f"Switched to: {scraper.get_model_name()}")

# Or switch to Claude
scraper.set_model_name("claude-3-sonnet-20240229")
print(f"Switched to: {scraper.get_model_name()}")

result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)

API Reference

UniversalScraper Class

Constructor

UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)
  • api_key: AI provider API key (auto-detects provider, or set specific env vars)
  • temp_dir: Directory for temporary files
  • output_dir: Directory for output files
  • log_level: Logging level
  • model_name: AI model name (default: 'gemini-2.5-flash', supports 100+ models via LiteLLM)

Methods

  • set_fields(fields: List[str]): Set the fields to extract
  • get_fields() -> List[str]: Get current fields configuration
  • get_model_name() -> str: Get the current AI model name
  • set_model_name(model_name: str): Change the AI model
  • scrape_url(url: str, save_to_file=False, output_filename=None, format='json') -> Dict: Scrape a single URL
  • scrape_multiple_urls(urls: List[str], save_to_files=True, format='json') -> List[Dict]: Scrape multiple URLs

Convenience Function

scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None, format: str = 'json') -> Dict

Quick scraping function for simple use cases. Auto-detects AI provider from API key pattern.

Note: For model names and provider-specific setup, refer to the LiteLLM Providers Documentation.

Output Format

The scraped data is returned in a structured format:

{
  "url": "https://example.com",
  "timestamp": "2025-01-01T12:00:00",
  "fields": ["company_name", "job_title", "apply_link"],
  "data": [
    {
      "company_name": "Example Corp",
      "job_title": "Software Engineer", 
      "apply_link": "https://example.com/apply/123"
    }
  ],
  "metadata": {
    "raw_html_length": 50000,
    "cleaned_html_length": 15000,
    "items_extracted": 1
  }
}

Common Field Examples

Job Listings

scraper.set_fields([
    "company_name",
    "job_title", 
    "apply_link",
    "salary_range",
    "location",
    "job_description",
    "employment_type",
    "experience_level"
])

E-commerce Products

scraper.set_fields([
    "product_name",
    "product_price",
    "product_rating", 
    "product_reviews_count",
    "product_availability",
    "product_image_url",
    "product_description"
])

News Articles

scraper.set_fields([
    "article_title",
    "article_content",
    "article_author",
    "publish_date", 
    "article_url",
    "article_category"
])

Multi-Provider AI Support

Universal Scraper now supports multiple AI providers through LiteLLM integration:

Supported Providers

  • Google Gemini (Default): gemini-2.5-flash, gemini-1.5-pro, etc.
  • OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo, etc.
  • Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
  • 100+ Other Models: Via LiteLLM including Llama, PaLM, Cohere, and more

For complete model names and provider setup: See LiteLLM Providers Documentation

Usage Examples

# Gemini (Default - Free tier available)
scraper = UniversalScraper(api_key="your_gemini_key")
# Auto-detects as gemini-2.5-flash

# OpenAI
scraper = UniversalScraper(api_key="sk-...", model_name="gpt-4")

# Anthropic Claude
scraper = UniversalScraper(api_key="sk-ant-...", model_name="claude-3-haiku-20240307")

# Environment variable approach
# Set GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
scraper = UniversalScraper()  # Auto-detects from env vars

# Any other provider from LiteLLM (see link above for model names)
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")

Model Configuration Guide

Quick Reference for Popular Models:

# Gemini Models
model_name="gemini-2.5-flash"        # Fast, efficient
model_name="gemini-1.5-pro"          # More capable

# OpenAI Models  
model_name="gpt-4"                   # Most capable
model_name="gpt-4o-mini"             # Fast, cost-effective
model_name="gpt-3.5-turbo"           # Legacy but reliable

# Anthropic Models
model_name="claude-3-opus-20240229"      # Most capable
model_name="claude-3-sonnet-20240229"    # Balanced
model_name="claude-3-haiku-20240307"     # Fast, efficient

# Other Popular Models (see LiteLLM docs for setup)
model_name="llama-2-70b-chat"        # Meta Llama
model_name="command-nightly"          # Cohere
model_name="palm-2-chat-bison"        # Google PaLM

🔗 Complete Model List: Visit LiteLLM Providers Documentation for:

  • All available model names
  • Provider-specific API key setup
  • Environment variable configuration
  • Rate limits and pricing information

Model Auto-Detection

If you don't specify a model, the scraper automatically selects a provider based on the rules below (a code sketch follows the list):

  • Gemini: If GEMINI_API_KEY is set or API key contains "AIza"
  • OpenAI: If OPENAI_API_KEY is set or API key starts with "sk-"
  • Anthropic: If ANTHROPIC_API_KEY is set or API key starts with "sk-ant-"
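
A hypothetical sketch of these rules; note that the "sk-ant-" check must run before the generic "sk-" check. This is an illustration, not the package's actual detection code:

# Illustrative provider auto-detection based on the rules above (not the library's code).
import os

def detect_provider(api_key: str = "") -> str:
    if "AIza" in api_key or os.getenv("GEMINI_API_KEY"):
        return "gemini"          # default model: gemini-2.5-flash
    if api_key.startswith("sk-ant-") or os.getenv("ANTHROPIC_API_KEY"):
        return "anthropic"       # must be checked before the generic "sk-" prefix
    if api_key.startswith("sk-") or os.getenv("OPENAI_API_KEY"):
        return "openai"
    return "gemini"              # fall back to the default provider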

Troubleshooting

Common Issues

  1. API Key Error: Make sure your API key is valid and set correctly:
    • Gemini: Set GEMINI_API_KEY or pass directly
    • OpenAI: Set OPENAI_API_KEY or pass directly
    • Anthropic: Set ANTHROPIC_API_KEY or pass directly
  2. Model Not Found: Ensure you're using the correct model name for your provider
  3. Empty Results: The AI might need more specific field names or the page might not contain the expected data
  4. Network Errors: Some sites block scrapers - the tool uses cloudscraper to handle most cases
  5. Model Name Issues: Check LiteLLM Providers for correct model names and setup instructions

Debug Mode

Enable debug logging to see what's happening:

import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)

Roadmap

See ROADMAP.md for planned features and improvements.

Contributors

Contributors List

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run pytest to run the test suite
  5. Check PEP 8 compliance with flake8:
flake8 universal_scraper/ --count --select=E9,F63,F7,F82 --show-source --statistics

flake8 universal_scraper/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
  6. Submit a pull request

License

MIT License - see LICENSE file for details.

Changelog

See CHANGELOG.md for detailed version history and release notes.
