- How Universal Scraper Works
- Live Working Example
- How It Works
- Smart HTML Cleaner
- Installation (Recommended)
- Installation
- Quick Start
- Export Formats
- CLI Usage
- MCP Server Usage
- Cache Management
- Advanced Usage
- API Reference
- Output Format
- Common Field Examples
- Multi-Provider AI Support
- Troubleshooting
- Roadmap
- Contributors
- Contributing
- License
- Changelog
A Python module for AI-powered web scraping with customizable field extraction using multiple AI providers (Gemini, OpenAI, Anthropic, and more via LiteLLM).
- Traditionally, developers have to write web scrapers by hand using tools such as requests/cloudscraper/Selenium (in Python) or Axios/Cheerio/Puppeteer/Selenium (in JS), writing BeautifulSoup4 selectors with XPath, classes, or IDs after analysing the HTML
- Even a slight change in the HTML structure breaks these scrapers; this fragility means they need constant, time-consuming maintenance and frequent rewrites to remain functional
- Writing an end-to-end scraper, from fetching the HTML to parsing it and exporting the data as JSON or CSV, is time consuming
- How about a module that writes the BeautifulSoup4 code on the fly by analysing an HTML structure reduced in size by 98%+, then reuses that extraction code for subsequent pages with the same HTML structure?
- A module that regenerates the BeautifulSoup4 code only when the HTML structure changes
- A module that can do a couple of hours of web scraping work in about 5 seconds, while costing about 0.8 cents (~$0.00786) in LLM API calls (for generating the extraction code only)
graph TB
A[🌐 Input URL] --> B[📥 HTML Fetcher]
B --> B1[CloudScraper Anti-Bot Protection]
B1 --> C[🧹 Smart HTML Cleaner]
C --> C1[Remove Scripts & Styles]
C1 --> C2[Remove Ads & Analytics]
C2 --> C2a[Remove Inline SVG Images]
C2a --> C2aa[Replace URLs with Placeholders]
C2aa --> C2b[Remove Non-Essential HTML Attributes]
C2b --> C3[Remove Navigation Elements]
C3 --> C4[Detect Repeating Structures]
C4 --> C5[Keep 2 Samples, Remove Others]
C5 --> C6[Remove Empty Divs]
C6 --> D[📊 98% Size Reduction]
D --> D1[🔗 Generate Structural Hash]
D1 --> E{🔍 Check Code Cache}
E -->|Cache Hit & Hash Match| F[♻️ Use Cached Code]
E -->|Cache Miss or Hash Changed| E1[🗑️ Discard Old Cache]
E1 --> G[🤖 AI Code Generation]
G --> G1[🧠 Choose AI Provider]
G1 --> G2[Gemini 2.5-Flash Default]
G1 --> G3[OpenAI GPT-4/GPT-4o]
G1 --> G4[Claude 3 Opus/Sonnet/Haiku]
G1 --> G5[100+ Other Models via LiteLLM]
G2 --> H[📝 Generate BeautifulSoup Code]
G3 --> H
G4 --> H
G5 --> H
H --> I[💾 Cache Generated Code + Hash]
F --> J[⚡ Execute Code on Original HTML]
I --> J
J --> K[📋 Extract Structured Data]
K --> L{📁 Output Format}
L -->|JSON| M[💾 Save as JSON]
L -->|CSV| N[📊 Save as CSV]
M --> O[✅ Complete with Metadata]
N --> O
style A fill:#e1f5fe
style D fill:#4caf50,color:#fff
style D1 fill:#ff5722,color:#fff
style E fill:#ff9800,color:#fff
style E1 fill:#f44336,color:#fff
style F fill:#4caf50,color:#fff
style G1 fill:#9c27b0,color:#fff
style O fill:#2196f3,color:#fff
Key Performance Benefits:
- 98% HTML Size Reduction → Massive token savings
- Smart Caching → 90%+ API cost reduction on repeat scraping
- Multi-Provider Support → Choose the best AI for your use case, 100+ LLMs supported
- Dual HTML Processing → cleaned HTML (up to 98.3%+ size reduction) for AI analysis, original HTML for complete data extraction
- Generates BeautifulSoup4 code on the fly → a structural hash of the HTML page lets the extraction code be reused on repeat scraping
Token Count Comparison (Claude Sonnet 4):
- 2,619 tokens: ~$0.00786 (0.8 cents)
- 150,742 tokens: ~$0.45 (45 cents)
- Token ratio: 150,742 ÷ 2,619 = 57.5x more tokens
- Savings: the raw-page request costs ~57.5x more than the cleaned-page one (the arithmetic is worked out in the snippet below)
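To reproduce these numbers, the cost arithmetic is a one-liner; the per-token rate used here (~$3 per million input tokens) is the rate implied by the figures above, not an official price quote.
# Cost arithmetic behind the comparison above (rate inferred from the figures, not an official quote)
PRICE_PER_TOKEN = 3.00 / 1_000_000   # ~$3 per million input tokens

cleaned_tokens = 2_619      # tokens sent after the 98%+ HTML reduction
raw_tokens = 150_742        # tokens if the raw page were sent instead

print(f"cleaned: ${cleaned_tokens * PRICE_PER_TOKEN:.5f}")   # ~$0.00786
print(f"raw:     ${raw_tokens * PRICE_PER_TOKEN:.2f}")       # ~$0.45
print(f"ratio:   {raw_tokens / cleaned_tokens:.2f}x")        # ~57.5x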
Here's a real working example showing Universal Scraper in action with Gemini 2.5 Pro:
>>> from universal_scraper import UniversalScraper
>>> scraper = UniversalScraper(api_key="AIzxxxxxxxxxxxxxxxxxxxxx", model_name="gemini-2.5-pro")
2025-09-17 01:22:30 - code_cache - INFO - CodeCache initialized with database: temp/extraction_cache.db
2025-09-17 01:22:30 - data_extractor - INFO - Code caching enabled
2025-09-17 01:22:30 - data_extractor - INFO - Using Google Gemini API with model: gemini-2.5-pro
2025-09-17 01:22:30 - data_extractor - INFO - Initialized DataExtractor with model: gemini-2.5-pro
>>> # Set fields for e-commerce laptop scraping
>>> scraper.set_fields(["product_name", "product_price", "product_rating", "product_description", "availability"])
2025-09-17 01:22:31 - universal_scraper - INFO - Extraction fields updated: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']
>>> result = scraper.scrape_url("https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops", save_to_file=True, format='csv')
2025-09-17 01:22:33 - universal_scraper.scraper - INFO - Starting scraping for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Starting to fetch HTML for: https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Fetching https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops with cloudscraper...
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Successfully fetched content with cloudscraper. Length: 163496
2025-09-17 01:22:33 - universal_scraper.core.html_fetcher - INFO - Successfully fetched HTML with cloudscraper
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Starting HTML cleaning process...
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed noise. Length: 142614
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed SVG/images. Length: 142614
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Replaced 252 URL sources with placeholders.
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Replaced URL sources. Length: 133928
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed iframes. Length: 133928
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed headers/footers. Length: 127879
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Focused on main content. Length: 127642
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Limited select options. Length: 127642
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 3 empty div elements in 1 iterations
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed empty divs. Length: 127553
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Collapsed 117 long text nodes
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Collapsed long text nodes. Length: 123068
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 0 non-essential attributes (3560 → 3560)
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed non-essential attributes. Length: 123068
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed whitespace between tags. Length: 123068 → 118089 (4.0% reduction)
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Found 117 similar structures, keeping 2, removing 115
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 115 repeating structure elements
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed repeating structures. Length: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed 0 empty div elements in 0 iterations
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Removed empty divs (post-compression). Length: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - HTML cleaning completed. Original: 150742, Final: 2371
2025-09-17 01:22:33 - universal_scraper.core.cleaning.base_cleaner - INFO - Reduction: 98.4%
2025-09-17 01:22:33 - data_extractor - INFO - Using HTML separation: cleaned for code generation, original for execution
2025-09-17 01:22:33 - code_cache - INFO - Cache MISS for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
2025-09-17 01:22:33 - data_extractor - INFO - Generating BeautifulSoup code with gemini-2.5-pro for fields: ['product_name', 'product_price', 'product_rating', 'product_description', 'availability']
2025-09-17 01:22:37 - code_cache - INFO - Code cached for https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops (hash: bd0ed6e62683fcfb...)
2025-09-17 01:22:37 - data_extractor - INFO - Successfully generated BeautifulSoup code
2025-09-17 01:22:37 - data_extractor - INFO - Executing generated extraction code...
2025-09-17 01:22:37 - data_extractor - INFO - Successfully extracted data with 117 items
2025-09-17 01:22:37 - universal_scraper - INFO - Successfully extracted data from https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops
>>>
# ✨ Results: 117 laptop products extracted from 163KB HTML in ~5 seconds!
# 🎯 98.4% HTML size reduction (163KB → 2.3KB for AI processing to generate BeautifulSoup4 code)
# 💾 Data automatically saved as CSV with product_name, product_price, product_rating, etc.
What Just Happened:
- Fields Configured for e-commerce: product_name, product_price, product_rating, etc.
- HTML Fetched with anti-bot protection (163KB)
- Smart Cleaning reduced size by 98.4% (163KB → 2.3KB)
- AI Generated custom extraction code with Gemini 2.5 Pro for the specified fields
- Code Cached for future use (90% cost savings on re-runs)
- 117 Laptop Products Extracted from original HTML with complete data
- Saved as CSV ready for analysis with all specified product fields (a quick way to inspect the returned result follows below)
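For completeness, here is a minimal sketch of inspecting the result dict returned by the call above; it assumes only the output structure documented in the Output Format section.
# Continuing the session above: `result` is the dict returned by scrape_url()
print(result['metadata']['items_extracted'])   # 117
print(result['metadata']['raw_html_length'])   # size of the fetched HTML
print(result['fields'])                        # the configured field list

first = result['data'][0]                      # peek at the first extracted item
print(first.get('product_name'), first.get('product_price'))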
- HTML Fetching: Uses cloudscraper or selenium to fetch HTML content, handling anti-bot measures
- Smart HTML Cleaning: Removes 98%+ of noise (scripts, ads, navigation, repeated structures, empty divs) while preserving data structure
- Structure-Based Caching: Creates a structural hash of the cleaned HTML and checks the cache for existing extraction code (sketched in code after this list)
- AI Code Generation: Uses your chosen AI provider (Gemini, OpenAI, Claude, etc.) to generate custom BeautifulSoup code on cleaned HTML (only when not cached)
- Code Execution: Runs the cached/generated code on original HTML to extract ALL data items
- Export as JSON/CSV: Returns complete, consistent, structured data with metadata and performance stats
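The caching step can be pictured roughly as follows; this is a conceptual sketch only, and the helper names, hashing details, and in-memory dict are illustrative rather than the module's actual internals (the real cache is persisted in an SQLite database, as the example logs show).
# Conceptual sketch of structure-based code caching (illustrative only; the
# module's real internals, helper names, and SQLite-backed cache differ).
import hashlib
from bs4 import BeautifulSoup

def structural_hash(html: str) -> str:
    """Hash only the tag/class skeleton, ignoring text, so pages that share a
    layout map to the same cache key."""
    soup = BeautifulSoup(html, "html.parser")
    skeleton = "".join(
        f"<{tag.name} class={'.'.join(tag.get('class', []))}>"
        for tag in soup.find_all(True)
    )
    return hashlib.sha256(skeleton.encode()).hexdigest()

code_cache = {}  # structural hash -> generated BeautifulSoup extraction code

def get_extraction_code(cleaned_html: str, generate_with_llm) -> str:
    key = structural_hash(cleaned_html)
    if key not in code_cache:                     # cache miss, or structure changed
        code_cache[key] = generate_with_llm(cleaned_html)  # the only LLM call
    return code_cache[key]                        # reused on every repeat scrape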
- Scripts & Styles: JavaScript, CSS, and style blocks
- Ads & Analytics: Advertisement content and tracking scripts
- Navigation: Headers, footers, sidebars, and menu elements
- Metadata: Meta tags, SEO tags, and hidden elements
- Empty Elements: Recursively removes empty div elements that don't contain meaningful content
- Noise: Comments, unnecessary attributes, and whitespace
- Inline SVG Images: embedded SVG markup that makes the page size bulky
- URL Placeholders: Replaces long URLs (src, href, action) with short placeholders like [IMG_URL] and [LINK_URL] to reduce token count (sketched in code after this list)
- Non-Essential Attribute Removal: Distinguishes between essential attributes (id, class, href, data-price) and non-essential ones (style, onclick, data-analytics) and strips the latter
- Whitespace & Blank Line Removal: Compresses the final HTML before sending it to the LLM for analysis
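A simplified sketch of the URL-placeholder and attribute-stripping ideas, assuming BeautifulSoup; the attribute whitelist and the [FORM_URL] placeholder name are illustrative, and the built-in cleaner is considerably more thorough (it also filters data-* attributes individually).
# Simplified sketch of URL-placeholder substitution and attribute stripping
# (illustrative; the built-in cleaner handles many more cases).
from bs4 import BeautifulSoup

ESSENTIAL_ATTRS = {"id", "class", "href", "src", "action"}  # illustrative whitelist

def shrink_for_llm(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        # Replace long URLs with short placeholders to cut token count
        if tag.get("src"):
            tag["src"] = "[IMG_URL]"
        if tag.get("href"):
            tag["href"] = "[LINK_URL]"
        if tag.get("action"):
            tag["action"] = "[FORM_URL]"   # placeholder name chosen for illustration
        # Drop non-essential attributes (style, onclick, tracking hooks, ...)
        for attr in list(tag.attrs):
            if attr not in ESSENTIAL_ATTRS and not attr.startswith("data-"):
                del tag[attr]
    return str(soup)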
The cleaner intelligently detects and reduces repeated HTML structures (a code sketch follows this list):
- Pattern Detection: Uses structural hashing + similarity algorithms to find repeated elements
- Smart Sampling: Keeps 2 samples from groups of 3+ similar structures (e.g., 20 job cards → 2 samples)
- Structure Preservation: Maintains document flow and parent-child relationships
- AI Optimization: Provides enough samples for pattern recognition without overwhelming the AI
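As a rough illustration of the sampling idea (not the module's actual detector, which combines structural hashing with similarity scoring), sibling elements can be grouped by a simple structural signature and trimmed to two samples per group:
# Rough sketch of repeated-structure sampling: group sibling elements by a
# simple structural signature and keep only two samples per large group.
from bs4 import BeautifulSoup

def keep_two_samples(html: str, min_group: int = 3) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for parent in soup.find_all(True):
        if parent.decomposed:               # skip nodes removed via an earlier group
            continue
        groups = {}
        for child in parent.find_all(True, recursive=False):
            sig = (child.name, tuple(child.get("class", [])))   # structural signature
            groups.setdefault(sig, []).append(child)
        for siblings in groups.values():
            if len(siblings) >= min_group:
                for extra in siblings[2:]:  # e.g. 20 job cards -> keep 2 samples
                    extra.decompose()
    return str(soup)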
The cleaner intelligently removes empty div elements (a code sketch follows below):
- Recursive Processing: Starts from innermost divs and works outward
- Content Detection: Preserves divs with text, images, inputs, or interactive elements
- Structure Preservation: Maintains parent-child relationships and avoids breaking important structural elements
- Smart Analysis: Removes placeholder/skeleton divs while keeping functional containers
Example: Removes empty animation placeholders like <div class="animate-pulse"></div> while preserving divs containing actual content.
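A minimal sketch of the recursive pruning idea, assuming BeautifulSoup; the real cleaner checks for a wider range of meaningful content than the few tags listed here.
# Minimal sketch of recursive empty-<div> pruning (illustrative only).
from bs4 import BeautifulSoup

def remove_empty_divs(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    removed = True
    while removed:                          # repeat until a full pass removes nothing
        removed = False
        for div in soup.find_all("div"):
            if div.decomposed:
                continue
            has_content = div.get_text(strip=True) or div.find(["img", "input", "svg", "iframe"])
            if not has_content:
                div.decompose()             # e.g. <div class="animate-pulse"></div>
                removed = True
    return str(soup)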
pip install universal-scraper
brew install pipx
sudo pipx install "universal-scraper[mcp]" --global
- Clone the repository:
git clone <repository-url>
cd Universal_Scrapper
- Install dependencies:
pip install -r requirements.txt
Or install manually:
pip install google-generativeai beautifulsoup4 requests selenium lxml fake-useragent
- Install the module:
pip install -e .
Option A: Use Gemini (Default - Recommended)
Get a Gemini API key from Google AI Studio:
export GEMINI_API_KEY="your_gemini_api_key_here"
Option B: Use OpenAI
export OPENAI_API_KEY="your_openai_api_key_here"
Option C: Use Anthropic Claude
export ANTHROPIC_API_KEY="your_anthropic_api_key_here"
Option D: Pass API key directly
# For any provider - just pass the API key directly
scraper = UniversalScraper(api_key="your_api_key")
from universal_scraper import UniversalScraper
# Option 1: Auto-detect provider (uses Gemini by default)
scraper = UniversalScraper(api_key="your_gemini_api_key")
# Option 2: Specify Gemini model explicitly
scraper = UniversalScraper(api_key="your_gemini_api_key", model_name="gemini-2.5-flash")
# Option 3: Use OpenAI
scraper = UniversalScraper(api_key="your_openai_api_key", model_name="gpt-4")
# Option 4: Use Anthropic Claude
scraper = UniversalScraper(api_key="your_anthropic_api_key", model_name="claude-3-sonnet-20240229")
# Option 5: Use any other provider supported by LiteLLM
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")
# Set the fields you want to extract
scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location"
])
# Check current model
print(f"Using model: {scraper.get_model_name()}")
# Scrape a URL (default JSON format)
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
print(f"Extracted {result['metadata']['items_extracted']} items")
print(f"Data saved to: {result.get('saved_to')}")
# Scrape and save as CSV
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
print(f"CSV data saved to: {result.get('saved_to')}")For quick one-off scraping:
from universal_scraper import scrape
# Quick scraping with default JSON format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"]
)
# Quick scraping with CSV format
data = scrape(
url="https://example.com/jobs",
api_key="your_gemini_api_key",
fields=["company_name", "job_title", "apply_link"],
format="csv"
)
# Quick scraping with OpenAI
data = scrape(
url="https://example.com/jobs",
api_key="your_openai_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="gpt-4"
)
# Quick scraping with Anthropic Claude
data = scrape(
url="https://example.com/jobs",
api_key="your_anthropic_api_key",
fields=["company_name", "job_title", "apply_link"],
model_name="claude-3-haiku-20240307"
)
print(data['data'])  # The extracted data
Universal Scraper supports multiple output formats to suit your data processing needs:
# JSON is the default format
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True)
# or explicitly specify
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='json')
JSON Output Structure:
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}
# Export as CSV for spreadsheet analysis
result = scraper.scrape_url("https://example.com/jobs", save_to_file=True, format='csv')
CSV Output:
- Clean tabular format with headers
- All fields as columns, missing values filled with empty strings
- Perfect for Excel, Google Sheets, or pandas processing (see the pandas example after this list)
- Automatically handles varying field structures across items
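Because the CSV is plain tabular data, it loads straight into pandas; the filename below is a placeholder for whatever path scrape_url(..., format='csv') reports in result['saved_to'].
import pandas as pd

# Path is a placeholder; use the file reported in result['saved_to']
df = pd.read_csv("output/laptops.csv")

print(df.columns.tolist())           # one column per configured field
print(df["product_price"].head())    # missing values show up as NaN/empty strings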
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
# Save all as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Save all as CSV
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')
# Gemini (default) - auto-detects from environment
universal-scraper https://example.com/jobs --output jobs.json
# OpenAI GPT models
universal-scraper https://example.com/products --api-key YOUR_OPENAI_KEY --model gpt-4 --format csv
# Anthropic Claude models
universal-scraper https://example.com/data --api-key YOUR_ANTHROPIC_KEY --model claude-3-haiku-20240307
# Custom fields extraction
universal-scraper https://example.com/listings --fields product_name product_price product_rating
# Batch processing multiple URLs
universal-scraper --urls urls.txt --output-dir results --format csv --model gpt-4o-mini
# Verbose logging with any provider
universal-scraper https://example.com --api-key YOUR_KEY --model gpt-4 --verbose
🔧 Advanced CLI Options:
# Set custom extraction fields
universal-scraper URL --fields title price description availability
# Use environment variables (auto-detected)
export OPENAI_API_KEY="your_key"
universal-scraper URL --model gpt-4
# Multiple output formats
universal-scraper URL --format json # Default
universal-scraper URL --format csv # Spreadsheet-ready
# Batch processing
echo -e "https://site1.com\nhttps://site2.com" > urls.txt
universal-scraper --urls urls.txt --output-dir batch_results
🔗 Provider Support: All 100+ models supported by LiteLLM work in CLI! See LiteLLM Providers for complete list.
Development Usage (from cloned repo):
python main.py https://example.com/jobs --api-key YOUR_KEY --model gpt-4
Universal Scraper works as an MCP (Model Context Protocol) server, allowing AI assistants to scrape websites directly.
- Install with MCP support:
pip install universal-scraper
- Set your AI API key:
export GEMINI_API_KEY="your_key"  # or OPENAI_API_KEY, ANTHROPIC_API_KEY
Add this to your Claude Code MCP settings:
{
"mcpServers": {
"universal-scraper": {
"command": "universal-scraper-mcp"
}
}
Or run this command in your terminal:
claude mcp add universal-scraper universal-scraper-mcp
Add this to your Cursor MCP configuration:
{
"mcpServers": {
"universal-scraper": {
"command": "universal-scraper-mcp"
}
}
}
- scrape_url: Scrape a single URL
- scrape_multiple_urls: Scrape multiple URLs
- configure_scraper: Set API keys and models
- get_scraper_info: Check current settings
- clear_cache: Clear cached data
Once configured, just ask your AI assistant:
"Scrape https://news.ycombinator.com and extract the top story titles and links"
"Scrape this product page and get the price, name, and reviews"
scraper = UniversalScraper(api_key="your_key")
# View cache statistics
stats = scraper.get_cache_stats()
print(f"Cached entries: {stats['total_entries']}")
print(f"Total cache hits: {stats['total_uses']}")
# Clear old entries (30+ days)
removed = scraper.cleanup_old_cache(30)
print(f"Removed {removed} old entries")
# Clear entire cache
scraper.clear_cache()
# Disable/enable caching
scraper.disable_cache() # For testing
scraper.enable_cache()   # Re-enable
scraper = UniversalScraper(api_key="your_api_key")
scraper.set_fields(["title", "price", "description"])
urls = [
"https://site1.com/products",
"https://site2.com/items",
"https://site3.com/listings"
]
# Scrape all URLs and save as JSON (default)
results = scraper.scrape_multiple_urls(urls, save_to_files=True)
# Scrape all URLs and save as CSV for analysis
results = scraper.scrape_multiple_urls(urls, save_to_files=True, format='csv')
for result in results:
if result.get('error'):
print(f"Failed {result['url']}: {result['error']}")
else:
print(f"Success {result['url']}: {result['metadata']['items_extracted']} items")scraper = UniversalScraper(
api_key="your_api_key",
temp_dir="custom_temp", # Custom temporary directory
output_dir="custom_output", # Custom output directory
log_level=logging.DEBUG, # Enable debug logging
model_name="gpt-4" # Custom model (OpenAI, Gemini, Claude, etc.)
)
# Configure for e-commerce scraping
scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_description"
])
# Check and change model dynamically
print(f"Current model: {scraper.get_model_name()}")
scraper.set_model_name("gpt-4") # Switch to OpenAI
print(f"Switched to: {scraper.get_model_name()}")
# Or switch to Claude
scraper.set_model_name("claude-3-sonnet-20240229")
print(f"Switched to: {scraper.get_model_name()}")
result = scraper.scrape_url("https://ecommerce-site.com", save_to_file=True)UniversalScraper(api_key=None, temp_dir="temp", output_dir="output", log_level=logging.INFO, model_name=None)api_key: AI provider API key (auto-detects provider, or set specific env vars)temp_dir: Directory for temporary filesoutput_dir: Directory for output fileslog_level: Logging levelmodel_name: AI model name (default: 'gemini-2.5-flash', supports 100+ models via LiteLLM)- See LiteLLM Providers for complete model list and setup
- set_fields(fields: List[str]): Set the fields to extract
- get_fields() -> List[str]: Get current fields configuration
- get_model_name() -> str: Get the current AI model name
- set_model_name(model_name: str): Change the AI model
- scrape_url(url: str, save_to_file=False, output_filename=None, format='json') -> Dict: Scrape a single URL
- scrape_multiple_urls(urls: List[str], save_to_files=True, format='json') -> List[Dict]: Scrape multiple URLs
scrape(url: str, api_key: str, fields: List[str], model_name: Optional[str] = None, format: str = 'json') -> Dict
Quick scraping function for simple use cases. Auto-detects AI provider from API key pattern.
Note: For model names and provider-specific setup, refer to the LiteLLM Providers Documentation.
The scraped data is returned in a structured format:
{
"url": "https://example.com",
"timestamp": "2025-01-01T12:00:00",
"fields": ["company_name", "job_title", "apply_link"],
"data": [
{
"company_name": "Example Corp",
"job_title": "Software Engineer",
"apply_link": "https://example.com/apply/123"
}
],
"metadata": {
"raw_html_length": 50000,
"cleaned_html_length": 15000,
"items_extracted": 1
}
}
# Job listings
scraper.set_fields([
"company_name",
"job_title",
"apply_link",
"salary_range",
"location",
"job_description",
"employment_type",
"experience_level"
])
# E-commerce products
scraper.set_fields([
"product_name",
"product_price",
"product_rating",
"product_reviews_count",
"product_availability",
"product_image_url",
"product_description"
])
# News articles
scraper.set_fields([
"article_title",
"article_content",
"article_author",
"publish_date",
"article_url",
"article_category"
])
Universal Scraper now supports multiple AI providers through LiteLLM integration:
- Google Gemini (Default): gemini-2.5-flash, gemini-1.5-pro, etc.
- OpenAI: gpt-4, gpt-4-turbo, gpt-3.5-turbo, etc.
- Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307
- 100+ Other Models: via LiteLLM, including Llama, PaLM, Cohere, and more
For complete model names and provider setup: See LiteLLM Providers Documentation
# Gemini (Default - Free tier available)
scraper = UniversalScraper(api_key="your_gemini_key")
# Auto-detects as gemini-2.5-flash
# OpenAI
scraper = UniversalScraper(api_key="sk-...", model_name="gpt-4")
# Anthropic Claude
scraper = UniversalScraper(api_key="sk-ant-...", model_name="claude-3-haiku-20240307")
# Environment variable approach
# Set GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY
scraper = UniversalScraper() # Auto-detects from env vars
# Any other provider from LiteLLM (see link above for model names)
scraper = UniversalScraper(api_key="your_api_key", model_name="llama-2-70b-chat")
Quick Reference for Popular Models:
# Gemini Models
model_name="gemini-2.5-flash" # Fast, efficient
model_name="gemini-1.5-pro" # More capable
# OpenAI Models
model_name="gpt-4" # Most capable
model_name="gpt-4o-mini" # Fast, cost-effective
model_name="gpt-3.5-turbo" # Legacy but reliable
# Anthropic Models
model_name="claude-3-opus-20240229" # Most capable
model_name="claude-3-sonnet-20240229" # Balanced
model_name="claude-3-haiku-20240307" # Fast, efficient
# Other Popular Models (see LiteLLM docs for setup)
model_name="llama-2-70b-chat" # Meta Llama
model_name="command-nightly" # Cohere
model_name="palm-2-chat-bison" # Google PaLM🔗 Complete Model List: Visit LiteLLM Providers Documentation for:
- All available model names
- Provider-specific API key setup
- Environment variable configuration
- Rate limits and pricing information
If you don't specify a model, the scraper automatically selects a provider (a sketch of the heuristic follows this list):
- Gemini: if GEMINI_API_KEY is set or the API key contains "AIza"
- OpenAI: if OPENAI_API_KEY is set or the API key starts with "sk-"
- Anthropic: if ANTHROPIC_API_KEY is set or the API key starts with "sk-ant-"
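The detection rules above can be expressed roughly as follows; the function name and return values are illustrative and not part of the library's public API.
import os

def detect_provider(api_key: str = "") -> str:
    """Illustrative version of the auto-detection rules listed above."""
    if api_key.startswith("sk-ant-"):      # check the more specific prefix first
        return "anthropic"
    if api_key.startswith("sk-"):
        return "openai"
    if "AIza" in api_key:
        return "gemini"
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    if os.getenv("ANTHROPIC_API_KEY"):
        return "anthropic"
    return "gemini"   # default: GEMINI_API_KEY / gemini-2.5-flash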
- API Key Error: Make sure your API key is valid and set correctly:
- Gemini: Set GEMINI_API_KEY or pass the key directly
- OpenAI: Set OPENAI_API_KEY or pass the key directly
- Anthropic: Set ANTHROPIC_API_KEY or pass the key directly
- Model Not Found: Ensure you're using the correct model name for your provider
- Empty Results: The AI might need more specific field names or the page might not contain the expected data
- Network Errors: Some sites block scrapers - the tool uses cloudscraper to handle most cases
- Model Name Issues: Check LiteLLM Providers for correct model names and setup instructions
Enable debug logging to see what's happening:
import logging
scraper = UniversalScraper(api_key="your_key", log_level=logging.DEBUG)See ROADMAP.md for planned features and improvements.
- Fork the repository
- Create a feature branch
- Make your changes
- Run pytest to run the test cases
- Check PEP 8 compliance:
flake8 universal_scraper/ --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 universal_scraper/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- Submit a pull request
MIT License - see LICENSE file for details.
See CHANGELOG.md for detailed version history and release notes.
