DOCUMENTATION SCRAPER MCP SERVER

Complete Functionality Preservation & Enhancement

This MCP server preserves ALL functionality from both original modules and adds powerful keyword-based URL filtering for targeted crawling.

✅ PRESERVED FROM "Make URL File" MODULE + ENHANCED

  • ✅ Website crawling starting from single URL
  • ✅ Recursive link discovery with configurable depth limits
  • ✅ Smart duplicate URL detection and normalization
  • ✅ Auto-scrolling pages to load dynamic content
  • ✅ Clicking toggle buttons to expand collapsed content (massively enhanced; see the toggle clicking section below)
  • ✅ Language filtering with confidence thresholds
  • ✅ User agent rotation for stealth browsing
  • ✅ Comprehensive logging with colored output
  • ✅ Browser resource management and cleanup
  • ✅ Configurable concurrent page processing
  • ✅ URL filtering by patterns and file extensions
  • ✅ Content extraction using CSS selectors
  • ✅ Text file output with organized directory structure
  • ✅ Error handling with retry mechanisms
  • ✅ Progress tracking and status reporting
  • 🆕 Keyword inclusion filtering for targeted URL discovery

🚀 NEW: ADVANCED TOGGLE CLICKING SYSTEM

  • Enhanced Selector Diversity: 15+ comprehensive CSS selectors including ARIA, Bootstrap, and framework patterns
  • Configurable Custom Selectors: Add site-specific toggle selectors via configuration
  • Smart State Detection: Prevents re-clicking already expanded elements using unique element IDs
  • Intelligent Expansion Verification: Replaces fixed timeouts with page.waitForFunction for real state-change detection (see the sketch after this list)
  • Iterative Nested Expansion: Automatically discovers and expands newly revealed toggle buttons (configurable iterations)
  • Multi-Method Verification: Checks ARIA attributes, controlled elements, CSS classes, and native HTML states
  • Comprehensive Error Handling: Graceful fallbacks with detailed logging for each expansion attempt
  • Performance Optimized: Efficient element tracking and batched processing
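A minimal sketch of the waitForFunction-based verification idea, assuming Puppeteer; the helper name and the state-change checks shown are illustrative, not the server's actual API:

// Click a toggle, then wait for a real state change instead of sleeping
// a fixed interval. Selector conventions below are assumptions.
import type { Page } from 'puppeteer';

async function clickAndVerifyExpansion(page: Page, selector: string): Promise<boolean> {
  const toggle = await page.$(selector);
  if (!toggle) return false;
  await toggle.click();
  try {
    // Success means aria-expanded flipped to "true" or an
    // "expanded"/"open" class appeared on the element.
    await page.waitForFunction(
      (sel) => {
        const el = document.querySelector(sel);
        return !!el && (
          el.getAttribute('aria-expanded') === 'true' ||
          el.classList.contains('expanded') ||
          el.classList.contains('open')
        );
      },
      { timeout: 3000 },
      selector,
    );
    return true;
  } catch {
    return false; // no state change detected; caller logs and moves on
  }
}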

🆕 NEW: TEXT-BASED CLICK TARGETING

  • Intelligent Content-Based Clicking: Target elements based on their text content rather than just CSS selectors
  • Session-Specific Configuration: Pass text-based click patterns per discovery session via MCP arguments
  • Global Configuration Support: Configure common text-based patterns in the main configuration
  • Flexible Text Matching: Case-sensitive or case-insensitive matching, with 'any' or 'all' keyword-match modes (see the sketch after the examples below)
  • Nested Element Support: Click target can be different from text source element (e.g., click button, check header text)
  • Comprehensive Logging: Detailed logs show text matches and successful clicks for debugging

🆕 NEW: SELECTOR-BASED EXCLUSION FEATURE

  • Proactive Content Filtering: Remove unwanted elements from DOM before text extraction
  • Comprehensive Default Patterns: Pre-configured to exclude headers, footers, navbars, sidebars, cookie banners
  • ARIA and Framework Support: Intelligent targeting of semantic elements and common frameworks
  • Configurable Exclusion Lists: Customize excluded selectors per use case or globally
  • DOM Manipulation Before Extraction: Elements are removed from the page's DOM before content processing
  • Comprehensive Error Handling: Graceful fallbacks with detailed logging for debugging
  • Minimal Performance Overhead: Efficient DOM manipulation executed directly in the browser context

Text-Based Click Target Examples:

// Example 1: Click buttons containing "Show More" or "Load More"
{
  "clickTargetSelector": "button",
  "textIncludes": ["Show More", "Load More", "Read More"],
  "caseSensitive": false,
  "matchType": "any"
}

// Example 2: Click sections where header contains "Details"
{
  "clickTargetSelector": ".expandable-section",
  "textMatchSelector": "h3.section-title",
  "textIncludes": ["Details", "Additional Information"],
  "caseSensitive": false
}

// Example 3: Session-specific targeting for comments
// Pass in discover-urls sessionTextBasedClickTargets parameter:
{
  "clickTargetSelector": ".comment-toggle",
  "textIncludes": ["Show Replies", "View Comments"],
  "matchType": "any"
}

✅ PRESERVED FROM "Scrape URL File" MODULE

  • ✅ Batch processing URLs from files or arrays
  • ✅ PDF generation with customizable options
  • ✅ Text content extraction and saving
  • ✅ Failed URL tracking and reporting
  • ✅ Auto-scrolling and toggle button clicking
  • ✅ VPN utilities integration (preserved as optional)
  • ✅ Concurrent processing with p-limit (see the sketch after this list)
  • ✅ Random wait times between requests
  • ✅ Comprehensive content selector coverage
  • ✅ Multiple output format support
  • ✅ Organized directory structure (texts/, pdfs/)
  • ✅ Detailed logging for each processed URL
  • ✅ Browser page management and cleanup
  • ✅ Network error handling and retries
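Concurrency with p-limit follows the library's standard pattern. A minimal sketch; scrapeUrl here is a stand-in for the server's per-URL processing, and the limit of 4 is illustrative:

import pLimit from 'p-limit';

// Cap concurrent page processing at a configurable limit.
const limit = pLimit(4);

async function scrapeAll(urls: string[], scrapeUrl: (url: string) => Promise<void>) {
  // Every call is queued; at most 4 run at once.
  await Promise.all(urls.map((url) => limit(() => scrapeUrl(url))));
}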

🆕 MCP ENHANCEMENTS

  • Real-time progress reporting to Claude
  • Parameter validation with Zod schemas (see the sketch after this list)
  • Session-based configuration management
  • Enhanced error reporting with context
  • Multiple input methods support
  • Resource usage monitoring
  • Keyword-based URL filtering for targeted crawling
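For example, parameter validation with Zod might look like the following sketch (the schema shown is illustrative, not the server's exact definition):

import { z } from 'zod';

// Hypothetical schema for the discover-urls tool's core parameters.
const discoverUrlsSchema = z.object({
  startUrl: z.string().url(),
  maxDepth: z.number().int().positive().optional(),
  keywords: z.array(z.string()).optional(),
  excludePatterns: z.array(z.string()).optional(),
});

type DiscoverUrlsParams = z.infer<typeof discoverUrlsSchema>;

// .parse() throws a descriptive ZodError on invalid input.
const params: DiscoverUrlsParams = discoverUrlsSchema.parse({
  startUrl: 'https://example.com',
  keywords: ['docs', 'guide'],
});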

🔧 MCP TOOLS

  1. discover-urls: Complete Make URL File functionality + Keyword filtering
  2. scrape-urls: Complete Scrape URL File functionality
  3. get-config: View current configuration
  4. update-config: Modify settings dynamically
  5. validate-config: Check configuration validity
  6. get-status: System health monitoring
  7. list-failed-urls: View and retry failed operations

🎯 NEW: KEYWORD FILTERING FEATURE

What is Keyword Filtering?

The new keyword filtering feature allows you to specify keywords that URLs must contain to be included in the crawling process. This enables highly targeted crawling by focusing only on URLs that match your specific interests.

How it Works:

  • Inclusion-based filtering: URLs are only kept if they contain at least one specified keyword
  • Case-insensitive matching: Keywords match regardless of capitalization
  • Full URL and path matching: Keywords are checked against both the complete URL and the path component
  • Multiple keyword support: Specify multiple keywords; URLs matching ANY keyword are included (see the sketch after the examples below)
  • Comprehensive logging: Detailed logs show which keywords matched and why URLs were included/excluded

Usage Examples:

// Example 1: Find only documentation pages
{
  "startUrl": "https://example.com",
  "keywords": ["docs", "documentation", "guide", "tutorial"]
}

// Example 2: Focus on API-related content
{
  "startUrl": "https://developer.example.com", 
  "keywords": ["api", "endpoint", "reference"]
}

// Example 3: Combine with exclusion patterns
{
  "startUrl": "https://blog.example.com",
  "keywords": ["react", "javascript"],
  "excludePatterns": ["comment", "sidebar"]
}
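The inclusion rule behind these examples can be sketched as a single predicate (the function name is illustrative):

// Keep a URL only if it contains at least one keyword, case-insensitively,
// checked against both the full URL and its path component.
function matchesKeywords(url: string, keywords: string[]): boolean {
  if (keywords.length === 0) return true; // no filter configured
  const full = url.toLowerCase();
  const path = new URL(url).pathname.toLowerCase();
  return keywords.some((keyword) => {
    const k = keyword.toLowerCase();
    return full.includes(k) || path.includes(k);
  });
}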

Benefits:

  • 🎯 Targeted crawling: Only discover URLs relevant to your specific needs
  • 🔇 Reduced noise: Eliminate irrelevant pages from your crawl results
  • ⚡ Improved efficiency: Spend less time processing unrelated content
  • 💰 Better resource usage: Focus computational resources on valuable content
  • 🎪 Enhanced precision: Get exactly the type of content you're looking for

🎯 SELECTOR-BASED EXCLUSION FEATURE

What is Selector-Based Exclusion?

The selector-based exclusion feature proactively removes unwanted elements from the page's DOM before text extraction begins. This eliminates common boilerplate content like headers, footers, navigation bars, sidebars, and cookie consent banners from your extracted content.

How it Works:

  • DOM Manipulation: Elements are removed directly from the browser's DOM using page.evaluate() (see the sketch after this list)
  • Before Content Extraction: Exclusion happens before any text extraction, ensuring clean content
  • Configurable Selectors: Use default patterns or customize with your own CSS selectors
  • Error-Resilient: Individual selector failures don't stop the entire process
  • Comprehensive Logging: Detailed logs show which elements were removed and why
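A minimal sketch of the removal pass, assuming Puppeteer's page.evaluate (the helper name is illustrative):

import type { Page } from 'puppeteer';

// Remove every element matching any exclusion selector, counting removals.
// Runs in the browser context, before text extraction begins.
async function removeExcludedElements(page: Page, selectors: string[]): Promise<number> {
  return page.evaluate((sels) => {
    let removed = 0;
    for (const sel of sels) {
      try {
        document.querySelectorAll(sel).forEach((el) => {
          el.remove();
          removed += 1;
        });
      } catch {
        // An invalid selector is skipped so it cannot abort
        // the rest of the exclusion pass.
      }
    }
    return removed;
  }, selectors);
}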

Default Exclusion Patterns:

// Semantic HTML Elements
"header", "footer", "nav", "aside"

// Common CSS Classes and IDs  
".sidebar", "#sidebar", ".cookie-banner", ".cookie-consent"

// ARIA Roles for Accessibility
"[role='banner']", "[role='contentinfo']", "[role='navigation']"

Configuration Examples:

Custom Exclusions for Documentation Sites:

{
  "contentExtraction": {
    "selectorsToExcludeFromText": [
      ".docs-sidebar", ".breadcrumb-nav", ".version-selector", "#table-of-contents"
    ]
  }
}

Benefits:

  • ✨ Cleaner Content: Extract only meaningful content without navigation clutter
  • 🔍 Better Text Quality: Improved signal-to-noise ratio in extracted text
  • 📊 Enhanced Analysis: More accurate content analysis without boilerplate interference
  • 🛡️ Privacy Focused: Automatically removes cookie banners and consent forms
  • ⚡ Performance Optimized: Removes elements before processing, reducing computational overhead

🚀 USAGE

  1. Install dependencies: npm install
  2. Build server: npm run build
  3. Add to Claude Desktop config (example below)
  4. Use tools in Claude conversations
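A typical Claude Desktop entry looks like the following; the server name and path are illustrative, so point args at wherever npm run build places the compiled entry point:

{
  "mcpServers": {
    "documentation-scraper": {
      "command": "node",
      // The path below is illustrative; use your actual build output location.
      "args": ["/path/to/Documentation_Scraper_MCP/build/index.js"]
    }
  }
}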

Every feature from both original modules is preserved and enhanced.
