@hayatosc hayatosc commented Jul 8, 2025

close #19

Overview

This PR adds sitemap.xml support for site crawling.

Some sites (such as https://nextjs.org) publish a sitemap.xml, so you can now use it through CLI options.

# With sitemap auto-detection (tries /sitemap.xml, /sitemap_index.xml, or a sitemap listed in /robots.txt)
$ sitemcp https://nextjs.org/docs --sitemap

# With custom sitemap URL
$ sitemcp https://nextjs.org/docs --sitemap-url /sitemap.xml
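
The auto-detection described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `sitemapCandidates` and `extractLocs` are hypothetical names, the candidate order is an assumption, and the regex-based `<loc>` extraction stands in for a proper XML parser.

```typescript
// Hypothetical helper: list the sitemap URLs to try for a site, in order.
// Assumes the robots.txt body (if any) was already downloaded by the caller.
function sitemapCandidates(siteUrl: string, robotsTxt = ""): string[] {
  const origin = new URL(siteUrl).origin;
  const candidates = [`${origin}/sitemap.xml`, `${origin}/sitemap_index.xml`];

  // robots.txt may declare sitemaps with lines such as "Sitemap: <url>".
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*sitemap:\s*(\S+)/i.exec(line);
    if (match) candidates.push(new URL(match[1], origin).href);
  }
  return Array.from(new Set(candidates)); // drop duplicates, keep order
}

// Hypothetical helper: collect every <loc> value from a sitemap or sitemap index.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) locs.push(match[1]);
  return locs;
}
```

A sitemap index also uses `<loc>` entries, so the same extraction can be applied once to the index and again to each sitemap it points to.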

Feature

New Options

  • --sitemap: boolean. Use sitemap.xml to discover URLs (auto-detect).
  • --sitemap-url: string. Custom sitemap URL path. If this option is set, the sitemap is used regardless of whether --sitemap is enabled.
  • --timeout: number. Timeout in seconds for site fetching (default: 60).
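
How the three options interact can be sketched as below. The names `CliFlags`, `SitemapMode`, and `resolveCrawlOptions` are hypothetical and only illustrate the precedence described in the list; the PR's actual option handling may differ.

```typescript
// Hypothetical shape of the parsed CLI flags; names are illustrative only.
interface CliFlags {
  sitemap?: boolean;
  sitemapUrl?: string;
  timeout?: number; // seconds
}

type SitemapMode =
  | { kind: "off" }
  | { kind: "auto" }
  | { kind: "custom"; url: string };

interface CrawlOptions {
  sitemap: SitemapMode;
  timeoutMs: number;
}

function resolveCrawlOptions(flags: CliFlags): CrawlOptions {
  // --sitemap-url wins: it enables sitemap discovery even without --sitemap.
  let sitemap: SitemapMode = { kind: "off" };
  if (flags.sitemapUrl) sitemap = { kind: "custom", url: flags.sitemapUrl };
  else if (flags.sitemap) sitemap = { kind: "auto" };

  // --timeout is given in seconds and defaults to 60.
  const seconds = flags.timeout ?? 60;
  return { sitemap, timeoutMs: seconds * 1000 };
}
```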

Background Fetching using Worker Threads

After testing the core logic, I found that this MCP server fails in Claude Code because site fetching takes more than 30 seconds and times out (for example, https://nextjs.org has over 300 pages). To avoid this, I moved fetching into background processing using Worker Threads.

If you would rather not have this, please leave a comment and I will revert it.
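
For readers unfamiliar with the approach, the background-fetching idea can be sketched with Node.js `worker_threads` as below. This is an assumption-laden illustration rather than the PR's implementation: `BackgroundCrawl`, the message shape, and the inlined worker source are all hypothetical, and the worker only echoes the URLs back instead of fetching them.

```typescript
import { Worker } from "node:worker_threads";

type CrawlStatus = "fetching" | "ready";

interface Page {
  url: string;
  content: string;
}

// Hypothetical worker body, inlined as a string so the sketch is self-contained.
// A real worker would fetch each page here instead of returning empty content.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  const pages = workerData.urls.map((url) => ({ url, content: "" }));
  parentPort.postMessage({ type: "done", pages });
`;

class BackgroundCrawl {
  status: CrawlStatus = "fetching";
  pages: Page[] = [];
  readonly done: Promise<void>;

  constructor(urls: string[]) {
    // The main thread returns immediately; the work happens in the worker.
    const worker = new Worker(workerSource, { eval: true, workerData: { urls } });
    this.done = new Promise<void>((resolve, reject) => {
      worker.once("message", (msg: { type: "done"; pages: Page[] }) => {
        this.pages = msg.pages;
        this.status = "ready";
        resolve();
      });
      worker.once("error", reject);
    });
  }
}
```

The server can answer requests while `status` is still "fetching" and serve the full result once the worker reports back.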

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @hayatosc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the sitemcp tool by adding comprehensive support for discovering URLs via sitemaps and implementing background site fetching using Node.js Worker Threads. These changes aim to improve the tool's efficiency and robustness when dealing with large websites, preventing timeouts and providing a more responsive user experience by performing intensive operations asynchronously.

Highlights

  • Sitemap Support: Introduced new CLI options (--sitemap and --sitemap-url) to enable URL discovery via sitemap.xml files. This includes auto-detection from common paths (/sitemap.xml, /sitemap_index.xml) and references found in robots.txt, as well as support for custom sitemap URLs and sitemap indexes.
  • Background Site Fetching with Worker Threads: The core site content fetching process has been moved to a dedicated Node.js Worker Thread. This prevents the main application thread from blocking, especially for large websites, and addresses potential timeouts, improving overall responsiveness.
  • Configurable Fetching Timeout: A new --timeout CLI option (defaulting to 60 seconds) allows users to specify a maximum duration for the entire site fetching operation. The fetching logic now includes checks to terminate early if this global timeout is reached, even during individual page fetches or content processing.
  • Improved Server Responsiveness and Status: The MCP server's get_index and get_page methods now provide real-time status updates ('fetching' or 'ready') and the total number of pages discovered. If a requested page is not found and fetching is not active, the server can proactively trigger a background fetch, enhancing the user experience for dynamic content discovery.
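
The global timeout and the status reporting summarized in the highlights can be illustrated together. The sketch below is hypothetical: `PageStore`, its method names, and the response fields (`status`, `totalPages`, `urls`) are assumptions chosen for illustration, not the server's actual API.

```typescript
type FetchStatus = "fetching" | "ready";

interface Page {
  url: string;
  content: string;
}

// Hypothetical in-memory store shared by the crawler and the tool handlers.
class PageStore {
  status: FetchStatus = "ready";
  private pages = new Map<string, Page>();

  // Fetch URLs one by one, stopping early once the global deadline has passed.
  async crawl(
    urls: string[],
    timeoutMs: number,
    fetchPage: (url: string) => Promise<string>,
    now: () => number = Date.now,
  ): Promise<void> {
    this.status = "fetching";
    const deadline = now() + timeoutMs;
    try {
      for (const url of urls) {
        if (now() >= deadline) break; // global timeout reached
        this.pages.set(url, { url, content: await fetchPage(url) });
      }
    } finally {
      this.status = "ready";
    }
  }

  // Shape of a get_index-style response: status plus the pages found so far.
  index(): { status: FetchStatus; totalPages: number; urls: string[] } {
    return {
      status: this.status,
      totalPages: this.pages.size,
      urls: Array.from(this.pages.keys()),
    };
  }
}
```

Passing the clock in as `now` keeps the deadline check easy to exercise in tests without waiting for real time to pass.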

recurseml bot commented Jul 8, 2025

😱 Found 1 issue. Time to roll up your sleeves! 😱

⚠️ Only 5 files were analyzed due to processing limits.

Need help? Join our Discord for support!
https://discord.gg/qEjHQk64Z9

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces sitemap support and parallel fetching using worker threads to improve site discovery and performance. The changes include new CLI options, modifications to the fetching logic, and the addition of a worker thread for background processing. The code includes timeout mechanisms and caching to handle large sites and improve efficiency.

src/server.ts Outdated
Comment on lines 75 to 77
const isDev = import.meta.dirname.includes("src");
const workerFile = isDev ? "worker.ts" : "worker.mjs";
const workerPath = path.join(import.meta.dirname, workerFile);

medium

The isDev variable relies on import.meta.dirname including "src". This might not be reliable in all environments (e.g., if the compiled output structure is different). A more robust approach would be to use a dedicated environment variable or configuration setting to determine the environment.
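
One way to follow this suggestion is sketched below. `SITEMCP_DEV` is an invented variable name used only for illustration (the project does not define it as far as this thread shows), and `workerFileFor` is a hypothetical helper.

```typescript
// Hypothetical: SITEMCP_DEV is an invented variable name for this sketch.
type Env = Record<string, string | undefined>;

function workerFileFor(env: Env): string {
  // Explicit opt-in instead of inspecting the directory name for "src".
  const isDev = env["SITEMCP_DEV"] === "1";
  return isDev ? "worker.ts" : "worker.mjs";
}
```

The result would then be joined with import.meta.dirname exactly as in the quoted lines, with the process environment passed in as `env`.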

pkg-pr-new bot commented Jul 14, 2025

npm i https://pkg.pr.new/ryoppippi/sitemcp@20

commit: 6064f57

@ryoppippi
Owner

Thank you for your contribution! Is this PR ready for review? @hayatosc

@ryoppippi
Owner

ryoppippi commented Jul 14, 2025

@hayatosc I've started looking at it. If you implement communication between the worker and the main thread, I recommend using https://github.com/antfu/birpc

…ignal handling

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@hayatosc
Author

@ryoppippi I think it's ready

@ryoppippi
Owner

Thank you!!!
I think the implementation is good. However, I'm wondering why we need to use workers, since they increase the complexity.

Development

Successfully merging this pull request may close these issues.

feat: Add sitemap.xml support for efficient site discovery
