
docrawl


A documentation-focused web crawler that converts sites to clean Markdown while preserving structure and staying polite.


Installation

# Install from crates.io
cargo install docrawl

# Or build from source
git clone https://github.com/neur0map/docrawl
cd docrawl
cargo build --release

Quick Start

docrawl "https://docs.rust-lang.org"          # crawl with default depth
docrawl "https://docs.python.org" --all       # full site crawl
docrawl "https://react.dev" --depth 2         # shallow crawl
docrawl "https://nextjs.org/docs" --fast      # quick scan without assets
docrawl "https://example.com/docs" --silence  # suppress progress/status output
docrawl --update                              # update to latest version

Key Features

  • Documentation-optimized extraction - Built-in selectors for Docusaurus, MkDocs, Sphinx, Next.js docs
  • Clean Markdown output - Preserves code blocks, tables, and formatting with YAML frontmatter metadata
  • Path-mirroring structure - Maintains original URL hierarchy as folders with index.md files
  • Polite crawling - Respects robots.txt, rate limits, and sitemap hints
  • Security-first - Sanitizes content, detects prompt injections, quarantines suspicious pages
  • Self-updating - Built-in update mechanism via docrawl --update

Why docrawl?

Unlike general-purpose crawlers, docrawl is purpose-built for documentation:

| Tool | Purpose | Output | Documentation Support |
| --- | --- | --- | --- |
| wget/curl | File downloading | Raw HTML | No extraction |
| httrack | Website mirroring | Full HTML site | No Markdown conversion |
| scrapy | Web scraping framework | Custom formats | Requires coding |
| docrawl | Documentation crawler | Clean Markdown | Auto-detects docs frameworks |

docrawl combines crawling, extraction, and conversion in a single tool optimized for technical documentation.

Library Usage

Add to your Cargo.toml:

[dependencies]
docrawl = "0.1"
tokio = { version = "1", features = ["full"] }

Minimal example:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        max_depth: Some(3),
        ..Default::default()
    };
    let stats = docrawl::crawl(cfg).await?;
    println!("Crawled {} pages", stats.pages);
    Ok(())
}
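
If you are calling docrawl from synchronous code, the same crawl can be driven with a manually constructed Tokio runtime instead of the #[tokio::main] macro. This sketch reuses only the CrawlConfig fields and crawl function shown above:

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a runtime explicitly rather than via the #[tokio::main] macro
    let rt = tokio::runtime::Runtime::new()?;
    let cfg = docrawl::CrawlConfig {
        base_url: url::Url::parse("https://example.com/docs")?,
        output_dir: std::path::PathBuf::from("./out"),
        max_depth: Some(3),
        ..Default::default()
    };
    // block_on drives the async crawl to completion on this runtime
    let stats = rt.block_on(docrawl::crawl(cfg))?;
    println!("Crawled {} pages", stats.pages);
    Ok(())
}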

CLI Options

| Option | Description | Default |
| --- | --- | --- |
| --depth <n> | Maximum crawl depth | 10 |
| --all | Crawl entire site | - |
| --output <dir> | Output directory | Current dir |
| --rate <n> | Requests per second | 2 |
| --concurrency <n> | Parallel workers | 8 |
| --selector <css> | Custom content selector | Auto-detect |
| --fast | Quick mode (no assets, rate=16) | - |
| --resume | Continue previous crawl | - |
| --silence | Suppress built-in progress/status output | - |
| --update | Update to latest version from crates.io | - |

These options compose. For example, a shallow, rate-limited crawl into a dedicated directory with a custom content selector (the target URL is a placeholder):

docrawl "https://docs.example.com" --depth 3 --rate 4 --output ./docs-mirror --selector "main"
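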

Configuration

Optional docrawl.config.json:

{
  "selectors": [".content", "article"],
  "exclude_patterns": ["\\.pdf$", "/api/"],
  "max_pages": 1000,
  "host_only": true
}
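
The escaped dot in "\\.pdf$" suggests the exclude_patterns entries are standard regular expressions. Assuming that, here is a quick standalone check (using the regex crate, regex = "1" in Cargo.toml) of which URLs the two patterns above would skip:

use regex::Regex;

fn main() {
    // The JSON strings "\\.pdf$" and "/api/" unescape to the regexes \.pdf$ and /api/
    let patterns = [
        Regex::new(r"\.pdf$").unwrap(),
        Regex::new(r"/api/").unwrap(),
    ];
    for url in [
        "https://example.com/guide/manual.pdf",
        "https://example.com/api/reference",
        "https://example.com/guide/intro",
    ] {
        let excluded = patterns.iter().any(|re| re.is_match(url));
        println!("{url} -> excluded: {excluded}");
    }
}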

Output Structure

output/
└── example.com/
    ├── index.md
    ├── guide/
    │   └── index.md
    ├── assets/
    │   └── images/
    └── manifest.json

Each Markdown file includes frontmatter:

---
title: Page Title
source_url: https://example.com/page
fetched_at: 2025-01-18T12:00:00Z
---

Performance

docrawl is optimized for speed and efficiency:

  • Fast HTML to Markdown conversion using fast_html2md
  • Concurrent processing with configurable worker pools
  • Intelligent rate limiting to respect server resources
  • Persistent caching to avoid duplicate work
  • Memory-efficient streaming for large sites
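
The rate-limiting and worker-pool bullets correspond to a common Tokio pattern: an interval paces request starts while a semaphore bounds in-flight work. This is an illustrative sketch of that pattern only, not docrawl's actual internals:

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{interval, Duration};

#[tokio::main]
async fn main() {
    let rate_per_sec: u64 = 2;                 // mirrors --rate
    let workers = Arc::new(Semaphore::new(8)); // mirrors --concurrency
    let mut tick = interval(Duration::from_millis(1000 / rate_per_sec));

    let mut handles = Vec::new();
    for url in ["https://example.com/a", "https://example.com/b"] {
        tick.tick().await; // pace request starts at the configured rate
        let permit = workers.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            println!("fetching {url}"); // fetch-and-convert work would go here
            drop(permit);               // free the worker slot for the next page
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}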

Security

docrawl includes built-in security features:

  • Content sanitization removes potentially harmful HTML
  • Prompt injection detection identifies and quarantines suspicious content
  • URL validation prevents malicious redirects
  • File system sandboxing restricts output to specified directories
  • Rate limiting prevents overwhelming target servers
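
As an illustration of the kind of URL validation the host_only config option implies, a crawler can resolve each candidate link against the seed URL and follow it only when the host matches. This is a sketch of that idea, not docrawl's actual check:

use url::Url;

// Follow a candidate link only if it resolves to the same host as the base URL.
fn same_host(base: &Url, candidate: &str) -> bool {
    match base.join(candidate) {
        Ok(resolved) => resolved.host_str() == base.host_str(),
        Err(_) => false, // unparseable links are rejected outright
    }
}

fn main() {
    let base = Url::parse("https://example.com/docs/").unwrap();
    assert!(same_host(&base, "guide/intro.html"));            // relative link, same host
    assert!(!same_host(&base, "https://evil.example.net/x")); // cross-host target
    println!("host-only checks passed");
}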

Contributing

Contributions are welcome! Feel free to submit a pull request. For major changes, open an issue first to discuss what you would like to change.

License

MIT
