This project allows you to scrape and download API documentation (or other websites) for offline use, converting pages into organized Markdown files, and optionally cleaning those Markdown files with a utility script.
- Offline Documentation Download: Scrape a React/JavaScript-powered documentation site using Playwright.
- Markdown Output: Each visited page is converted to `.md`, and all links are rewritten as local, relative Markdown links.
- Resume Downloads: If the download process is interrupted, it can be resumed without re-downloading everything.
- Cleanup Utility: Optionally remove any content above the first Markdown heading in your `.md` files.
```
API_DOWNLOADER/
├── .venv/                      # Python virtual environment (created after install)
├── utils/
│   └── clean-before-heading.py # Utility script that removes text above first heading
├── install-linux-deps.sh       # Bash script to install dependencies (Linux/macOS)
├── Install-windows-deps.bat    # Batch file to install dependencies (Windows)
├── md-scrape.py                # Main Python scraper/downloader script
└── requirements.txt            # Python dependencies (if you prefer pip install -r)
```
- `.venv/`
  A folder automatically created by the install scripts to hold your local Python virtual environment.
- `utils/clean-before-heading.py`
  A utility script that removes all text above the first `#` heading in each Markdown file (except `index.md`). It can process either a single file or an entire directory.
- `install-linux-deps.sh`
  A Bash script for Linux/macOS that:
  - Creates a `.venv/` directory.
  - Installs/updates pip.
  - Installs Python dependencies (including Playwright).
  - Installs Playwright’s browsers (Chromium, Firefox, WebKit).
- `Install-windows-deps.bat`
  A Windows batch file that:
  - Creates a `.venv/` directory.
  - Installs/updates pip.
  - Installs Python dependencies (including Playwright).
  - Installs Playwright’s browsers (Chromium, Firefox, WebKit).
- `md-scrape.py`
  The main script. It asks for a URL to scrape, prompts for an output directory, and saves BFS state so that you can resume if needed. As it visits each page, it:
  - Downloads the fully rendered HTML (via Playwright).
  - Converts the HTML to Markdown (via `markdownify`).
  - Rewrites links to reference local `.md` files.
  - Outputs `.md` files into subdirectories.
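  To make that flow concrete, here is a minimal sketch of the render-and-convert step, assuming Playwright’s sync API and the `markdownify` package; the function name and options are illustrative, not the script’s actual code:

  ```python
  # Minimal sketch: render a JavaScript-heavy page, convert it to Markdown.
  # Assumes Playwright's sync API and markdownify are installed.
  from playwright.sync_api import sync_playwright
  from markdownify import markdownify as md

  def fetch_page_markdown(url: str) -> str:
      """Load a fully rendered page and return its content as Markdown."""
      with sync_playwright() as p:
          browser = p.chromium.launch(headless=True)
          page = browser.new_page()
          # Wait for network activity to settle so client-rendered content exists.
          page.goto(url, wait_until="networkidle")
          html = page.content()
          browser.close()
      return md(html, heading_style="atx")  # emit "# Heading" style headings
  ```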
- `requirements.txt`
  A simple text file listing the Python package dependencies. You can install them via `pip install -r requirements.txt` (not strictly required if you’re using the provided install scripts).
- Python 3.7+ installed (confirm by running `python --version` or `python3 --version`).
- Git (optional, but useful for cloning the repo).
- A network connection for the initial scrape of the site.
- Supported OS:
  - Windows 10/11 or Server
  - Linux distributions (Ubuntu, Debian, Fedora, etc.)
  - macOS
- Playwright (installed automatically by the scripts) requires additional system dependencies for headless browsers. See the Playwright docs if you run into issues.
Choose either the Windows or the Linux/macOS instructions below. Both scripts do the following:
- Create a local Python virtual environment in `.venv`.
- Activate that environment (updating PATH for the current shell).
- Upgrade `pip` and install the required Python packages.
- Install the Playwright browsers so they can run headless.
- Double-click `Install-windows-deps.bat` (or run it in CMD/PowerShell).
- Wait for the script to complete; this may take a few minutes as it downloads and installs browsers.
- Once done, you should see a `.venv` folder in your project directory.
- Make the script executable:
  ```bash
  chmod +x install-linux-deps.sh
  ```
- Run it:
  ```bash
  ./install-linux-deps.sh
  ```
- Wait for the script to complete; a `.venv` folder will appear in your project directory.
- Activate the virtual environment (manually, if you’re in a new shell session):
  - Windows: `call .venv\Scripts\activate`
  - Linux/macOS: `source .venv/bin/activate`
- Run the main scraper:
  ```bash
  python md-scrape.py
  ```
- Provide inputs when prompted:
  - Base URL to scrape (e.g. `https://base.url.com/doc`).
  - Output directory (e.g. `my_docs`).
  - Resume if existing BFS state is found.
- As it runs, the scraper will:
  - Render each page with Playwright.
  - Convert the HTML to Markdown.
  - Rewrite internal links.
  - Save `.md` files to your chosen output directory.
  - Periodically save BFS state to `visited_urls.txt`, `to_visit_urls.txt`, and `url_to_local.json` (sketched below).
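  For illustration, that state could be persisted and restored roughly like this; the file names come from this README, but the exact format used by `md-scrape.py` is an assumption:

  ```python
  # Sketch of saving/loading BFS state between runs (format is assumed).
  import json
  from pathlib import Path

  def save_state(visited, to_visit, url_to_local, out_dir: Path):
      (out_dir / "visited_urls.txt").write_text("\n".join(sorted(visited)))
      (out_dir / "to_visit_urls.txt").write_text("\n".join(sorted(to_visit)))
      (out_dir / "url_to_local.json").write_text(json.dumps(url_to_local, indent=2))

  def load_state(out_dir: Path):
      visited = set((out_dir / "visited_urls.txt").read_text().splitlines())
      to_visit = set((out_dir / "to_visit_urls.txt").read_text().splitlines())
      url_to_local = json.loads((out_dir / "url_to_local.json").read_text())
      return visited, to_visit, url_to_local
  ```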
- After the scraper finishes, you can optionally run a utility script:
  - `clean-before-heading.py` (in the `utils` folder)
    This script removes all text above the first `#` heading in a Markdown file (excluding `index.md`).
- Still in the activated environment, run:
  ```bash
  python utils/clean-before-heading.py
  ```
- Provide either a single `.md` filename or a directory path when prompted.
  - The script will modify all Markdown files in the folder (recursively), or just the one file.
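The utility’s behavior can be summarized with a short sketch; this is a simplified illustration, not the script’s exact code:

```python
# Simplified sketch of clean-before-heading.py: drop everything above the
# first "#" heading in each Markdown file, leaving index.md untouched.
from pathlib import Path

def clean_file(path: Path) -> None:
    if path.name == "index.md":
        return  # index.md is excluded
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#"):
            path.write_text("".join(lines[i:]), encoding="utf-8")
            return
    # No heading found: leave the file unchanged.

def clean_tree(root: Path) -> None:
    for md_file in root.rglob("*.md"):  # recursive, matching the script's behavior
        clean_file(md_file)
```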
- BFS Scraping:
  `md-scrape.py` maintains two sets: `visited` (URLs already processed) and `to_visit` (URLs discovered but not yet processed).
  - For each URL, Playwright loads the page, waits for the main content to appear, and captures the final rendered HTML.
  - Markdown conversion is done via `markdownify`.
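  A simplified version of that loop, using a hypothetical helper in place of the script’s actual Playwright and parsing code:

  ```python
  # Sketch of the BFS crawl loop. render_and_save() is a hypothetical
  # stand-in that renders a page, writes its .md file, and returns the
  # same-domain links it discovered.
  def crawl(base_url: str) -> None:
      visited: set[str] = set()
      to_visit: set[str] = {base_url}
      while to_visit:
          url = to_visit.pop()
          if url in visited:
              continue
          visited.add(url)
          for link in render_and_save(url):  # hypothetical helper
              if link not in visited:
                  to_visit.add(link)
  ```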
- Link Rewriting:
  - All internal links (anchors) pointing to the same base domain are converted to relative local `.md` links.
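  One plausible way to implement that rewrite, given a `url_to_local` mapping like the one the scraper saves; a sketch only, not the script’s exact logic:

  ```python
  # Sketch: map a same-domain URL to a relative link to the local .md file.
  import os
  from urllib.parse import urlparse

  def rewrite_link(href, base_netloc, url_to_local, current_local_path):
      parsed = urlparse(href)
      if parsed.netloc and parsed.netloc != base_netloc:
          return href  # external link: leave untouched
      target = url_to_local.get(href)
      if target is None:
          return href  # page was not downloaded: keep the original URL
      # Relative path from the current page's directory to the target file.
      return os.path.relpath(target, start=os.path.dirname(current_local_path))
  ```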
- Resume on Interrupt:
  - If you exit or lose connection, the script saves the BFS state files to disk. On the next run, you can choose to resume.
- Cleaning Script:
  `clean-before-heading.py` scans each file, removing lines above the first `#`. This is optional post-processing to tidy up your docs.
Happy scraping and offline documentation building!