OctoBotCrawler - A Python alternative to the GPT Crawler by Builder.io! Built with FastAPI, Redis, and RQ workers, it's designed for scalable web crawling with concurrent page processing!
- Unique Job Identifiers: Utilizes UUIDs to ensure each crawl job is uniquely identifiable, preventing output file overwrites.
- Filtered Data Retrieval: Provides an endpoint to retrieve a subset of crawled data based on a list of specific URLs.
- Concurrent Crawling: Supports concurrent crawling of multiple pages for faster data collection.
- JSON Output: Outputs the crawled data in JSON format.
- Headless Browser Automation: Uses Playwright for efficient and reliable headless browser operations.
Once the crawl generates a file called `output.json` in the `app/outputs/` directory, you can upload it to OpenAI to create your custom GPT or custom assistant.
Use this option for UI access to your generated knowledge that you can easily share with others.
- Go to ChatGPT.
- Click your name in the bottom left corner.
- Choose "My GPTs" in the menu.
- Choose "Create a GPT".
- Choose "Configure".
- Under "Knowledge", choose "Upload a file" and upload the
output.json
file generated from the crawl.
Gif credit: Builder.io
Use this option for API access to your generated knowledge, which you can integrate into your product.
- Go to OpenAI Assistants.
- Click "+ Create".
- Choose "Upload" and upload the
output.json
file generated from the crawl.
This will allow you to create an assistant using the knowledge you generated during the crawl!
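If you prefer to script the upload, the following is a minimal sketch using the official openai Python SDK (the package installation and API key setup are assumed; the file path matches the crawl output described above):

```python
from openai import OpenAI  # assumes the `openai` package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the crawl output so it can be attached to an assistant as knowledge.
with open("app/outputs/output.json", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")

print("Uploaded file id:", uploaded.id)
```

The uploaded file can then be attached to an assistant through the Assistants dashboard or the Assistants API.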
```
.
├── Dockerfile              # Docker configuration for FastAPI and worker
├── docker-compose.yml      # Docker Compose configuration for all services
├── app/
│   ├── main.py             # FastAPI application entry point
│   ├── worker.py           # RQ worker for processing crawl jobs
│   ├── gptcrawlercore.py   # Core logic for crawling web pages using Playwright
│   ├── outputs/            # Directory for storing crawl output JSON files
│   └── requirements.txt    # Python dependencies
└── README.md               # Project documentation
```
```bash
git clone https://github.com/badboysm890/octobotcrawler.git
cd octobotcrawler
```
Use Docker Compose to build and run the FastAPI, Redis, and worker services.
```bash
docker-compose up --build
```
This command will:
- Build the FastAPI app container.
- Build the RQ worker container.
- Start the Redis container.
FastAPI will be running on `http://localhost:8000`.
To start a new crawl job, make a POST request to the `/crawl` endpoint with the URL you want to crawl and the number of pages (`max_pages`) you want to scrape.
```bash
curl -X POST "http://localhost:8000/crawl?url=https://example.com&max_pages=10"
```
```json
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "queued"
}
```
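The same request from Python, as a small sketch using the `requests` library (any HTTP client will do):

```python
import requests

# Queue a crawl of example.com, limited to 10 pages.
resp = requests.post(
    "http://localhost:8000/crawl",
    params={"url": "https://example.com", "max_pages": 10},
)
resp.raise_for_status()
job_id = resp.json()["job_id"]
print("Queued job:", job_id)
```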
Once the job is complete, you can retrieve the results using the `job_id` returned by the `/crawl` request.
To get the status of a crawl job, use the `/status/{job_id}` endpoint.
curl "http://localhost:8000/status/3fa85f64-5717-4562-b3fc-2c963f66afa6"
```json
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "in_progress",
  "links_found": 50,
  "pages_crawled": 10,
  "current_url": "https://example.com/page10",
  "crawled_urls": [
    "https://example.com",
    "https://example.com/about",
    "... more URLs ..."
  ],
  "crawling_urls": [
    "https://example.com/contact"
  ],
  "start_time": 1700000000,
  "end_time": null,
  "errors": [
    {
      "url": "https://example.com/broken-link",
      "message": "Timeout while navigating to https://example.com/broken-link (Attempt 3/3)",
      "timestamp": 1700000100
    }
  ]
}
```
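If you want to block until the crawl finishes, one option is to poll the status endpoint, as in the sketch below (it assumes, based on the example response above, that `end_time` stays `null` while the job is still running):

```python
import time

import requests

job_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"  # as returned by /crawl

while True:
    status = requests.get(f"http://localhost:8000/status/{job_id}").json()
    print(status["status"], "-", status.get("pages_crawled", 0), "pages crawled")
    if status.get("end_time"):  # assumption: set once the job has finished
        break
    time.sleep(5)
```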
Before downloading the entire JSON output, you can request a subset of the crawled data based on a list of specific URLs using the `/filtered-output/{job_id}` endpoint.
curl -X POST "http://localhost:8000/filtered-output/3fa85f64-5717-4562-b3fc-2c963f66afa6" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com/page1", "https://example.com/page2"]}'
```json
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filtered_data": [
    {
      "title": "Page 1 Title",
      "url": "https://example.com/page1",
      "text": "Content of page 1..."
    },
    {
      "title": "Page 2 Title",
      "url": "https://example.com/page2",
      "text": "Content of page 2..."
    }
  ],
  "missing_urls": [
    "https://example.com/page3"
  ]
}
```
Field Descriptions:
- `filtered_data`: Contains data for URLs that were found and crawled.
- `missing_urls`: Lists URLs that were requested but not found in the crawl results.
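The same filtered request from Python, as a short sketch with the `requests` library:

```python
import requests

job_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"

# Request only the pages we care about.
resp = requests.post(
    f"http://localhost:8000/filtered-output/{job_id}",
    json={"urls": ["https://example.com/page1", "https://example.com/page2"]},
)
result = resp.json()
print(len(result["filtered_data"]), "pages returned")
print("Missing URLs:", result["missing_urls"])
```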
You can download the entire crawl results using the `/get-output/{job_id}` endpoint.
```bash
curl -O "http://localhost:8000/get-output/3fa85f64-5717-4562-b3fc-2c963f66afa6"
```
If the job is complete, you will receive the `3fa85f64-5717-4562-b3fc-2c963f66afa6.json` file with the crawled data.
If the job hasn't finished or if there's an issue retrieving the output, you'll receive an appropriate error message, such as:
```json
{
  "detail": "Output file not found"
}
```
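From Python, a minimal download sketch (again using `requests`) could look like this:

```python
import requests

job_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"

# Stream the finished output to a local JSON file.
resp = requests.get(f"http://localhost:8000/get-output/{job_id}", stream=True)
resp.raise_for_status()  # raises if the server reports an error (e.g. output not found)
with open(f"{job_id}.json", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```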
- FastAPI service (port 8000): Serves as the API interface for the crawler system. It accepts crawl requests, returns job status, and provides access to the crawl results.
- Redis service (port 6379): Acts as the message broker and queue system for the worker processes. Jobs are enqueued in Redis and processed by RQ workers.
- RQ worker service: Processes the crawl jobs. It pulls jobs from the Redis queue and executes them using the core crawling logic in `gptcrawlercore.py`.
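To show how these three services typically fit together, here is a minimal, self-contained sketch of the FastAPI + Redis + RQ pattern. It is not the project's actual `main.py` or `worker.py`; the queue name and `run_crawl` function are illustrative placeholders.

```python
import uuid

from fastapi import FastAPI
from redis import Redis
from rq import Queue

app = FastAPI()
redis_conn = Redis(host="redis", port=6379)  # "redis" is the Compose service name
queue = Queue("crawls", connection=redis_conn)


def run_crawl(job_id: str, url: str, max_pages: int) -> None:
    """Placeholder for the real crawl logic in gptcrawlercore.py."""


@app.post("/crawl")
def crawl(url: str, max_pages: int = 10):
    job_id = str(uuid.uuid4())  # unique per job, so output files never collide
    queue.enqueue(run_crawl, job_id, url, max_pages)
    return {"job_id": job_id, "status": "queued"}
```

The worker side is then a process such as `rq worker --url redis://redis:6379 crawls` running against the same Redis instance.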
- `main.py`: The FastAPI application that handles crawl requests and retrieves results.
- `worker.py`: The RQ worker responsible for executing crawl jobs asynchronously.
- `gptcrawlercore.py`: Contains the core logic for web crawling using Playwright.
- `outputs/`: Directory where the crawled data (output files) is stored as JSON.
- `requirements.txt`: Lists the Python dependencies required for the project.
- `Dockerfile`: Docker configuration for containerizing the FastAPI app and worker.
- `docker-compose.yml`: Docker Compose configuration for orchestrating the FastAPI app, Redis, and worker services.
You can modify the number of pages to be crawled by adjusting the `max_pages` parameter in the request. The concurrency of the crawler can be modified by editing the `GPTCrawlerCore` class in `gptcrawlercore.py`.
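As an illustration of what a concurrency limit can look like (this is a generic sketch, not the actual `GPTCrawlerCore` code), an `asyncio.Semaphore` is one common way to cap how many Playwright pages are open at once:

```python
import asyncio

from playwright.async_api import async_playwright

MAX_CONCURRENCY = 5  # illustrative; the real setting lives in GPTCrawlerCore


async def fetch(browser, url, semaphore):
    async with semaphore:  # at most MAX_CONCURRENCY pages open at a time
        page = await browser.new_page()
        try:
            await page.goto(url)
            return {"url": url, "title": await page.title()}
        finally:
            await page.close()


async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        results = await asyncio.gather(*(fetch(browser, u, semaphore) for u in urls))
        await browser.close()
    return results


if __name__ == "__main__":
    print(asyncio.run(main(["https://example.com"])))
```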
- Redis Connection Error: If you see `redis.exceptions.ConnectionError`, ensure that your FastAPI app is connecting to Redis using the `redis` service name in Docker and not `localhost` (see the snippet below).
- Permission Issues: If you encounter permission issues when writing files to the `outputs` directory, ensure the directory has appropriate write permissions (e.g., `chmod -R 777 app/outputs`).
- Job ID Not Found: Ensure that the `job_id` provided in the request matches an existing job. Use the `/status/{job_id}` endpoint to verify job existence.
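For the first point, a connection set up for Docker Compose might look like this sketch (the service name `redis` comes from `docker-compose.yml`):

```python
from redis import Redis

# Inside Docker Compose, containers reach Redis by its service name, not localhost.
redis_conn = Redis(host="redis", port=6379)
redis_conn.ping()  # raises redis.exceptions.ConnectionError if Redis is unreachable
```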
This project is licensed under the MIT License. See the `LICENSE` file for details.