OctoBotCrawler

🚀 OctoBotCrawler - A Python alternative to the GPT Crawler by Builder.io! 🔍🕸️ Built with FastAPI, Redis, and RQ workers, it's designed for scalable web crawling with concurrent page processing! 🐍✨

Features

  • Unique Job Identifiers: Utilizes UUIDs to ensure each crawl job is uniquely identifiable, preventing output file overwrites.
  • Filtered Data Retrieval: Provides an endpoint to retrieve a subset of crawled data based on a list of specific URLs.
  • Concurrent Crawling: Supports concurrent crawling of multiple pages for faster data collection.
  • JSON Output: Outputs the crawled data in JSON format.
  • Headless Browser Automation: Uses Playwright for efficient and reliable headless browser operations.

Upload Data to OpenAI

Once a crawl completes, its results are saved as a JSON file (named after the job ID) in the app/outputs/ directory; see "Download Full Crawl Results" below. You can upload this file to OpenAI to create your custom GPT or custom assistant.

Create a Custom GPT 🧑‍💻

Use this option for UI access to your generated knowledge that you can easily share with others.

  1. Go to ChatGPT.
  2. Click your name in the bottom left corner.
  3. Choose "My GPTs" in the menu.
  4. Choose "Create a GPT".
  5. Choose "Configure".
  6. Under "Knowledge", choose "Upload a file" and upload the JSON output file generated from the crawl.

Gif showing how to upload a custom GPT

Gif credit: Builder.io

Create a Custom Assistant 🤖

Use this option for API access to your generated knowledge, which you can integrate into your product.

  1. Go to OpenAI Assistants.
  2. Click "+ Create".
  3. Choose "Upload" and upload the JSON output file generated from the crawl.

This will allow you to create an assistant using the knowledge you generated during the crawl!

Project Structure

.
├── Dockerfile                 # Docker configuration for FastAPI and worker
├── docker-compose.yml         # Docker Compose configuration for all services
├── app/
│   ├── main.py                # FastAPI application entry point
│   ├── worker.py              # RQ worker for processing crawl jobs
│   ├── gptcrawlercore.py      # Core logic for crawling web pages using Playwright
│   ├── outputs/               # Directory for storing crawl output JSON files
│   └── requirements.txt       # Python dependencies
└── README.md                  # Project documentation

Prerequisites

  • Docker and Docker Compose (all services run in containers)
  • Git (to clone the repository)

Setting Up the Project

Step 1: Clone the Repository

git clone https://github.com/badboysm890/octobotcrawler.git
cd octobotcrawler

Step 2: Build and Run the Docker Containers

Use Docker Compose to build and run the FastAPI, Redis, and worker services.

docker-compose up --build

This command will:

  • Build the FastAPI app container.
  • Build the RQ worker container.
  • Start the Redis container.

FastAPI will be running on http://localhost:8000.
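
The repository ships its own docker-compose.yml; purely to illustrate how the three services fit together, a minimal file would look roughly like this (service names, build contexts, the REDIS_HOST variable, and the worker command are assumptions, not the project's actual configuration):

version: "3.8"

services:
  api:
    build: .
    ports:
      - "8000:8000"          # FastAPI exposed on localhost:8000
    environment:
      - REDIS_HOST=redis     # connect via the Docker service name, not localhost
    depends_on:
      - redis

  worker:
    build: .
    command: python worker.py   # assumed entry point; the repo's worker.py starts the RQ worker
    depends_on:
      - redis

  redis:
    image: redis:7
    ports:
      - "6379:6379"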

Interacting with the API

Start a Crawl Job

To start a new crawl job, make a POST request to the /crawl endpoint with the URL you want to crawl and the number of pages (max_pages) you want to scrape.

Example Request:

curl -X POST "http://localhost:8000/crawl?url=https://example.com&max_pages=10"

Example Response:

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "queued"
}

Check Job Status and Retrieve Results

Once the job is complete, you can retrieve the results using the job_id returned by the /crawl endpoint.

Get Job Status:

To get the status of a crawl job, use the /status/{job_id} endpoint.

Example Request:

curl "http://localhost:8000/status/3fa85f64-5717-4562-b3fc-2c963f66afa6"

Example Response:

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "status": "in_progress",
  "links_found": 50,
  "pages_crawled": 10,
  "current_url": "https://example.com/page10",
  "crawled_urls": [
    "https://example.com",
    "https://example.com/about",
    "... more URLs ..."
  ],
  "crawling_urls": [
    "https://example.com/contact"
  ],
  "start_time": 1700000000,
  "end_time": null,
  "errors": [
    {
      "url": "https://example.com/broken-link",
      "message": "Timeout while navigating to https://example.com/broken-link (Attempt 3/3)",
      "timestamp": 1700000100
    }
  ]
}
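
For automation, a simple polling loop against the status endpoint works. This sketch treats the "queued" and "in_progress" values shown above as the only non-terminal states (an assumption):

import time
import requests

def wait_for_job(job_id: str, poll_seconds: int = 5) -> dict:
    # Poll /status/{job_id} until the job leaves the queued/in_progress states.
    while True:
        status = requests.get(f"http://localhost:8000/status/{job_id}").json()
        print(status["status"], "pages crawled:", status.get("pages_crawled"))
        if status["status"] not in ("queued", "in_progress"):
            return status
        time.sleep(poll_seconds)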

Retrieve Filtered Crawl Data

Instead of downloading the entire JSON output, you can request a subset of the crawled data for a specific list of URLs using the /filtered-output/{job_id} endpoint.

Example Request:

curl -X POST "http://localhost:8000/filtered-output/3fa85f64-5717-4562-b3fc-2c963f66afa6" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com/page1", "https://example.com/page2"]}'

Example Response:

{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filtered_data": [
    {
      "title": "Page 1 Title",
      "url": "https://example.com/page1",
      "text": "Content of page 1..."
    },
    {
      "title": "Page 2 Title",
      "url": "https://example.com/page2",
      "text": "Content of page 2..."
    }
  ],
  "missing_urls": [
    "https://example.com/page3"
  ]
}

Field Descriptions:

  • filtered_data: Contains data for URLs that were found and crawled.
  • missing_urls: Lists URLs that were requested but not found in the crawl results.
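
The same filtered request from Python, again as a sketch with the requests library:

import requests

job_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"
payload = {"urls": ["https://example.com/page1", "https://example.com/page2"]}

resp = requests.post(f"http://localhost:8000/filtered-output/{job_id}", json=payload)
data = resp.json()
print(len(data["filtered_data"]), "pages found,", len(data["missing_urls"]), "missing")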

Download Full Crawl Results

You can download the entire crawl results using the /get-output/{job_id} endpoint.

Example Request:

curl -O "http://localhost:8000/get-output/3fa85f64-5717-4562-b3fc-2c963f66afa6"

If the job is complete, you will receive the 3fa85f64-5717-4562-b3fc-2c963f66afa6.json file with the crawled data.
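
To save the results locally from Python (for example to upload them to OpenAI as described above), a short sketch with requests:

import requests

job_id = "3fa85f64-5717-4562-b3fc-2c963f66afa6"
resp = requests.get(f"http://localhost:8000/get-output/{job_id}")
resp.raise_for_status()

# Save the per-job JSON output next to the script.
with open(f"{job_id}.json", "wb") as f:
    f.write(resp.content)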

Error Handling

If the job hasn't finished or if there's an issue retrieving the output, you'll receive an appropriate error message, such as:

{
  "detail": "Output file not found"
}

Services

FastAPI

  • Port: 8000
  • Purpose: Serves as the API interface for the crawler system. It accepts crawl requests, returns job status, and provides access to the crawl results.

Redis

  • Port: 6379
  • Purpose: Acts as the message broker and queue system for the worker processes. Jobs are enqueued in Redis and processed by RQ workers.

Worker

  • Purpose: Processes the crawl jobs. It pulls jobs from the Redis queue and executes them using the core crawling logic in gptcrawlercore.py.
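
As a rough illustration of the enqueue/consume pattern RQ uses here (the dotted function path and queue setup below are assumptions, not the project's actual code):

from redis import Redis
from rq import Queue

# Connect via the Docker service name, matching the troubleshooting note below.
redis_conn = Redis(host="redis", port=6379)
queue = Queue(connection=redis_conn)

# Conceptually what the /crawl endpoint does: enqueue a job for the worker.
# The callable must be importable by the worker process; "app.worker.crawl_job"
# is a hypothetical dotted path used only for illustration.
job = queue.enqueue("app.worker.crawl_job", "https://example.com", 10)
print("enqueued job", job.get_id())

# A separate worker process (e.g. started with `rq worker --url redis://redis:6379`)
# pulls jobs from this queue and executes them.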

File Structure

  • main.py: The FastAPI application that handles crawl requests and retrieves results.
  • worker.py: The RQ worker responsible for executing crawl jobs asynchronously.
  • gptcrawlercore.py: Contains the core logic for web crawling using Playwright.
  • outputs/: Directory where the crawled data (output files) is stored as JSON.
  • requirements.txt: Lists the Python dependencies required for the project.
  • Dockerfile: Docker configuration for containerizing the FastAPI app and worker.
  • docker-compose.yml: Docker Compose configuration for orchestrating the FastAPI app, Redis, and worker services.

Customizing the Crawler

You can control how many pages are crawled by adjusting the max_pages parameter in the request. The crawler's concurrency can be changed by editing the GPTCrawlerCore class in gptcrawlercore.py.
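
The actual crawling internals live in GPTCrawlerCore; purely as an illustration of the kind of concurrency cap involved (the MAX_CONCURRENCY knob and the structure below are hypothetical and may differ from the real class), a Playwright crawl with limited parallelism might look like this:

import asyncio
from playwright.async_api import async_playwright

MAX_CONCURRENCY = 5  # hypothetical knob analogous to the one in GPTCrawlerCore

async def fetch(browser, url, semaphore):
    # The semaphore caps how many pages are open at once.
    async with semaphore:
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=30_000)
            return {"title": await page.title(), "url": url, "text": await page.inner_text("body")}
        finally:
            await page.close()

async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        results = await asyncio.gather(*(fetch(browser, u, semaphore) for u in urls))
        await browser.close()
    return results

# asyncio.run(crawl(["https://example.com", "https://example.com/about"]))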

Troubleshooting

Common Issues

  • Redis Connection Error: If you see redis.exceptions.ConnectionError, ensure that your FastAPI app is connecting to Redis using the redis service name in Docker and not localhost.
  • Permission Issues: If you encounter permission issues when writing files to the outputs directory, ensure the directory has appropriate write permissions (e.g., chmod -R 777 app/outputs).
  • Job ID Not Found: Ensure that the job_id provided in the request matches an existing job. Use the /status/{job_id} endpoint to verify job existence.

License

This project is licensed under the MIT License. See the LICENSE file for details.
