This project is a Node.js application that uses Puppeteer to crawl web pages, extract metadata (title, description, body), and classify the content into relevant topics. It can be run locally and deployed to AWS Lambda.
- Crawl web pages and extract metadata using Puppeteer.
- Parse HTML content and extract title, description, and body using Cheerio.
- Classify page content and identify relevant topics using TF-IDF.
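The TF-IDF step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (`tokenize`, `termFrequencies`, `topTerms`), the smoothed IDF formula, and the idea of scoring a page against a small reference corpus are all assumptions, and the project may rely on a library instead.

```typescript
// Hypothetical helpers for illustration; none of these names come from the project.

// Lowercase the text and split it into alphanumeric tokens longer than two characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 2);
}

// Count how often each term appears in one document.
function termFrequencies(tokens: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  tokens.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));
  return counts;
}

// Rank the terms of `target` by TF-IDF against a reference corpus (assumed input).
function topTerms(target: string, corpus: string[], limit = 5): string[] {
  const docs = [target, ...corpus].map((d) => new Set(tokenize(d)));
  const tokens = tokenize(target);
  const tf = termFrequencies(tokens);
  const scored: Array<[string, number]> = [];
  tf.forEach((count, term) => {
    const docsWithTerm = docs.filter((d) => d.has(term)).length;
    // Smoothed IDF, so a term present in every document still gets a positive score.
    const idf = Math.log((1 + docs.length) / (1 + docsWithTerm)) + 1;
    scored.push([term, (count / tokens.length) * idf]);
  });
  return scored
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([term]) => term);
}
```

Terms that are frequent in the crawled page but rare across the reference documents rank highest, which is what makes them useful as topic candidates.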
- Node.js and npm
- Docker (for local AWS Lambda testing)
- Serverless Framework (for deployment)
- Clone the repository:
  git clone https://github.com/pandeakash/puppeteer-web-crawler.git
  cd puppeteer-web-crawler
- Install the dependencies:
  npm install
- Running the project locally
  - Transpile TypeScript to JavaScript:
    yarn run build
  - Start the Node.js server:
    yarn run start
- Accessing / testing the API:
  curl "http://localhost:8000/crawl-page?url=http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/"
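The curl example above works because the target URL has no query string of its own. When the page to crawl contains characters such as `?` or `&`, the `url` parameter should be percent-encoded. A minimal sketch, assuming the server listens on `http://localhost:8000` as in the example; `buildCrawlUrl` is a hypothetical helper, not part of the project:

```typescript
// Hypothetical helper: builds the request URL for the /crawl-page endpoint.
function buildCrawlUrl(targetUrl: string, baseUrl = "http://localhost:8000"): string {
  const endpoint = new URL("/crawl-page", baseUrl);
  // searchParams.set percent-encodes the value, so "?" and "&" inside
  // targetUrl are not mistaken for parameters of the crawl request itself.
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

// Example: a target URL that carries its own query string.
const requestUrl = buildCrawlUrl("http://example.com/search?q=puppeteer&page=2");
console.log(requestUrl);
```

The resulting string can be passed to curl or to any HTTP client.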
- Possible addition: capture screenshots of each page and store them in cloud storage.
- Possible addition: extract all links on each page and store them in a database.