Puppeteer Web Crawler

This project is a Node.js application that uses Puppeteer to crawl web pages, extract metadata (title, description, body), and classify the content into relevant topics. It can be run locally and deployed to AWS Lambda.

Features

Crawl web pages and extract metadata using Puppeteer.
Parse HTML content and extract title, description, and body using Cheerio.
Classify page content and identify relevant topics using TF-IDF.
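
As a rough illustration of the TF-IDF step, the sketch below scores each term of a page body against a small reference corpus and returns the top-scoring terms as candidate topics. It is a minimal sketch and not the project's actual classifier: the names `tokenize`, `tfidf`, and `topTopics`, the smoothed IDF formula, and the regex tokenizer are all assumptions made for this example.

```typescript
// Hypothetical helpers for illustration; not taken from the project's source.

// Lowercase the text and split it into alphabetic tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z]+/g) ?? [];
}

// Score each term of `doc` by TF-IDF against a reference `corpus`.
function tfidf(doc: string, corpus: string[]): Map<string, number> {
  const tokens = tokenize(doc);
  const counts = new Map<string, number>();
  tokens.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));

  // Pre-compute the set of distinct terms in each corpus document.
  const corpusSets = corpus.map((d) => new Set(tokenize(d)));

  const scores = new Map<string, number>();
  counts.forEach((count, term) => {
    const tf = count / tokens.length;
    const df = corpusSets.filter((s) => s.has(term)).length;
    // Smoothed IDF (an assumption) so unseen terms do not divide by zero.
    const idf = Math.log((1 + corpus.length) / (1 + df)) + 1;
    scores.set(term, tf * idf);
  });
  return scores;
}

// Return the `k` highest-scoring terms as candidate topics.
function topTopics(doc: string, corpus: string[], k: number): string[] {
  return Array.from(tfidf(doc, corpus).entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([term]) => term);
}
```

Terms that are frequent in the page but rare in the corpus rank highest, while words such as "the" that appear in every reference document are pushed down. A real classifier would likely also remove stop words and map terms to a fixed topic list.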

Prerequisites

Node.js and npm (the build and start scripts below are invoked with Yarn)
Docker (for local AWS Lambda testing)
Serverless Framework (for deployment)

Getting Started

Installation

Clone the repository:

git clone https://github.com/pandeakash/puppeteer-web-crawler.git
cd puppeteer-web-crawler

Install the dependencies:
```
npm install
```

Running the project locally

1. Transpile TypeScript to JavaScript:
```
yarn run build
```
2. Start the Node.js server:
```
yarn run start
```

Accessing / Testing API

 curl "http://localhost:8000/crawl-page?url=http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/"
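
The `url` query parameter carries a full URL, so a target that has its own query string (`?` or `&`) needs to be percent-encoded before it is sent. The sketch below builds the request URL safely. The helper name `buildCrawlRequest` is hypothetical, and the host and port default to the `localhost:8000` used in the example above.

```typescript
// Hypothetical helper: build a request URL for the /crawl-page endpoint,
// percent-encoding the target so its own query string survives intact.
function buildCrawlRequest(
  target: string,
  base: string = "http://localhost:8000" // assumed local server address
): string {
  const endpoint = new URL("/crawl-page", base);
  // searchParams.set() encodes reserved characters such as ':', '/', '?' and '&'.
  endpoint.searchParams.set("url", target);
  return endpoint.toString();
}

// Example: a target that contains its own query parameters.
const requestUrl = buildCrawlRequest("https://example.com/a?b=1&c=2");
// requestUrl === "http://localhost:8000/crawl-page?url=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1%26c%3D2"
```

The resulting string can be passed to `curl` or to `fetch`. The shape of the response is not documented here, so inspect it before relying on specific fields.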

Future scope

Capture screenshots of each crawled page and store them in cloud storage.
Extract all links on the page and store them in a database.
