Web Crawler Implementation - Rishabh Nag

Challenge

Write a simple web crawler. The crawler should be limited to one domain - so when crawling tomblomfield.com it would crawl all pages within the domain, but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should output a site map, showing which static assets each page depends on, and the links between pages.
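The "one domain" rule can be illustrated with a small check. This is a minimal sketch, not the code in main.js: the helper name `isInternalLink` is hypothetical, and it assumes "same domain" means an exact hostname match, so subdomains and `www.` variants would count as external.

```javascript
// Hypothetical helper: decide whether a link found on a page stays inside
// the domain being crawled. Not taken from the repository's main.js.
function isInternalLink(href, pageUrl) {
  let target;
  try {
    // Resolve relative links such as "/about" against the page they appear on.
    target = new URL(href, pageUrl);
  } catch (err) {
    return false; // unparsable href: skip it
  }
  // Only follow web pages, and only on the same host as the current page.
  const isWeb = target.protocol === "http:" || target.protocol === "https:";
  return isWeb && target.hostname === new URL(pageUrl).hostname;
}

console.log(isInternalLink("/about", "http://tomblomfield.com/"));                // true
console.log(isInternalLink("https://twitter.com/", "http://tomblomfield.com/"));  // false
console.log(isInternalLink("mailto:hi@example.com", "http://tomblomfield.com/")); // false
```

Resolving each `href` against the page URL first means relative and absolute links are compared the same way.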

#Implementation

Implemented with NodeJS, using the cheerio and request packages to simplify making HTTP calls and parsing the response body. The sitemap is output as a JSON file. A partial implementation in Go was also attempted.
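The overall crawl could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: `crawl`, `getPage`, and the `{ links, assets }` page shape are hypothetical, and the HTTP request and HTML parsing (handled by request and cheerio in this project) are replaced by a synchronous in-memory lookup so the example runs without network access or extra packages.

```javascript
// Sketch of a breadth-first crawl restricted to one host. `getPage` is a
// hypothetical callback returning { links: [...], assets: [...] } for a URL;
// in a real crawler it would wrap the HTTP request and the HTML parsing.
function crawl(startUrl, getPage) {
  const host = new URL(startUrl).hostname;
  const start = new URL(startUrl).href;
  const siteMap = {};          // url -> { links, assets }
  const queue = [start];
  const seen = new Set(queue); // avoids visiting a page twice

  while (queue.length > 0) {
    const url = queue.shift();
    const page = getPage(url);
    if (!page) continue; // unknown or failed page

    // Resolve every link against the current page and keep same-host ones.
    const internal = [];
    for (const href of page.links) {
      let target;
      try { target = new URL(href, url); } catch (err) { continue; }
      if (target.hostname !== host) continue;
      target.hash = ""; // "#section" fragments point at the same page
      if (!internal.includes(target.href)) internal.push(target.href);
      if (!seen.has(target.href)) {
        seen.add(target.href);
        queue.push(target.href);
      }
    }
    siteMap[url] = { links: internal, assets: page.assets };
  }
  return siteMap;
}

// Tiny in-memory "site" standing in for real HTTP responses.
const fakeSite = {
  "http://example.com/": { links: ["/about", "https://twitter.com/"], assets: ["/main.css"] },
  "http://example.com/about": { links: ["/", "/about#team"], assets: ["/logo.png"] },
};
const result = crawl("http://example.com/", (url) => fakeSite[url]);
console.log(JSON.stringify(result, null, 2));
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and the external Twitter link is recorded nowhere because it fails the hostname check.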

#JS Steps

npm install
node main http://tomblomfield.com
An optional output file name can be given with -o (defaults to results.json):

node main http://tomblomfield.com -o tom.json
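One way the command line above might be handled is sketched here. The function names `parseArgs` and `writeSiteMap` are hypothetical and the actual parsing in main.js may differ; only the `-o` flag and the results.json default come from the usage notes.

```javascript
const fs = require("fs");

// Hypothetical argument handling for `node main <url> [-o <file>]`.
// argv is expected in process.argv form: [node, script, ...args].
function parseArgs(argv) {
  const args = argv.slice(2);
  const options = { startUrl: null, outFile: "results.json" };
  for (let i = 0; i < args.length; i++) {
    if (args[i] === "-o") {
      // Consume the file name after -o, if one was given.
      if (i + 1 < args.length) options.outFile = args[++i];
    } else if (options.startUrl === null) {
      options.startUrl = args[i]; // first positional argument is the URL
    }
  }
  return options;
}

// Write the sitemap object as indented JSON.
function writeSiteMap(siteMap, outFile) {
  fs.writeFileSync(outFile, JSON.stringify(siteMap, null, 2));
}

console.log(parseArgs(["node", "main", "http://tomblomfield.com", "-o", "tom.json"]));
console.log(parseArgs(["node", "main", "http://tomblomfield.com"]));
```

The first call yields `tom.json` as the output file and the second falls back to `results.json`.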

#Go Steps

cd go
go run crawl.go http://tomblomfield.com
The Go version just prints the sitemap to the console.

