webcrawler

User Requirements:

1. It should be limited to one domain:

  • so when crawling example.com it crawls all pages within the example.com domain, but does not follow links to external sites such as Facebook or Instagram, or to subdomains such as cloud.example.com (see the sketch after this list).

2. Given a URL, it should output a site map:

  • showing which static assets each page depends on and the links between pages. Choose the most appropriate data structure to store and display this site map.
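A minimal sketch of both requirements, assuming hostnames are compared with the standard library's urllib.parse; the is_same_domain helper and the exact dict layout are illustrative, not the repository's actual code:

```python
from urllib.parse import urlparse

def is_same_domain(url, root_netloc):
    # Exact host match only, so cloud.example.com is rejected
    # when the crawl was started at example.com.
    return urlparse(url).netloc == root_netloc

# One natural site-map structure is a dict keyed by page URL
# (an adjacency list), holding each page's outgoing links and
# the static assets it depends on.
sitemap = {
    "http://example.com/": {
        "links": ["http://example.com/about"],
        "assets": ["http://example.com/static/style.css"],
    },
}
```

A dict of this shape gives O(1) lookup per page while crawling and serializes directly to JSON for display.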

External package requirements:

a. reppy, a robots.txt parser and fetcher

Install with "pip install reppy"
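A minimal sketch of how a crawler might consult reppy's Robots class before fetching a page; the "webcrawler" user-agent string is an assumption:

```python
from reppy.robots import Robots

# Fetch and parse robots.txt once for the domain being crawled.
robots = Robots.fetch("http://example.com/robots.txt")

# Check each candidate URL against the rules before requesting it.
if robots.allowed("http://example.com/some-page", "webcrawler"):
    pass  # safe to fetch this page
```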

TODO

Improve memory usage

1. Modify the sitemap data structure to dump to file after every page crawl rather than storing the entire site map in memory and dumping it at the end, since it can grow very large (a possible approach is sketched below).
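One way that change could look, appending one JSON line per crawled page so memory use stays flat; append_page_entry and the field names are hypothetical:

```python
import json

def append_page_entry(path, url, links, assets):
    # Append one JSON line per page instead of holding the
    # whole site map in memory until the crawl finishes.
    with open(path, "a") as f:
        f.write(json.dumps({"url": url,
                            "links": links,
                            "assets": assets}) + "\n")
```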

Add unit tests

1. To check URL parsing (a sketch follows).
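A sketch of what such tests might look like, exercising the standard-library parsing behavior the crawler depends on; the cases are illustrative and not tied to the repository's actual helper functions:

```python
import unittest
from urllib.parse import urljoin, urlparse

class TestUrlParsing(unittest.TestCase):
    def test_relative_link_resolves_against_page_url(self):
        # Relative links found in a page must resolve against
        # the URL of the page they were found on.
        self.assertEqual(urljoin("http://example.com/a/", "../b"),
                         "http://example.com/b")

    def test_subdomain_host_differs_from_root_domain(self):
        # A subdomain's netloc must not match the root domain,
        # so the crawler will skip it.
        self.assertNotEqual(urlparse("http://cloud.example.com/").netloc,
                            "example.com")

if __name__ == "__main__":
    unittest.main()
```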
