This web crawler consists of a manager and a set of workers, and it can run in single-node or cluster mode. Each worker downloads web pages and sends the URLs extracted from those pages back to the manager. The manager is responsible for scheduling crawling tasks among the workers while keeping the load balanced, and it also handles workers that connect or disconnect dynamically.
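As a rough illustration of this message flow, the sketch below shows how a manager and a worker could exchange registration messages and URLs over ZeroMQ. The socket types, port numbers, and message layout are assumptions made for the example; they are not the actual protocol implemented in crawlerManager.py and crawlerWorker.py.

    # Illustrative sketch only: socket types, ports and message format are assumptions.
    import zmq

    REG_PORT = 5000   # assumed registration port (the real one is set with -p)
    URL_PORT = 5001   # assumed URL-distribution port (the real one is set with -d)

    def manager(seeds):
        ctx = zmq.Context()
        reg = ctx.socket(zmq.REP)     # workers register and report extracted URLs here
        reg.bind("tcp://*:%d" % REG_PORT)
        out = ctx.socket(zmq.PUSH)    # PUSH load-balances URLs round-robin across workers
        out.bind("tcp://*:%d" % URL_PORT)

        frontier = list(seeds)        # normally read from ./conf/seeds.cfg
        while True:
            msg = reg.recv_json()     # either a registration or a batch of extracted URLs
            reg.send_json({"ok": True})
            if msg.get("type") == "urls":
                frontier.extend(msg["urls"])
            while frontier:
                out.send_string(frontier.pop(0))

    def worker(manager_host):
        ctx = zmq.Context()
        reg = ctx.socket(zmq.REQ)
        reg.connect("tcp://%s:%d" % (manager_host, REG_PORT))
        inbox = ctx.socket(zmq.PULL)
        inbox.connect("tcp://%s:%d" % (manager_host, URL_PORT))

        reg.send_json({"type": "register"})
        reg.recv_json()
        while True:
            url = inbox.recv_string()
            extracted = []            # download `url`, parse it with BeautifulSoup, collect links
            reg.send_json({"type": "urls", "urls": extracted})
            reg.recv_json()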
Before you can run this crawler, you may need to download and install the following (an example install sequence follows the list):
- BeautifulSoup
- zmq core lib and pyzmq
- mongodb and pymongo
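For example, with pip (the package names are assumptions; older revisions of the scripts may expect the Python 2 BeautifulSoup package rather than beautifulsoup4, and the MongoDB server itself has to be installed separately, e.g. through your OS package manager):

    pip install beautifulsoup4 pyzmq pymongo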
Start the manager on the master node.
Usage: crawlerManager.py [options]

Options:
  -h, --help            show this help message and exit
  -f FILE, --file=FILE  the file which contains the web sites from which to
                        start crawling; ./conf/seeds.cfg is used by default.
  -p REGPORT, --port=REGPORT
                        port on which connection requests are expected.
  -d URLPORT, --urlPort=URLPORT
                        port on which URLs are sent to workers.
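For example (the port numbers here are arbitrary; any free ports will do):

    python crawlerManager.py -f ./conf/seeds.cfg -p 5000 -d 5001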
Start workers on the master node or on any other hosts.
Usage: crawlerWorker.py [options]

Options:
  -h, --help            show this help message and exit
  -m MANAGER, --manager=MANAGER
                        the name/ip of the host on which the manager is started.
  -p REGPORT, --port=REGPORT
                        port to connect to the manager.
  -d DOWNLOADERS, --download=DOWNLOADERS
                        number of threads which download web pages; 4 by default.
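For example, to start a worker that connects to a manager running on 192.168.1.10 (the host, port, and thread count are illustrative):

    python crawlerWorker.py -m 192.168.1.10 -p 5000 -d 8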