zjkang/distributed_web_crawler
About

This web crawler consists of a manager and a set of workers, and it can run in single-node mode or cluster mode. Each worker downloads web pages and reports the URLs extracted from those pages back to the manager. The manager is responsible for scheduling crawling tasks among the workers while keeping their load balanced, and it handles workers that connect and disconnect dynamically.

Installation Dependencies

Before you can run this crawler, you may need to download and install:

  1. BeautifulSoup
  2. ZeroMQ (the zmq core library) and pyzmq
  3. MongoDB and pymongo
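Assuming pip is available, the Python packages can typically be installed with the command below (the exact BeautifulSoup package depends on which version the code targets; MongoDB and the ZeroMQ core library are installed separately through your system's package manager):

     pip install beautifulsoup4 pyzmq pymongo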

Try it:

  1. Start the manager on the master node (example invocations for both steps follow this list).

     Usage: crawlerManager.py [options]
     Options:
       -h, --help            show this help message and exit
       -f FILE, --file=FILE  the file which contains the web sites from which to
                             start crawling, ./conf/seeds.cfg is used by default.
       -p REGPORT, --port=REGPORT
                             port on which connection requests are expected.
       -d URLPORT, --urlPort=URLPORT
                             port on which urls are sent to workers.
    
  2. Start workers on the master node or on any other hosts.

     Usage: crawlerWorker.py [options]
     Options:
       -h, --help            show this help message and exit
       -m MANAGER, --manager=MANAGER
                             the name/ip of the host on which manager is started.
       -p REGPORT, --port=REGPORT
                             port to connect manager.
       -d DOWNLOADERS, --download=DOWNLOADERS
                             number of threads which download web pages. 4 by
                             default.
    
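For example, on a single host you might start the manager with the default seed file and then attach one worker with four downloader threads (the port values below are arbitrary examples, not defaults from the code):

     python crawlerManager.py -f ./conf/seeds.cfg -p 5550 -d 5551
     python crawlerWorker.py -m localhost -p 5550 -d 4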

Design:

See design.png for the architecture diagram.
