This web crawler consists of a manager and a set of workers, and it can run in single-node or cluster mode. Each worker downloads web pages and sends the URLs extracted from those pages back to the manager. The manager is responsible for scheduling crawling tasks among the workers while keeping the load balanced, and it also handles workers that connect or disconnect dynamically.
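As a rough illustration of this message flow, the sketch below shows how a manager and a worker could exchange registration messages and URLs over ZeroMQ. The socket types, port numbers, and message layout are assumptions made for the example; they are not the actual protocol implemented in crawlerManager.py and crawlerWorker.py.

    # Illustrative sketch only: socket types, ports and message format are assumptions.
    import zmq

    REG_PORT = 5000   # assumed registration port (the real one is set with -p)
    URL_PORT = 5001   # assumed URL-distribution port (the real one is set with -d)

    def manager(seeds):
        ctx = zmq.Context()
        reg = ctx.socket(zmq.REP)     # workers register and report extracted URLs here
        reg.bind("tcp://*:%d" % REG_PORT)
        out = ctx.socket(zmq.PUSH)    # PUSH load-balances URLs round-robin across workers
        out.bind("tcp://*:%d" % URL_PORT)

        frontier = list(seeds)        # normally read from ./conf/seeds.cfg
        while True:
            msg = reg.recv_json()     # either a registration or a batch of extracted URLs
            reg.send_json({"ok": True})
            if msg.get("type") == "urls":
                frontier.extend(msg["urls"])
            while frontier:
                out.send_string(frontier.pop(0))

    def worker(manager_host):
        ctx = zmq.Context()
        reg = ctx.socket(zmq.REQ)
        reg.connect("tcp://%s:%d" % (manager_host, REG_PORT))
        inbox = ctx.socket(zmq.PULL)
        inbox.connect("tcp://%s:%d" % (manager_host, URL_PORT))

        reg.send_json({"type": "register"})
        reg.recv_json()
        while True:
            url = inbox.recv_string()
            extracted = []            # download `url`, parse it with BeautifulSoup, collect links
            reg.send_json({"type": "urls", "urls": extracted})
            reg.recv_json()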
Before you can run this crawler, you may need to download and install the following (an example install sequence follows the list):
- BeautifulSoup
- zmq core lib and pyzmq
- mongodb and pymongo
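For example, with pip (the package names are assumptions; older revisions of the scripts may expect the Python 2 BeautifulSoup package rather than beautifulsoup4, and the MongoDB server itself has to be installed separately, e.g. through your OS package manager):

    pip install beautifulsoup4 pyzmq pymongo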
Start the manager on the master node.
Usage: crawlerManager.py [options]

Options:
  -h, --help            show this help message and exit
  -f FILE, --file=FILE  the file which contains the web sites from which to
                        start crawling; ./conf/seeds.cfg is used by default.
  -p REGPORT, --port=REGPORT
                        port on which connection requests are expected.
  -d URLPORT, --urlPort=URLPORT
                        port on which URLs are sent to workers.
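For example (the port numbers here are arbitrary; any free ports will do):

    python crawlerManager.py -f ./conf/seeds.cfg -p 5000 -d 5001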
Start workers on the master node or on any other hosts.
Usage: crawlerWorker.py [options]

Options:
  -h, --help            show this help message and exit
  -m MANAGER, --manager=MANAGER
                        the name/ip of the host on which the manager is started.
  -p REGPORT, --port=REGPORT
                        port to connect to the manager.
  -d DOWNLOADERS, --download=DOWNLOADERS
                        number of threads which download web pages; 4 by default.
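For example, to start a worker that connects to a manager running on 192.168.1.10 (the host, port, and thread count are illustrative):

    python crawlerWorker.py -m 192.168.1.10 -p 5000 -d 8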