This is a simple aggregation framework for web data, written in JavaScript.

The framework needs `request_template` and `job` metadata to be passed to the highest-level API defined in `lib/main.js`. Both must be JavaScript objects: you can define the `request_template` and `job` metadata as JSON files, parse them into objects, then call `scrape()` from `lib/main.js`.
Let's describe each component one by one:
- `request_template` describes the web scraping logic. You can put in as many websites as you want, differentiated by id.
- `job` describes the web scraping execution. You can create as many jobs as you want, differentiated by id. Each job may execute multiple request ids defined in `request_template` by storing them in the `request` field as an array.
- The API client describes which `job` should be executed by `lib/main.js`.
- `scrape()` is the highest-level API, defined in `lib/main.js`, which maintains all web request objects.
- The job queue is a queue manager which manages and executes the web request objects.
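The relationship between these components can be sketched in plain JavaScript. This is only an illustrative model of the data flow (the `resolveRequests` helper and all values are invented for this sketch, not part of the framework): each entry in a job's `request` array names a request id, which is looked up in `request_template` and merged with the job's `fill` values.

```javascript
// A template keyed by request id, and a job that references it.
const requestTemplates = {
  search: { url: 'https://example.com/search', method: 'GET' }
};
const job = {
  request: [{ search: { fill: { 'qs.q': 'news' } } }]
};

// Resolve each request id in the job against the templates,
// merging the job's `fill` values into a concrete request object.
function resolveRequests(job, templates) {
  return job.request.map((entry) => {
    const [id] = Object.keys(entry);
    return { id, ...templates[id], fill: entry[id].fill };
  });
}

console.log(resolveRequests(job, requestTemplates));
// → [ { id: 'search', url: 'https://example.com/search',
//       method: 'GET', fill: { 'qs.q': 'news' } } ]
```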
The recommended steps for using this tool are:
- Create the `request_template` metadata as a JSON file. Put it in the `input` directory if you want the straightforward implementation, or in any directory in your project. The `request_template` metadata should follow the rules defined in the request documentation. An example of `request_template`:

```
{
  "<REQUEST_ID>": {
    "url": <URL>,
    "fillable": ["qs.<KEY>", "qs.<KEY>", ..., "qs.<KEY>"],
    "method": "GET",
    "selector": {
      "<SELECTOR_ID>": "<SELECTOR>"
    }
  },
  ...
}
```
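As a concrete illustration, a filled-in template might look like this (the `search` id, url, query keys, and selector are invented for this example and are not part of the framework):

```json
{
  "search": {
    "url": "https://example.com/search",
    "fillable": ["qs.q", "qs.page"],
    "method": "GET",
    "selector": {
      "title": "h1.result-title"
    }
  }
}
```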
- Create the `job` metadata as a JSON file. Put it in the `input` directory if you want the straightforward implementation, or in any directory in your project. The `job` metadata should follow the rules defined in the job documentation. An example of `job`:

```
{
  "<JOB_ID>": {
    "type": "JOB_TYPE",
    "request": [
      {
        "<REQUEST_ID>": {
          "fill": {
            "qs.<KEY>": <INPUT>,
            ...,
            "qs.<KEY>": <INPUT>
          }
        }
      }
    ],
    "cache": 1,
    "output": {
      "type": "<OUTPUT_TYPE>",
      "name": "<OUTPUT_NAME>",
      "format": {
        "<KEY>": "<COMMAND>",
        ...,
        "<KEY>": "<COMMAND>"
      }
    },
    "priority": <PRIORITY_LEVEL>,
    "delay": <DELAY_MS>,
    "thread": <NUMBER_OF_THREAD>,
    "status": <JOB_STATUS>
  },
  ...
}
```
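A hypothetical filled-in job, assuming a template with id `search` and fillable keys `qs.q` and `qs.page`. The valid values for `<JOB_TYPE>`, `<OUTPUT_TYPE>`, `<COMMAND>`, `<PRIORITY_LEVEL>`, and `<JOB_STATUS>` depend on the job documentation, so they are left as placeholders here:

```json
{
  "search-job": {
    "type": "<JOB_TYPE>",
    "request": [
      { "search": { "fill": { "qs.q": "news", "qs.page": 1 } } }
    ],
    "cache": 1,
    "output": {
      "type": "<OUTPUT_TYPE>",
      "name": "search-results",
      "format": { "title": "<COMMAND>" }
    },
    "priority": "<PRIORITY_LEVEL>",
    "delay": 1000,
    "thread": 2,
    "status": "<JOB_STATUS>"
  }
}
```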
- Create a new API client as a JavaScript file in one of these directories:
  - the `test` directory, if you want to do unit testing;
  - this root directory, if you want the straightforward implementation;
  - your project directory, anywhere you can access `web-aggregation/lib/main.js`.
An example of an API client created in this root directory:

```javascript
const fs = require('fs');
const path = require('path');
// Require the framework relative to this root directory.
const scraper = require('./lib/main.js');

// Load the job and request_template metadata from the examples directory.
let jobs = JSON.parse(
  fs.readFileSync(path.join(__dirname, 'examples', <JOB_JSON_FILE>), 'utf8')
);
let requestTemplates = JSON.parse(
  fs.readFileSync(path.join(__dirname, 'examples', <REQUEST_JSON_FILE>), 'utf8')
);

scraper.scrape(jobs, requestTemplates);
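Before handing the parsed objects to `scrape()`, it can be useful to sanity-check that every request id referenced by a job exists in the templates. A small helper sketch (the `validateJobs` helper and the example data are illustrative, not part of the framework):

```javascript
// Check that every <REQUEST_ID> used in a job's `request` array
// has a matching entry in the request templates.
function validateJobs(jobs, templates) {
  const missing = [];
  for (const [jobId, job] of Object.entries(jobs)) {
    for (const entry of job.request || []) {
      for (const requestId of Object.keys(entry)) {
        if (!(requestId in templates)) missing.push(`${jobId} -> ${requestId}`);
      }
    }
  }
  return missing; // an empty array means every reference resolves
}

// One good reference ('search') and one dangling reference ('detail'):
const exampleTemplates = { search: { url: 'https://example.com/search', method: 'GET' } };
const exampleJobs = {
  job1: { request: [{ search: { fill: {} } }, { detail: { fill: {} } }] }
};

console.log(validateJobs(exampleJobs, exampleTemplates)); // → [ 'job1 -> detail' ]
```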
- If you are using the straightforward implementation, just execute your API client: `node path/to/client.js`.
- Create a queue management dashboard to monitor job status by executing:

```shell
node_modules/kue/bin/kue-dashboard -p 3050 -r redis://127.0.0.1:3000
```
- Go to the `examples` directory for working JSON examples.
- Go to the `test` directory for working script examples.
- Go to the `docs` directory for straightforward guides.
