Active Tab Webpage Crawler

Definitions

Some websites only show certain information after you log in.
"Active tab" means this crawler runs inside a browser tab, so it should work with any web UI.

Benefits

What you see is what you can crawl.
Pure HTML or Shadow DOM data is pulled, as if you are viewing the webpage.
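
Shadow DOM content is not returned by a plain querySelectorAll call on the document, so a crawler has to step into each shadow root itself. A minimal sketch, assuming the shadow roots are open (closed ones are not reachable from page script); deepQueryAll is a hypothetical helper name, not something taken from this repository:

```javascript
// Collect elements matching `selector` in `root` and in any open shadow roots below it.
function deepQueryAll(root, selector) {
  const found = Array.from(root.querySelectorAll(selector));
  for (const el of root.querySelectorAll('*')) {
    // el.shadowRoot is null for plain elements and for closed shadow roots.
    if (el.shadowRoot) {
      found.push(...deepQueryAll(el.shadowRoot, selector));
    }
  }
  return found;
}

// Usage in the dev console (the selector is a placeholder):
// deepQueryAll(document, 'a').map((a) => a.href);
```

Results come back with the light-DOM matches first, followed by matches found inside each shadow root.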

Disadvantages

Cannot pull a large amount of data in a very short time.
Too many webpage requests or RESTful pull requests may result in your IP being blocked.
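
One common way to lower the risk of being blocked is to send requests one at a time with a pause in between. A minimal sketch; sleep, crawlSlowly, and the 2000 ms default are illustrative choices rather than values from this repository, and no particular delay guarantees that a site will not block you:

```javascript
// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `handler` on each item one at a time, pausing `delayMs` between calls.
async function crawlSlowly(items, handler, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    // No need to wait after the last item.
    if (i < items.length - 1) {
      await sleep(delayMs);
    }
  }
  return results;
}

// Usage (the handler is a placeholder for your own request logic):
// crawlSlowly(urls, (url) => fetch(url).then((r) => r.text()));
```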

Prerequisites

A browser such as Chrome or Firefox
The browser's development console
(Optional) jQuery. (You can use the Tampermonkey extension to sideload script files into webpages; check out /tampermonkey/jQuery.js.)
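
As a rough illustration of sideloading, a Tampermonkey userscript can pull in jQuery with an @require line in its metadata block. This is a sketch under assumptions, not necessarily what /tampermonkey/jQuery.js does: the @match pattern and the jQuery CDN URL are examples, and jqueryVersion is a hypothetical helper for checking the result.

```javascript
// ==UserScript==
// @name         Sideload jQuery (sketch)
// @match        https://www.dictionary.com/*
// @require      https://code.jquery.com/jquery-3.7.1.min.js
// @grant        none
// ==/UserScript==

// Report which jQuery version, if any, is visible on the given window-like object.
function jqueryVersion(win) {
  return win && win.jQuery ? win.jQuery.fn.jquery : null;
}

// Only runs in a browser; logs null when jQuery is not available.
if (typeof window !== 'undefined') {
  console.log('jQuery version:', jqueryVersion(window));
}
```

If the page already ships its own jQuery, the sideloaded copy may replace it, so check the logged version before relying on it.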

Pure HTML and RESTful processing crawler

This is good for static-ish webpages.
Signs of such webpages:

They are usually very structured, so running the same jQuery query across pages is simple.
The data for each webpage can be accessed through a URL, either a direct *.html file or a RESTful endpoint.

Example

Dictionary.com/

You can access different vocabulary entries directly from the URLs.
(https://www.dictionary.com/browse/banana)
(https://www.dictionary.com/browse/admin)

Steps

I will use hkcards and Finviz as examples.

1. Understand how to request data from URLs

(https://www.hkcards.com/cj/cj-char-丁.html)
(https://www.hkcards.com/cj/cj-char-解.html)
(https://www.hkcards.com/cj/cj-char-點.html)

(https://finviz.com/screener.ashx?v=111&f=idx_sp500&ft=4&o=-marketcap&r=1)
(https://finviz.com/screener.ashx?v=111&f=idx_sp500&ft=4&o=-marketcap&r=21)
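
The URL patterns above can be turned into small builder functions. A minimal sketch: hkcardsUrl and finvizPageUrl are hypothetical names, and the page size of 20 rows is an assumption inferred only from the r=1 and r=21 examples.

```javascript
// Build the hkcards page URL for one Chinese character.
function hkcardsUrl(char) {
  return 'https://www.hkcards.com/cj/cj-char-' + encodeURIComponent(char) + '.html';
}

// Build the Finviz screener URL for a zero-based page index.
// Assumption: `r` is the 1-based index of the first row, with 20 rows per page.
function finvizPageUrl(pageIndex, rowsPerPage = 20) {
  const params = new URLSearchParams({
    v: '111',
    f: 'idx_sp500',
    ft: '4',
    o: '-marketcap',
    r: String(1 + pageIndex * rowsPerPage),
  });
  return 'https://finviz.com/screener.ashx?' + params.toString();
}
```

The character is percent-encoded in the generated hkcards URL; browsers treat that the same as the raw character shown in the examples.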

2. Query the data you need inside the webpage
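
Querying usually comes down to finding a selector that matches the elements you want and reading their text. A minimal sketch using plain DOM calls; extractText is a hypothetical helper and 'td' is a placeholder selector, since the right selector depends on the page:

```javascript
// Return the trimmed text of every element under `root` that matches `selector`.
function extractText(root, selector) {
  return Array.from(root.querySelectorAll(selector), (el) => el.textContent.trim());
}

// Usage in the dev console:
// extractText(document, 'td');
// With jQuery loaded, a comparable query would be:
// $('td').map((i, el) => $(el).text().trim()).get();
```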

3. Write the logic to navigate to all URLs

This is case by case. Using dictionary.com as an example:

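const baseURL = 'https://www.dictionary.com/browse/';

// Example words (taken from the URLs above); replace with your own list.
const words = ['banana', 'admin'];

// ajaxRequestData holds what you request for each page
// (compare the query part of http://example.com?r=1); here, one full URL per word.
var ajaxRequestData = [];

for (const w of words) {
  ajaxRequestData.push(baseURL + w);
}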

4. Rewrite the code to handle your data

Just plug your step 2 code into the script.
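
How the pieces could fit together is sketched below. This is not the repository's script: fetchDocument and crawlAll are hypothetical names, and the processPage callback stands in for your step 2 code. It assumes the script runs in a tab on the same site, since fetch calls from the console are subject to the browser's cross-origin rules.

```javascript
// Fetch one page and parse it into a Document (browser only: uses fetch and DOMParser).
async function fetchDocument(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error('HTTP ' + response.status + ' for ' + url);
  const html = await response.text();
  return new DOMParser().parseFromString(html, 'text/html');
}

// Visit each URL in order, run the step 2 logic on it, and collect the results.
async function crawlAll(urls, processPage, loadPage = fetchDocument) {
  const allProcessedData = [];
  for (const url of urls) {
    const page = await loadPage(url);
    allProcessedData.push({ url, data: processPage(page) });
  }
  return allProcessedData;
}

// Usage in the dev console (doc.title is a placeholder for your own query):
// crawlAll(ajaxRequestData, (doc) => doc.title)
//   .then((result) => { window.allProcessedData = result; });
```

The loadPage parameter exists so the loop can be tried with a stub before pointing it at a real site.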

5. Copy the whole script into the browser dev console and let it run

6. Save your data into a JSON file

Evaluate the allProcessedData variable in the dev console, then right-click the result and choose "Copy Object" to save it.
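
For larger results, downloading a file can be more convenient than copying from the console. A minimal sketch, assuming allProcessedData is JSON-serializable; toJsonText, downloadJson, and the default filename are hypothetical:

```javascript
// Serialize crawled data as indented JSON text.
function toJsonText(data) {
  return JSON.stringify(data, null, 2);
}

// Trigger a browser download of `data` as a .json file (browser only).
function downloadJson(data, filename = 'allProcessedData.json') {
  const blob = new Blob([toJsonText(data)], { type: 'application/json' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = filename;
  document.body.appendChild(link);
  link.click();
  link.remove();
  // Release the object URL once the download has had time to start.
  setTimeout(() => URL.revokeObjectURL(link.href), 1000);
}

// Usage in the dev console:
// downloadJson(allProcessedData);
```

In the Chrome and Firefox consoles, copy(JSON.stringify(allProcessedData, null, 2)) is another option that places the text on the clipboard.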

Non-RESTful dynamic webpage crawler

Some webpages only load data when you click a button, so the data cannot be requested directly from URLs.
Those are harder to crawl.
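
One possible approach for such pages is to click the button from script and poll until the new content shows up. A minimal sketch, not the repository's method: waitFor and clickAndWaitForRows are hypothetical helpers, the selectors are placeholders, and the timeout and polling interval are arbitrary defaults.

```javascript
// Poll `condition` until it returns a truthy value or `timeoutMs` elapses.
function waitFor(condition, timeoutMs = 10000, intervalMs = 200) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      const value = condition();
      if (value) {
        resolve(value);
      } else if (Date.now() >= deadline) {
        reject(new Error('waitFor timed out after ' + timeoutMs + ' ms'));
      } else {
        setTimeout(check, intervalMs);
      }
    };
    check();
  });
}

// Click a button, then wait until more matching rows have appeared (browser only).
async function clickAndWaitForRows(buttonSelector, rowSelector) {
  const before = document.querySelectorAll(rowSelector).length;
  document.querySelector(buttonSelector).click();
  await waitFor(() => document.querySelectorAll(rowSelector).length > before);
  return Array.from(document.querySelectorAll(rowSelector));
}

// Usage in the dev console (both selectors are placeholders):
// clickAndWaitForRows('button.load-more', '.result-row').then(console.log);
```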
