Some websites only show certain information after you log in.
"Active tab" means this crawler runs inside a browser, so it should work with any web UI.
What you see is what you can crawl.
Pure HTML or Shadow DOM data is pulled, as if you are viewing the webpage.
It cannot pull a large amount of data in a very short time.
Too many webpage requests or RESTful calls may result in your IP being blocked.
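
One way to lower the risk of being blocked is to space requests out. The following is a minimal sketch, not part of this repository's script: `sleep` and `fetchPolitely` are hypothetical helper names, and the one-second default delay is an assumed value you would tune per site.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing between requests.
// `delayMs` is an assumed, tunable value; `fetchFn` defaults to the browser's fetch.
async function fetchPolitely(urls, delayMs = 1000, fetchFn = (url) => fetch(url)) {
  const responses = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // wait before every request except the first
    responses.push(await fetchFn(urls[i]));
  }
  return responses;
}
```

Requests run strictly one after another, which is slower than firing them in parallel but keeps the load on the target site predictable.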
A browser such as Chrome or Firefox
The browser's developer console
(Optional) jQuery. You can use the Tampermonkey extension to sideload script files into webpages; see /tampermonkey/jQuery.js.
This is good for static-ish webpages.
Signs of such webpages are:
- they are usually highly structured, so reusing the same jQuery selectors across pages is simple
- the data for each webpage can be accessed through a URL, either a direct *.html file or a RESTful endpoint
You can access different vocabulary entries directly from their URLs.
(https://www.dictionary.com/browse/banana)\
(https://www.dictionary.com/browse/admin)
I will use hkcards and Finviz as examples.
(https://www.hkcards.com/cj/cj-char-丁.html)\
(https://www.hkcards.com/cj/cj-char-解.html)\
(https://www.hkcards.com/cj/cj-char-點.html)\
(https://finviz.com/screener.ashx?v=111&f=idx_sp500&ft=4&o=-marketcap&r=1)\
(https://finviz.com/screener.ashx?v=111&f=idx_sp500&ft=4&o=-marketcap&r=21)
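
The URL patterns above can be generated rather than typed by hand. This is a rough sketch under stated assumptions: `buildFinvizUrls` and `buildHkcardsUrls` are hypothetical names, and the page size of 20 rows is only inferred from the two Finviz examples (`r=1` and `r=21`), so check it against the live site.

```javascript
// Build Finviz screener URLs for consecutive result pages.
// Assumption: `r` is the 1-based index of the first row on the page,
// and each page holds 20 rows (inferred from r=1 and r=21).
function buildFinvizUrls(pageCount, rowsPerPage = 20) {
  const base = 'https://finviz.com/screener.ashx?v=111&f=idx_sp500&ft=4&o=-marketcap';
  const urls = [];
  for (let page = 0; page < pageCount; page++) {
    urls.push(`${base}&r=${page * rowsPerPage + 1}`);
  }
  return urls;
}

// Build hkcards URLs, one per character.
function buildHkcardsUrls(chars) {
  return chars.map((ch) => `https://www.hkcards.com/cj/cj-char-${ch}.html`);
}
```

For example, `buildFinvizUrls(2)` reproduces the two Finviz URLs listed above, and `buildHkcardsUrls(['丁', '解', '點'])` reproduces the three hkcards URLs.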
This varies case by case. Using dictionary.com as an example:\
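const baseURL = 'https://www.dictionary.com/browse/';
// Example word list; replace it with the words you want to look up.
const words = ['banana', 'admin'];
// ajaxRequestData holds what each request needs; here, one full URL per word,
// e.g. https://www.dictionary.com/browse/banana
var ajaxRequestData = [];
for (let w of words) {
  ajaxRequestData.push(baseURL + w);
}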
Plug your step 2 code into the script.
Evaluate the allProcessedData variable in the dev console, then right-click the result and choose "Copy Object" to save it.
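
For orientation, the overall flow (take a list of URLs, fetch each page, parse it, and collect the results into `allProcessedData`) might look roughly like the sketch below. It is not the repository's actual script: `crawlPages`, `fetchText`, and `parsePage` are hypothetical names, and the per-URL error record is an assumed design choice.

```javascript
// Fetch each URL, parse it, and collect the results.
// `fetchText(url)` must return a Promise of the page's HTML text;
// `parsePage(html, url)` stands in for your own step 2 code.
async function crawlPages(urls, fetchText, parsePage) {
  const allProcessedData = [];
  for (const url of urls) {
    try {
      const html = await fetchText(url);
      allProcessedData.push({ url, data: parsePage(html, url) });
    } catch (err) {
      // Keep going if one page fails, but record which one.
      allProcessedData.push({ url, error: String(err) });
    }
  }
  return allProcessedData;
}

// In the browser console, usage might look like this (page title as a stand-in parser):
// crawlPages(
//   ajaxRequestData,
//   (url) => fetch(url).then((res) => res.text()),
//   (html) => new DOMParser().parseFromString(html, 'text/html').title
// ).then((result) => { window.allProcessedData = result; });
```

Recording failures alongside successes makes it easy to see afterwards which URLs need a retry.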
Some webpages only load data when you click a button, so the data cannot be requested directly from URLs.
Those are harder to crawl.