Commit 58f6ab2

Merge branch 'master' into use-streams-instead-of-inmemory-storage
2 parents 0ad6c82 + 8d32bc5 commit 58f6ab2

26 files changed

+276
-5947
lines changed

.github/dependabot.yml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+version: 2
+updates:
+  - package-ecosystem: "npm"
+    directory: "/"
+    assignees:
+      - "s0ph1e"
+    open-pull-requests-limit: 10
+    schedule:
+      interval: "weekly"

.github/workflows/node.js.yml

Lines changed: 27 additions & 8 deletions
@@ -5,27 +5,46 @@ on:
     branches: [ master ]
   pull_request:
     branches: [ master ]
+  schedule:
+    - cron: '13 2 * * *'
 
 jobs:
   test:
-
-    runs-on: ubuntu-latest
-
+    runs-on: ${{ matrix.os }}
     strategy:
+      fail-fast: false
       matrix:
-        node-version: [12.x, 14.x, 15.x]
-        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/
+        node-version:
+          - 14
+          - 16
+          - 17
+        os:
+          - ubuntu-latest
+          - windows-latest
+        include:
+          - node-version: 16
+            os: macos-latest
 
     steps:
     - uses: actions/checkout@v2
     - name: Use Node.js ${{ matrix.node-version }}
       uses: actions/setup-node@v2
       with:
         node-version: ${{ matrix.node-version }}
-    - run: npm ci
+    - run: npm i
     - run: npm test
-    - name: Coveralls
-      if: ${{ matrix.node-version == '14.x' }}
+    - run: npm run eslint
+      if: ${{ matrix.node-version == '16' && matrix.os == 'ubuntu-latest' }}
+    - name: Coveralls
+      if: ${{ matrix.node-version == '16' && matrix.os == 'ubuntu-latest' }}
       uses: coverallsapp/github-action@master
       with:
         github-token: ${{ secrets.GITHUB_TOKEN }}
+    - name: Publish codeclimate code coverage
+      if: ${{ matrix.node-version == '16' && matrix.os == 'ubuntu-latest' }}
+      uses: paambaati/[email protected]
+      env:
+        CC_TEST_REPORTER_ID: ${{ secrets.CC_TEST_REPORTER_ID }}
+      with:
+        coverageLocations: |
+          ${{github.workspace}}/coverage/lcov.info:lcov
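
To make the effect of the new matrix easier to see, here is a rough sketch that enumerates the jobs it should produce and picks out the one job on which the eslint and coverage steps run. It is plain JavaScript written for illustration, not anything GitHub Actions executes, and it simplifies the `include` rule: it assumes each `include` entry sets `os` to a value outside the base matrix, so the entry becomes an extra job instead of being merged into an existing one.

```javascript
// Values copied from the workflow change above.
const matrix = {
	'node-version': [14, 16, 17],
	os: ['ubuntu-latest', 'windows-latest']
};
const include = [{ 'node-version': 16, os: 'macos-latest' }];

// Cross product of the base matrix: 3 Node.js versions x 2 operating systems.
const jobs = [];
for (const nodeVersion of matrix['node-version']) {
	for (const os of matrix.os) {
		jobs.push({ 'node-version': nodeVersion, os });
	}
}

// Simplified include handling (assumption stated above): append as a new job.
jobs.push(...include);

// Mirrors the `if:` condition guarding eslint, Coveralls and Code Climate.
const gated = jobs.filter(job => job['node-version'] === 16 && job.os === 'ubuntu-latest');

console.log(jobs.length, gated.length);
```

Under these assumptions the workflow runs seven jobs per trigger, and the lint and coverage uploads happen exactly once, on Node.js 16 with ubuntu-latest.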

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 .idea
 .DS_Store
 node_modules
+package-lock.json
 npm-debug.log
 coverage
 test/e2e/results

BACKERS.md

Lines changed: 8 additions & 2 deletions
@@ -2,8 +2,14 @@
 
 Website-scraper module is an Open Source Software maintained mostly by one developer in free time.
 Maintaining and developing new features to this project takes a considerable amount of time.
-If website-scraper has helped you in your work or personal projects - you're welcome to make a recurring pledge.
-If you want to thank the author you can use [Patreon](https://www.patreon.com/s0ph1e).
+If website-scraper has helped you in your work or personal projects - you're welcome to make a recurring or one-time pledge.
+
+If you want to thank the author you can use [Patreon](https://www.patreon.com/s0ph1e) or [GitHub Sponspors](https://github.com/sponsors/s0ph1e/).
+Thank you for your support ❤️
 
 ### Backers via Patreon
 * The Rubyist
+
+### Backers via GitHub Sponspors
+* swcarlosrj
+* francescamarano

README.md

Lines changed: 2 additions & 6 deletions
@@ -1,12 +1,8 @@
-# ⚠️ Beware: this documentation is for next major version 5. Most probably you need current version [4.x documentation](https://github.com/website-scraper/node-website-scraper/blob/4.x/README.md)
-
 [![Version](https://img.shields.io/npm/v/website-scraper.svg?style=flat)](https://www.npmjs.org/package/website-scraper)
 [![Downloads](https://img.shields.io/npm/dm/website-scraper.svg?style=flat)](https://www.npmjs.org/package/website-scraper)
 [![Node.js CI](https://github.com/website-scraper/node-website-scraper/actions/workflows/node.js.yml/badge.svg)](https://github.com/website-scraper/node-website-scraper)
-[![Build status](https://ci.appveyor.com/api/projects/status/s7jxui1ngxlbgiav/branch/master?svg=true)](https://ci.appveyor.com/project/s0ph1e/node-website-scraper/branch/master)
 [![Test Coverage](https://codeclimate.com/github/website-scraper/node-website-scraper/badges/coverage.svg)](https://codeclimate.com/github/website-scraper/node-website-scraper/coverage)
-[![Dependency Status](https://david-dm.org/website-scraper/node-website-scraper.svg?style=flat)](https://david-dm.org/website-scraper/node-website-scraper)
-[![Gitter](https://badges.gitter.im/website-scraper/node-website-scraper.svg)](https://gitter.im/website-scraper/node-website-scraper?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+[![Coverage Status](https://coveralls.io/repos/github/website-scraper/node-website-scraper/badge.svg?branch=master)](https://coveralls.io/github/website-scraper/node-website-scraper?branch=master)
 
 
 # website-scraper
@@ -23,7 +19,7 @@ Try it in [demo app](https://scraper.nepochataya.pp.ua/) ([source](https://githu
 This module is an Open Source Software maintained by one developer in free time. If you want to thank the author of this module you can use [GitHub Sponsors](https://github.com/sponsors/s0ph1e) or [Patreon](https://www.patreon.com/s0ph1e).
 
 ## Requirements
-* nodejs version >= 12.17.0
+* nodejs version >= 14.14
 * website-scraper v5 is pure ESM (it doesn't work with CommonJS), [read more in release v5.0.0 docs](https://github.com/website-scraper/node-website-scraper/releases/tag/v5.0.0)
 
 ## Installation

appveyor.yml

Lines changed: 0 additions & 17 deletions
This file was deleted.

lib/resource-handler/html/html-source-element.js

Lines changed: 2 additions & 6 deletions
@@ -1,7 +1,6 @@
 import ImgSrcsetTag from '../path-containers/html-img-srcset-tag.js';
 import CommonTag from '../path-containers/html-common-tag.js';
 import CssText from '../path-containers/css-text.js';
-import { decodeHtmlEntities, encodeHtmlEntities } from '../../utils/index.js';
 
 const pathContainersByRule = [
 	{ selector: '[style]', attr: 'style', containerClass: CssText },
@@ -29,18 +28,15 @@ class HtmlSourceElement {
 	 * @returns {string}
 	 */
 	getData () {
-		const text = this.rule.attr ? this.el.attr(this.rule.attr) : this.el.text();
-		return decodeHtmlEntities(text);
+		return this.rule.attr ? this.el.attr(this.rule.attr) : this.el.html();
 	}
 
 	/**
 	 * Update attribute or inner text of el with new data
 	 * @param {string} newData
 	 */
 	setData (newData) {
-		// todo: encode can be removed after https://github.com/cheeriojs/cheerio/issues/957 fixed
-		const escapedData = encodeHtmlEntities(newData);
-		this.rule.attr ? this.el.attr(this.rule.attr, escapedData) : this.el.text(newData);
+		this.rule.attr ? this.el.attr(this.rule.attr, newData) : this.el.html(newData);
 	}
 
 	removeIntegrityCheck () {

lib/resource-handler/html/index.js

Lines changed: 1 addition & 4 deletions
@@ -81,10 +81,7 @@ function prepareToLoad ($, resource) {
 }
 
 function loadTextToCheerio (text) {
-	return cheerio.load(text, {
-		decodeEntities: false,
-		lowerCaseAttributeNames: false,
-	});
+	return cheerio.load(text);
 }
 
 export default HtmlResourceHandler;

lib/resource-handler/path-containers/html-img-srcset-tag.js

Lines changed: 5 additions & 6 deletions
@@ -1,10 +1,9 @@
-import srcset from 'srcset';
-import _ from 'lodash';
+import { parseSrcset, stringifySrcset } from 'srcset';
 
 class HtmlImgSrcSetTag {
 	constructor (text) {
 		this.text = text || '';
-		this.imgSrcsetParts = srcset.parse(this.text);
+		this.imgSrcsetParts = parseSrcset(this.text);
 		this.paths = this.imgSrcsetParts.map(imgSrcset => imgSrcset.url);
 	}
 
@@ -14,13 +13,13 @@ class HtmlImgSrcSetTag {
 
 	updateText (pathsToUpdate) {
 		const imgSrcsetParts = this.imgSrcsetParts;
-		pathsToUpdate.forEach(function updatePath (path) {
-			const srcsToUpdate = _.filter(imgSrcsetParts, {url: path.oldPath});
+		pathsToUpdate.forEach((path) => {
+			const srcsToUpdate = imgSrcsetParts.filter(imgSrcsetPart => imgSrcsetPart.url === path.oldPath);
 			srcsToUpdate.forEach((srcToUpdate) => {
 				srcToUpdate.url = path.newPath;
 			});
 		});
-		return srcset.stringify(imgSrcsetParts);
+		return stringifySrcset(imgSrcsetParts);
 	}
 }

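
The lodash-free rewrite of `updateText` can be tried in isolation. The sketch below reproduces only the filter-and-rename step on hand-written candidate objects, so it does not import `srcset` at all. The `updateParts` helper, the `density` field, and the sample paths are made up for illustration; the only thing taken from the diff is that each part carries a `url` and each update carries `oldPath` and `newPath`.

```javascript
// Hand-written stand-ins for parsed srcset candidates (illustrative values).
const imgSrcsetParts = [
	{ url: 'img/a.png', density: 1 },
	{ url: 'img/a@2x.png', density: 2 },
	{ url: 'img/a.png', density: 3 }
];

// Hypothetical helper mirroring the loop in updateText: every part whose url
// equals oldPath is rewritten in place to newPath.
function updateParts (parts, pathsToUpdate) {
	pathsToUpdate.forEach((path) => {
		parts
			.filter(part => part.url === path.oldPath)
			.forEach((part) => {
				part.url = path.newPath;
			});
	});
	return parts;
}

const updated = updateParts(imgSrcsetParts, [
	{ oldPath: 'img/a.png', newPath: 'images/a.png' }
]);

console.log(updated.map(part => part.url));
// → [ 'images/a.png', 'img/a@2x.png', 'images/a.png' ]
```

`Array.prototype.filter` with a strict equality check matches the same parts as the previous `_.filter(imgSrcsetParts, {url: path.oldPath})` call for string URLs, which is why the lodash import could be dropped from this file.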

lib/scraper.js

Lines changed: 3 additions & 2 deletions
@@ -291,13 +291,14 @@ class Scraper {
 
 	async requestResource (resource) {
 		const url = resource.getUrl();
+		const depth = resource.getDepth();
 
-		if (this.options.urlFilter && !this.options.urlFilter(url)) {
+		if (this.options.urlFilter && depth > 0 && !this.options.urlFilter(url)) {
 			logger.debug('filtering out ' + resource + ' by url filter');
 			return null;
 		}
 
-		if (this.options.maxDepth && resource.getDepth() > this.options.maxDepth) {
+		if (this.options.maxDepth && depth > this.options.maxDepth) {
 			logger.debug('filtering out ' + resource + ' by depth');
 			return null;
 		}
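
The new `depth > 0` guard means `urlFilter` is no longer applied to resources at depth 0. A minimal sketch of the resulting decision logic follows, assuming that depth 0 corresponds to the top-level URLs passed to the scraper. `shouldRequest` is a hypothetical stand-in for the checks at the top of `requestResource`, not a function in the library, and the example URLs are invented.

```javascript
// Hypothetical stand-in for the two early-return checks in requestResource.
function shouldRequest ({ url, depth }, options) {
	// Depth-0 resources skip the url filter entirely (the change in this commit).
	if (options.urlFilter && depth > 0 && !options.urlFilter(url)) {
		return false;
	}
	// The depth limit is unchanged; it now reuses the cached depth value.
	if (options.maxDepth && depth > options.maxDepth) {
		return false;
	}
	return true;
}

const options = {
	urlFilter: url => url.startsWith('https://example.com/'),
	maxDepth: 2
};

const decisions = [
	shouldRequest({ url: 'https://other.example/', depth: 0 }, options),
	shouldRequest({ url: 'https://other.example/a.css', depth: 1 }, options),
	shouldRequest({ url: 'https://example.com/a.css', depth: 1 }, options),
	shouldRequest({ url: 'https://example.com/deep', depth: 3 }, options)
];

console.log(decisions);
// → [ true, false, true, false ]
```

The first case is the one whose outcome changes: before this commit a depth-0 URL rejected by `urlFilter` would have been dropped, and after it the URL is still requested.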

lib/utils/index.js

Lines changed: 0 additions & 11 deletions
@@ -1,7 +1,6 @@
 import url from 'url';
 import path from 'path';
 import normalize from 'normalize-url';
-import htmlEntities from 'he';
 import _ from 'lodash';
 import typeByMime from '../config/resource-type-by-mime.js';
 import typeByExt from '../config/resource-type-by-ext.js';
@@ -134,14 +133,6 @@ function getTypeByFilename (filename) {
 	return typeByExt[ext];
 }
 
-function decodeHtmlEntities (text) {
-	return typeof text === 'string' ? htmlEntities.decode(text) : '';
-}
-
-function encodeHtmlEntities (text) {
-	return typeof text === 'string' ? htmlEntities.escape(text) : '';
-}
-
 function extend (first, second) {
 	return Object.assign({}, first, second);
 }
@@ -187,8 +178,6 @@ export {
 	isUriSchemaSupported,
 	getTypeByMime,
 	getTypeByFilename,
-	decodeHtmlEntities,
-	encodeHtmlEntities,
 	extend,
 	union,
 	isPlainObject,
