Commit 17e4299

Added jusText.
1 parent 90d3032 commit 17e4299

4 files changed: +576 -1 lines changed

README.rst

Lines changed: 5 additions & 1 deletion
@@ -18,7 +18,8 @@ and open-source libraries
 `news-please <https://github.com/fhamborg/news-please>`_.
 `Goose3 <https://github.com/goose3/goose3>`_,
 `inscriptis <https://github.com/weblyzard/inscriptis>`_,
-`html2text <https://github.com/Alir3z4/html2text>`_.
+`html2text <https://github.com/Alir3z4/html2text>`_,
+`jusText <https://github.com/miso-belica/jusText>`_.
 We release evaluation datasets and scripts,
 and provide more details in a whitepaper.

@@ -54,6 +55,7 @@ Result of packages added after original evaluation::
 goose3 precision=0.930 ± 0.015 recall=0.847 ± 0.021 F1=0.887 ± 0.016 accuracy=0.227 ± 0.032
 inscriptis precision=0.517 ± 0.017 recall=0.993 ± 0.001 F1=0.679 ± 0.015 accuracy=0.000 ± 0.000
 html2text precision=0.499 ± 0.017 recall=0.983 ± 0.002 F1=0.662 ± 0.015 accuracy=0.000 ± 0.000
+justext precision=0.858 ± 0.017 recall=0.754 ± 0.028 F1=0.802 ± 0.018 accuracy=0.088 ± 0.021

 Below you can find more details about the packages and result reproduction.

@@ -119,6 +121,8 @@ or external resources:
 converts HTML to text with a particular emphasis on nested tables
 - html2text: https://github.com/Alir3z4/html2text -
 converts HTML pages to Markup language
+- jusText: https://github.com/miso-belica/jusText -
+Heuristic based boilerplate removal tool

 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.

extractors/run_justext.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+#!/usr/bin/env python3
+import gzip
+import json
+from pathlib import Path
+
+import justext
+
+
+def main():
+    output = {}
+    for path in Path('html').glob('*.html.gz'):
+        with gzip.open(path, 'rt', encoding='utf8') as f:
+            html = f.read()
+        item_id = path.stem.split('.')[0]
+        article = ' '.join(
+            [p.text for p in justext.justext(html, justext.get_stoplist("English"), 50, 200, 0.1, 0.2, 0.2, 200, True)
+             if not p.is_boilerplate])
+        output[item_id] = {'articleBody': article}
+    (Path('output') / 'justext.json').write_text(
+        json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4),
+        encoding='utf8')
+
+
+if __name__ == '__main__':
+    main()
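
The script above derives each item id from the file name and writes one ``articleBody`` entry per page. The following is a minimal sketch of that bookkeeping only, using the standard library so it runs without jusText installed. The helper names (``item_id_from_path``, ``build_output``, ``demo``) and the stand-in extractor are hypothetical and not part of the commit.

```python
import gzip
import json
import tempfile
from pathlib import Path


def item_id_from_path(path: Path) -> str:
    """Derive the item id the way the script does: drop the '.html.gz' suffixes."""
    # Path('html/abc.html.gz').stem is 'abc.html'; the first dot-separated part is 'abc'.
    return path.stem.split('.')[0]


def build_output(html_dir: Path, extract) -> dict:
    """Apply `extract` (a hypothetical html -> text callable) to every gzipped page."""
    output = {}
    for path in sorted(html_dir.glob('*.html.gz')):
        with gzip.open(path, 'rt', encoding='utf8') as f:
            html = f.read()
        output[item_id_from_path(path)] = {'articleBody': extract(html)}
    return output


def demo() -> str:
    """Round-trip one tiny page through a temporary directory and return the JSON text."""
    with tempfile.TemporaryDirectory() as tmp:
        html_dir = Path(tmp)
        with gzip.open(html_dir / 'abc123.html.gz', 'wt', encoding='utf8') as f:
            f.write('<p>Hello</p>')
        # Stand-in extractor: the committed script calls justext.justext here instead.
        result = build_output(html_dir, lambda html: html.upper())
    return json.dumps(result, sort_keys=True, ensure_ascii=False, indent=4)
```

Note that an id containing a dot would be truncated at the first dot by this scheme. In the committed script, the positional arguments ``50, 200, 0.1, 0.2, 0.2, 200, True`` appear to correspond to jusText's ``length_low``, ``length_high``, ``stopwords_low``, ``stopwords_high``, ``max_link_density``, ``max_heading_distance`` and ``no_headings`` parameters; this mapping is an assumption worth checking against the installed jusText version.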
