Commit 9435f4c

Add html2text
1 parent af32b50 commit 9435f4c

4 files changed, with 576 additions and 1 deletion

README.rst

Lines changed: 4 additions & 1 deletion
@@ -13,7 +13,8 @@ and open-source libraries
 `html-text <https://github.com/TeamHG-Memex/html-text>`_,
 `trafilatura <https://github.com/adbar/trafilatura>`_,
 `go-readability <https://github.com/go-shiori/go-readability>`_,
-`Readability.js <https://github.com/mozilla/readability>`_.
+`Readability.js <https://github.com/mozilla/readability>`_,
+`html2text <https://github.com/Alir3z4/html2text>`_.
 We release evaluation datasets and scripts,
 and provide more details in a whitepaper.
 
@@ -44,6 +45,7 @@ Result of packages added after original evaluation::
 trafilatura precision=0.925 ± 0.011 recall=0.966 ± 0.009 F1=0.945 ± 0.009 accuracy=0.221 ± 0.031
 go_readability precision=0.912 ± 0.009 recall=0.975 ± 0.007 F1=0.943 ± 0.007 accuracy=0.210 ± 0.030
 readability_js precision=0.853 ± 0.013 recall=0.924 ± 0.012 F1=0.887 ± 0.012 accuracy=0.149 ± 0.026
+html2text precision=0.499 ± 0.017 recall=0.983 ± 0.002 F1=0.662 ± 0.015 accuracy=0.000 ± 0.000
 
 Below you can find more details about the packages and result reproduction.
 
@@ -102,6 +104,7 @@ or external resources:
 at https://github.com/scrapinghub/article-extraction-benchmark/pull/4
 - go-readability: https://github.com/go-shiori/go-readability
 - Readability.js: https://github.com/mozilla/readability
+- html2text: https://github.com/Alir3z4/html2text
 
 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.
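
As a quick sanity check on the new html2text result row, the sketch below recomputes F1 from the reported precision and recall. It assumes F1 is the harmonic mean of the two values, which this diff does not state; the function name `f1_score` is hypothetical and is not part of the benchmark's scripts. Because the benchmark may aggregate per-item scores differently, small rounding differences against the reported values are to be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Point estimates copied from the README result rows in this commit.
reported = {
    'trafilatura': (0.925, 0.966, 0.945),
    'readability_js': (0.853, 0.924, 0.887),
    'html2text': (0.499, 0.983, 0.662),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f'{name}: recomputed F1={recomputed:.3f}, reported F1={reported_f1:.3f}')
```

Under this assumption, the html2text row is internally consistent: very high recall combined with precision near 0.5 yields an F1 of roughly 0.66.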

extractors/run_html2text.py

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+#!/usr/bin/env python3
+import gzip
+import json
+from pathlib import Path
+
+from html2text import HTML2Text
+
+
+def main():
+    output = {}
+    for path in Path('html').glob('*.html.gz'):
+        with gzip.open(path, 'rt', encoding='utf8') as f:
+            html = f.read()
+        item_id = path.stem.split('.')[0]
+        h = HTML2Text()
+        h.ignore_links = True
+        h.ignore_images = True
+        content = h.handle(html)
+        output[item_id] = {'articleBody': content}
+    (Path('output') / 'html2text.json').write_text(
+        json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4),
+        encoding='utf8')
+
+
+if __name__ == '__main__':
+    main()
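
The following minimal sketch illustrates the file-handling part of the script above using only the standard library: how an item ID is derived from a `*.html.gz` path and what shape the resulting JSON takes. The helper names `item_id_from_path`, `build_output` and `demo` are hypothetical and do not appear in the commit, and the extractor passed in is a stand-in for `HTML2Text().handle`, so the sketch runs without html2text installed.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict


def item_id_from_path(path: Path) -> str:
    # 'abc123.html.gz' has stem 'abc123.html'; everything before the
    # first dot is kept as the item ID.
    return path.stem.split('.')[0]


def build_output(html_dir: Path,
                 extract: Callable[[str], str]) -> Dict[str, Dict[str, str]]:
    # Read every gzipped HTML file as UTF-8 text and store the extracted
    # text under the 'articleBody' key, as the script in this commit does.
    output = {}
    for path in sorted(html_dir.glob('*.html.gz')):
        with gzip.open(path, 'rt', encoding='utf8') as f:
            html = f.read()
        output[item_id_from_path(path)] = {'articleBody': extract(html)}
    return output


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        html_dir = Path(tmp)
        with gzip.open(html_dir / 'abc123.html.gz', 'wt', encoding='utf8') as f:
            f.write('<p>Hello</p>')
        # Stand-in extractor (assumption): upper-cases the raw HTML.
        output = build_output(html_dir, extract=lambda html: html.upper())
    return json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    print(demo())
```

One consequence of splitting on the first dot is that a file name containing an extra dot, such as `a.b.html.gz`, would map to the ID `a`; whether such names occur in the dataset is not shown in this commit.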
