@@ -13,7 +13,8 @@ and open-source libraries
`html-text <https://github.com/TeamHG-Memex/html-text>`_,
`trafilatura <https://github.com/adbar/trafilatura>`_,
`go-readability <https://github.com/go-shiori/go-readability>`_,
- `Readability.js <https://github.com/mozilla/readability>`_.
+ `Readability.js <https://github.com/mozilla/readability>`_,
+ `html2text <https://github.com/Alir3z4/html2text>`_.
We release evaluation datasets and scripts,
and provide more details in a whitepaper.

@@ -44,6 +45,7 @@ Result of packages added after original evaluation::
trafilatura precision=0.925 ± 0.011 recall=0.966 ± 0.009 F1=0.945 ± 0.009 accuracy=0.221 ± 0.031
go_readability precision=0.912 ± 0.009 recall=0.975 ± 0.007 F1=0.943 ± 0.007 accuracy=0.210 ± 0.030
readability_js precision=0.853 ± 0.013 recall=0.924 ± 0.012 F1=0.887 ± 0.012 accuracy=0.149 ± 0.026
+ html2text precision=0.499 ± 0.017 recall=0.983 ± 0.002 F1=0.662 ± 0.015 accuracy=0.000 ± 0.000

Below you can find more details about the packages and result reproduction.

@@ -102,6 +104,7 @@ or external resources:
at https://github.com/scrapinghub/article-extraction-benchmark/pull/4
- go-readability: https://github.com/go-shiori/go-readability
- Readability.js: https://github.com/mozilla/readability
+ - html2text: https://github.com/Alir3z4/html2text

Output from these libraries is already present in the repo in ``output/*.json`` files.
They were generated with ``extractors/run_*.py`` files.
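
As a rough sanity check on the result rows in this diff, the reported F1 values can be compared with the harmonic mean of the reported mean precision and recall. This is only a sketch: the function name ``f1_score`` and the ``reported`` dictionary are illustrative, and the benchmark's own scripts may aggregate F1 differently (for example per document), so small deviations are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the result rows in this diff.
reported = {
    "trafilatura": (0.925, 0.966, 0.945),
    "go_readability": (0.912, 0.975, 0.943),
    "readability_js": (0.853, 0.924, 0.887),
    "html2text": (0.499, 0.983, 0.662),
}

for name, (precision, recall, f1) in reported.items():
    approx = f1_score(precision, recall)
    print(f"{name:15s} reported F1={f1:.3f}  harmonic mean={approx:.3f}")
```

For html2text, the high recall combined with roughly 0.5 precision is consistent with a converter that keeps nearly all page text rather than isolating the article body, which would also explain the exact-match accuracy of 0.000.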