@@ -19,7 +19,8 @@ and open-source libraries
  `Goose3 <https://github.com/goose3/goose3>`_,
  `inscriptis <https://github.com/weblyzard/inscriptis>`_,
  `html2text <https://github.com/Alir3z4/html2text>`_,
- `jusText <https://github.com/miso-belica/jusText>`_.
+ `jusText <https://github.com/miso-belica/jusText>`_,
+ `BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_.
  We release evaluation datasets and scripts,
  and provide more details in a whitepaper.
@@ -56,6 +57,7 @@ Result of packages added after original evaluation::
  inscriptis     precision=0.517 ± 0.017  recall=0.993 ± 0.001  F1=0.679 ± 0.015  accuracy=0.000 ± 0.000
  html2text      precision=0.499 ± 0.017  recall=0.983 ± 0.002  F1=0.662 ± 0.015  accuracy=0.000 ± 0.000
  justext        precision=0.858 ± 0.017  recall=0.754 ± 0.028  F1=0.802 ± 0.018  accuracy=0.088 ± 0.021
+ beautifulsoup  precision=0.499 ± 0.017  recall=0.994 ± 0.001  F1=0.665 ± 0.015  accuracy=0.000 ± 0.000

  Below you can find more details about the packages and result reproduction.
@@ -123,6 +125,8 @@ or external resources:
  converts HTML pages to Markup language
  - jusText: https://github.com/miso-belica/jusText -
  Heuristic based boilerplate removal tool
+ - BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ -
+ Python library for pulling data out of HTML and XML files.

  Output from these libraries is already present in the repo in ``output/*.json`` files.
  They were generated with ``extractors/run_*.py`` files.
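
A quick way to sanity-check the result rows added in this diff is to recompute F1 from the reported mean precision and recall. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the helper name ``f1_score`` is hypothetical and is not taken from the repository's evaluation script. If the benchmark averages F1 per page, which the diff does not state, the recomputed values would only approximate the reported ones.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall copied from the result rows in the diff.
rows = {
    "inscriptis": (0.517, 0.993),
    "html2text": (0.499, 0.983),
    "justext": (0.858, 0.754),
    "beautifulsoup": (0.499, 0.994),
}

for name, (precision, recall) in rows.items():
    # Value recomputed from the means, not the benchmark's own F1 column.
    print(f"{name:<14} F1 from mean P/R = {f1_score(precision, recall):.3f}")
```

For the new ``beautifulsoup`` row, the recomputed value lands close to the reported F1. That fits a tool with near-perfect recall and roughly 50% precision, meaning it returns the article text along with a similar amount of other page text.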