Commit 91b9a72

Merge pull request scrapinghub#21 from ldulcic/beautifulsoup
Added BeautifulSoup
2 parents 319422a + 377ab15 commit 91b9a72

4 files changed: +575 -1 lines changed

README.rst

Lines changed: 5 additions & 1 deletion
@@ -19,7 +19,8 @@ and open-source libraries
 `Goose3 <https://github.com/goose3/goose3>`_,
 `inscriptis <https://github.com/weblyzard/inscriptis>`_,
 `html2text <https://github.com/Alir3z4/html2text>`_,
-`jusText <https://github.com/miso-belica/jusText>`_.
+`jusText <https://github.com/miso-belica/jusText>`_,
+`BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_.
 We release evaluation datasets and scripts,
 and provide more details in a whitepaper.

@@ -56,6 +57,7 @@ Result of packages added after original evaluation::
 inscriptis precision=0.517 ± 0.017 recall=0.993 ± 0.001 F1=0.679 ± 0.015 accuracy=0.000 ± 0.000
 html2text precision=0.499 ± 0.017 recall=0.983 ± 0.002 F1=0.662 ± 0.015 accuracy=0.000 ± 0.000
 justext precision=0.858 ± 0.017 recall=0.754 ± 0.028 F1=0.802 ± 0.018 accuracy=0.088 ± 0.021
+beautifulsoup precision=0.499 ± 0.017 recall=0.994 ± 0.001 F1=0.665 ± 0.015 accuracy=0.000 ± 0.000

 Below you can find more details about the packages and result reproduction.

@@ -123,6 +125,8 @@ or external resources:
 converts HTML pages to Markup language
 - jusText: https://github.com/miso-belica/jusText -
 Heuristic based boilerplate removal tool
+- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ -
+Python library for pulling data out of HTML and XML files.

 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.
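
As a rough sanity check on the ``beautifulsoup`` row added in this diff, the sketch below computes the harmonic mean of the reported mean precision and recall. This is only an approximation under an assumption: the commit does not say how the benchmark aggregates F1, and if it averages per-document scores the reported value need not match this formula exactly. The function name ``f1_from_means`` is hypothetical.

```python
def f1_from_means(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean values copied from the beautifulsoup row in the README diff.
reported_precision = 0.499
reported_recall = 0.994
reported_f1 = 0.665

approx_f1 = f1_from_means(reported_precision, reported_recall)
print(f"approximate F1 from means: {approx_f1:.3f} (reported: {reported_f1})")
```

The result comes out at about 0.664, within the stated ± 0.015 of the reported 0.665, which is consistent with very high recall paired with roughly 50% precision: the whole page text is returned, boilerplate included.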

extractors/run_beautifulsoup.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#!/usr/bin/env python3
+import gzip
+import json
+from pathlib import Path
+
+from bs4 import BeautifulSoup
+
+
+def main():
+    output = {}
+    for path in Path('html').glob('*.html.gz'):
+        with gzip.open(path, 'rt', encoding='utf8') as f:
+            html = f.read()
+        item_id = path.stem.split('.')[0]
+        bs = BeautifulSoup(html, 'html.parser')
+        article = bs.get_text(separator=' ', strip=True)
+        output[item_id] = {'articleBody': article}
+    (Path('output') / 'beautifulsoup.json').write_text(
+        json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4),
+        encoding='utf8')
+
+
+if __name__ == '__main__':
+    main()
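
The file handling in the script above can be tried without BeautifulSoup installed. The following is a minimal sketch using only the standard library: it mirrors the gzip read, the item ID derivation, and the JSON serialization options, but stores the raw HTML instead of extracted text. The helper names ``item_id_from_path`` and ``collect_raw_pages`` and the sample file ``abc123.html.gz`` are hypothetical and not part of the commit.

```python
import gzip
import json
import tempfile
from pathlib import Path


def item_id_from_path(path: Path) -> str:
    # Path.stem drops only the last suffix ('.gz'), leaving 'name.html';
    # splitting on '.' and keeping the first piece removes '.html' as well.
    return path.stem.split('.')[0]


def collect_raw_pages(html_dir: Path) -> dict:
    # Same loop shape as the script, minus the HTML parsing step.
    pages = {}
    for path in sorted(html_dir.glob('*.html.gz')):
        with gzip.open(path, 'rt', encoding='utf8') as f:
            pages[item_id_from_path(path)] = {'articleBody': f.read()}
    return pages


with tempfile.TemporaryDirectory() as tmp:
    html_dir = Path(tmp)
    # Write one hypothetical gzip-compressed page to read back.
    with gzip.open(html_dir / 'abc123.html.gz', 'wt', encoding='utf8') as f:
        f.write('<p>café</p>')
    pages = collect_raw_pages(html_dir)
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    serialized = json.dumps(pages, sort_keys=True, ensure_ascii=False, indent=4)

print(serialized)
```

Note that the ID derivation keeps only the text before the first dot, so a file name containing extra dots before ``.html.gz`` would be truncated at the first one.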
