diff --git a/README.rst b/README.rst
diff --git a/extractors/run_beautifulsoup.py b/extractors/run_beautifulsoup.py
@@ -19,7 +19,8 @@ and open-source libraries
 `Goose3 <https://github.com/goose3/goose3>`_,
 `inscriptis <https://github.com/weblyzard/inscriptis>`_,
 `html2text <https://github.com/Alir3z4/html2text>`_,
-`jusText <https://github.com/miso-belica/jusText>`_.
+`jusText <https://github.com/miso-belica/jusText>`_,
+`BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_.
 We release evaluation datasets and scripts,
 and provide more details in a whitepaper.
 
@@ -56,6 +57,7 @@ Result of packages added after original evaluation::
     inscriptis           precision=0.517 ± 0.017  recall=0.993 ± 0.001  F1=0.679 ± 0.015 accuracy=0.000 ± 0.000
     html2text            precision=0.499 ± 0.017  recall=0.983 ± 0.002  F1=0.662 ± 0.015 accuracy=0.000 ± 0.000
     justext              precision=0.858 ± 0.017  recall=0.754 ± 0.028  F1=0.802 ± 0.018 accuracy=0.088 ± 0.021
+    beautifulsoup        precision=0.499 ± 0.017  recall=0.994 ± 0.001  F1=0.665 ± 0.015 accuracy=0.000 ± 0.000
 
 Below you can find more details about the packages and result reproduction.
 
@@ -123,6 +125,8 @@ or external resources:
   converts HTML pages to Markup language
 - jusText: https://github.com/miso-belica/jusText -
   Heuristic based boilerplate removal tool
+- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ -
+  Python library for pulling data out of HTML and XML files.
 
 Output from these libraries is already present in the repo in ``output/*.json`` files.
 They were generated with ``extractors/run_*.py`` files.
 
@@ -0,0 +1,24 @@
+#!/usr/bin/env python3
+import gzip
+import json
+from pathlib import Path
+
+from bs4 import BeautifulSoup
+
+
+def main():
+    output = {}
+    for path in Path('html').glob('*.html.gz'):
+        with gzip.open(path, 'rt', encoding='utf8') as f:
+            html = f.read()
+        item_id = path.stem.split('.')[0]
+        bs = BeautifulSoup(html, 'html.parser')
+        article = bs.get_text(separator=' ', strip=True)
+        output[item_id] = {'articleBody': article}
+    (Path('output') / 'beautifulsoup.json').write_text(
+        json.dumps(output, sort_keys=True, ensure_ascii=False, indent=4),
+        encoding='utf8')
+
+
+if __name__ == '__main__':
+    main()
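
The new script keeps every text node on the page, which may explain the high recall and low precision reported for ``beautifulsoup`` in the README table. A minimal sketch of that behaviour follows. It assumes the standard library's ``html.parser.HTMLParser`` as a stand-in for ``BeautifulSoup.get_text(separator=' ', strip=True)``; the helper names ``item_id_from_path`` and ``extract_all_text`` and the sample HTML are hypothetical, not part of this change.

```python
from html.parser import HTMLParser
from pathlib import Path


def item_id_from_path(path: Path) -> str:
    # Path.stem removes only the final suffix ('.gz'), leaving 'abc123.html';
    # taking the part before the first '.' gives the bare item id.
    return path.stem.split('.')[0]


class _TextCollector(HTMLParser):
    """Collects every non-empty text chunk, regardless of the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_all_text(html: str) -> str:
    # Stand-in (assumption) for get_text(separator=' ', strip=True):
    # strip each chunk, drop empty ones, join the rest with a single space.
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return ' '.join(collector.chunks)


if __name__ == '__main__':
    sample = '<nav><a href="/">Home</a></nav><p>Hello <b>world</b>.</p>'
    # Navigation text ('Home') ends up in the output next to the article text.
    print(item_id_from_path(Path('html/abc123.html.gz')), extract_all_text(sample))
```

Because navigation and other boilerplate text is kept alongside the article body, nearly all of the true body text is recovered, while much of the output is not body text.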