|
| 1 | +[](https://pypi.python.org/pypi/readability-lxml) |
| 2 | + |
| 3 | +# python-readability |
| 4 | + |
| 5 | +Given an HTML document, extract and clean up the main body text and title. |
| 6 | + |
| 7 | +This is a Python port of a Ruby port of [arc90's Readability project](https://web.archive.org/web/20130519040221/http://www.readability.com/). |
| 8 | + |
| 9 | +## Installation |
| 10 | + |
| 11 | +Install with `pip`: |
| 12 | + |
| 13 | +```bash |
| 14 | +$ pip install readability-lxml |
| 15 | +``` |
| 16 | + |
| 17 | +Alternatively, install with conda: |
| 18 | + |
| 19 | +```bash |
| 20 | +$ conda install -c conda-forge readability-lxml |
| 21 | +``` |
| 22 | + |
| 23 | +## Usage |
| 24 | + |
| 25 | +```python |
| 26 | +>>> import requests |
| 27 | +>>> from readability import Document |
| 28 | + |
| 29 | +>>> response = requests.get('http://example.com') |
| 30 | +>>> doc = Document(response.content) |
| 31 | +>>> doc.title() |
| 32 | +'Example Domain' |
| 33 | + |
| 34 | +>>> doc.summary() |
| 35 | +"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n |
| 36 | +<p>This domain is established to be used for illustrative examples in documents. You may |
| 37 | +use this\n domain in examples without prior coordination or asking for permission.</p> |
| 38 | +\n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div> |
| 39 | +\n</body>\n</div></body></html>""" |
| 40 | +``` |
| 41 | + |
| 42 | +## Change Log |
| 43 | +- 0.8.4 Better CJK support, thanks @cdhigh |
| 44 | +- 0.8.3.1 Support for Python 3.8 - 3.13 |
| 45 | +- 0.8.3 Added `keep_all_images=True` to save all images (the default is to save one main image), thanks @botlabsDev |
| 46 | +- 0.8.2 Added article author(s) (thanks @mattblaha) |
| 47 | +- 0.8.1 Fixed processing of non-ASCII HTML via regexps. |
| 48 | +- 0.8 Replaced XHTML output with HTML5 output in the `summary()` call. |
| 49 | +- 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces. |
| 50 | +- 0.7 Improved HTML5 tag handling. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed). |
| 51 | +- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6 |
| 52 | +- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4 |
| 53 | +- 0.4 Added video loading and allowed more images per paragraph |
| 54 | +- 0.3 Added Document.encoding, positive\_keywords and negative\_keywords |
| 55 | + |
| 56 | +## Licensing |
| 57 | + |
| 58 | +This code is licensed under [the Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0). |
| 59 | + |
| 60 | +## Thanks to |
| 61 | + |
| 62 | +- Latest [readability.js](https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js) |
| 63 | +- Ruby port by starrhorne and iterationlabs |
| 64 | +- [Python port](https://github.com/gfxmonk/python-readability) by gfxmonk |
| 65 | +- [Decruft effort](https://web.archive.org/web/20110214150709/https://www.minvolai.com/blog/decruft-arc90s-readability-in-python/) to move to lxml |
| 66 | +- "BR to P" fix from readability.js which improves quality for smaller texts |
| 67 | +- Contributions from GitHub users. |