Commit 8ce436d

Improve documentation about parsing URLs in lxml_html_clean.
1 parent 0d1a6e1 commit 8ce436d

3 files changed: 15 additions, 0 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@

 This project was initially a part of [lxml](https://github.com/lxml/lxml). Because HTML cleaner is designed as blocklist-based, many reports about possible security vulnerabilities were filed for lxml and that make the project problematic for security-sensitive environments. Therefore we decided to extract the problematic part to a separate project.

+**Important**: the HTML Cleaner in ``lxml_html_clean`` is **not** considered appropriate **for security sensitive environments**. See e.g. [bleach](https://pypi.org/project/bleach/) for an alternative.
+
+This project uses functions from Python's `urllib.parse` for URL parsing which **do not validate inputs**. For more information on potential security risks, refer to the [URL parsing security](https://docs.python.org/3/library/urllib.parse.html#url-parsing-security) documentation. A maliciously crafted URL could potentially bypass the allowed hosts check in `Cleaner`.
+
 ## Installation

 You can install this project directly via `pip install lxml_html_clean` or as an extra of lxml

docs/index.rst

Lines changed: 8 additions & 0 deletions
@@ -8,6 +8,14 @@ This project was initially a part of `lxml <https://github.com/lxml/lxml>`_. Bec
 many reports about possible security vulnerabilities were filed for lxml and that make the project problematic for
 security-sensitive environments. Therefore we decided to extract the problematic part to a separate project.

+**Important**: the HTML Cleaner in ``lxml_html_clean`` is **not** considered appropriate **for security sensitive environments**.
+See e.g. `bleach <https://pypi.org/project/bleach/>`_ for an alternative.
+
+This project uses functions from Python's ``urllib.parse`` for URL parsing which **do not validate inputs**.
+For more information on potential security risks, refer to the
+`URL parsing security <https://docs.python.org/3/library/urllib.parse.html#url-parsing-security>`_ documentation.
+A maliciously crafted URL could potentially bypass the allowed hosts check in ``Cleaner``.
+
 Security
 --------

lxml_html_clean/clean.py

Lines changed: 3 additions & 0 deletions
@@ -185,6 +185,9 @@ class Cleaner:

 Note that you may also need to set ``whitelist_tags``.

+Note that URLs are parsed via functions from ``urllib.parse`` and
+no input validation is performed.
+
 ``whitelist_tags``:
 A set of tags that can be included with ``host_whitelist``.
 The default is ``iframe`` and ``embed``; you may wish to
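
The note added in this commit says that `urllib.parse` does not validate its input. The sketch below illustrates what that can mean for a host allow-list. It is not the code used by `Cleaner`: the names `ALLOWED_HOSTS`, `naive_check`, and `stricter_check` are hypothetical and exist only for this illustration, and only `urllib.parse.urlsplit` from the standard library is used.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list, used only for this illustration.
ALLOWED_HOSTS = frozenset({"example.com"})


def naive_check(url):
    """Fragile: accepts any netloc that merely starts with an allowed host."""
    netloc = urlsplit(url).netloc
    return any(netloc.startswith(host) for host in ALLOWED_HOSTS)


def stricter_check(url):
    """Compares the scheme and the parsed hostname exactly."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS


samples = [
    "https://example.com/embed/1",      # ordinary URL
    "https://example.com@evil.test/x",  # "example.com" is userinfo here; the host is evil.test
    "example.com/embed/1",              # no scheme: urlsplit puts everything in the path
    "not a url at all",                 # accepted by urlsplit without raising
]

for url in samples:
    parts = urlsplit(url)
    # Print the parsed pieces next to the verdict of each check.
    print(repr(url), repr(parts.scheme), repr(parts.netloc), repr(parts.hostname),
          naive_check(url), stricter_check(url))
```

None of the samples raises an exception, and the second one passes the naive check even though its host is `evil.test`. The stricter comparison rejects it, but it still relies on how `urlsplit` splits the string, so it should not be read as validation; browsers and other consumers may interpret unusual URLs differently, which is the risk the documentation change points out.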
