Bleaching¶
When an end user copies rich text from another application into Lino, the text can contain styles and other markup that cause side effects when it is displayed or printed. A malicious user might also deliberately insert HTML with scripts or other harmful content in order to attack your server. To avoid such problems, we remove any dangerous parts from content that gets entered into a rich text field using the web interface. This process is called “sanitizing” or “bleaching”.
Until November 2024, we used the bleach Python package for sanitizing HTML
input. But this package was deprecated in January 2023. Now we
use our own function lino.utils.soup.sanitize(), which relies on
BeautifulSoup and is inspired by a blog post by Chase Seibert.
Side note: Code snippets (lines starting with >>>) in this document get
tested as part of our development workflow. The following
initialization snippet tells you which demo project is being used in
this document.
>>> import os
>>> from lino import startup
>>> startup('lino_book.projects.min2.settings.doctests')
>>> from lino.api.doctest import *
Usage¶
All rich text fields (RichHtmlField) get bleached by default. To
deactivate this feature, set textfield_bleached to False in your
settings.py file:
textfield_bleached = False
Keep in mind that textfield_bleached is only the default value.
The application developer can force bleaching to be activated or not for a
specific field by explicitly passing a bleached argument when declaring
the field.
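The interplay between the site-wide default and a per-field override can be pictured with a minimal sketch. This is not Lino code: FieldSpec, is_bleached() and TEXTFIELD_BLEACHED_DEFAULT are hypothetical names that only illustrate the rule described above, assuming that an unspecified bleached argument means "use the default".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the site-wide textfield_bleached setting.
TEXTFIELD_BLEACHED_DEFAULT = True


@dataclass
class FieldSpec:
    """Hypothetical, simplified stand-in for a rich text field declaration."""
    name: str
    bleached: Optional[bool] = None  # None means "use the site-wide default"


def is_bleached(field: FieldSpec,
                site_default: bool = TEXTFIELD_BLEACHED_DEFAULT) -> bool:
    """Return whether content entered into this field gets bleached."""
    if field.bleached is None:
        return site_default  # no explicit choice on the field
    return field.bleached  # an explicit choice always wins


body = FieldSpec("body")                  # follows the site-wide default
raw = FieldSpec("raw", bleached=False)    # never bleached
print(is_bleached(body), is_bleached(raw))
```

With the site-wide default switched off, only fields declared with an explicit bleached=True would still be bleached.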
How to bleach existing unbleached data¶
The lino.modlib.system.BleachChecker data checker reports fields
whose content would be changed by bleaching. This is useful when you
activate bleaching on a site with existing data. After activating
bleaching, you can check for unbleached content by saying:
$ django-admin checkdata system.BleachChecker
After this you can use the web interface to inspect the data problems. To
manually bleach a single database object, simply save it using the web
interface. You should make sure that bleaching does not remove any content
that is actually needed. If this happens, you must manually restore the
content of the affected database objects, or restore a full backup, and then
adjust your bleach_allowed_tags setting.
To bleach all existing data, you can say:
$ django-admin checkdata system.BleachChecker --fix
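The idea behind such a checker (report every field whose content would change, and optionally write the sanitized value back) can be sketched in a few lines. This is only an illustration under stated assumptions, not the BleachChecker implementation: check_rows() and toy_sanitize() are hypothetical names, rows are plain dictionaries instead of database objects, and the toy sanitizer merely strips script elements.

```python
import re


def toy_sanitize(html):
    """Toy stand-in for a sanitizer: removes <script> elements only."""
    return re.sub(r"<script\b.*?</script>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)


def check_rows(rows, field_names, sanitize=toy_sanitize, fix=False):
    """Return a list of (row_id, field_name) whose content would change.

    rows is a list of dicts having an "id" key. With fix=True the
    sanitized value is written back into the dict.
    """
    problems = []
    for row in rows:
        for name in field_names:
            old = row.get(name) or ""
            new = sanitize(old)
            if new != old:
                problems.append((row["id"], name))
                if fix:
                    row[name] = new  # the equivalent of running with --fix
    return problems


rows = [
    {"id": 1, "body": "<p>ok</p>"},
    {"id": 2, "body": "<p>hi</p><script>alert(1)</script>"},
]
print(check_rows(rows, ["body"]))            # report only
print(check_rows(rows, ["body"], fix=True))  # report and repair
print(check_rows(rows, ["body"]))            # nothing left to report
```

Running the check once without fix gives you a chance to review what would be removed before any data is modified.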
The sanitize() function¶
- lino.utils.soup.sanitize(html)¶
Parse the given HTML chunk html and return a sanitized version of it.
- lino.utils.soup.ALLOWED_TAGS¶
A list of tag names that are to remain in HTML content if bleaching is active.
>>> from lino.utils.soup import ALLOWED_TAGS
>>> pprint(ALLOWED_TAGS)
frozenset({'a', 'b', 'br', 'def', 'div', 'em', 'i', 'img', 'li', 'ol', 'p', 'pre', 'span', 'strong', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'ul'})
- lino.utils.soup.ALLOWED_ATTRIBUTES¶
A dictionary that maps a tag name to the set of attributes that are to remain on this tag in HTML content if bleaching is active.
>>> from lino.utils.soup import ALLOWED_ATTRIBUTES
>>> pprint(ALLOWED_ATTRIBUTES, sort_dicts=True)
...
{'a': {'href', 'title'}, 'abbr': {'title'}, 'acronym': {'title'}, 'p': {'align'}, 'span': {'class', 'contenteditable', 'data-denotation-char', 'data-index', 'data-link', 'data-title', 'data-value'}}
The above snippet is skipped because pprint() displays the content of sets in arbitrary ordering even when sort_dicts is set to True.
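To illustrate how an allowlist of tags and attributes can be applied, here is a minimal sketch. It is not the code of lino.utils.soup.sanitize(): it swaps BeautifulSoup for the standard library's html.parser to stay dependency-free, the names AllowlistFilter, demo_sanitize(), DEMO_TAGS and DEMO_ATTRIBUTES are hypothetical, and the allowlists are deliberately smaller than the real ones. Unlike BeautifulSoup, this sketch does not close unclosed tags.

```python
from html import escape
from html.parser import HTMLParser

# Small illustrative allowlists, NOT the real values used by Lino.
DEMO_TAGS = frozenset({"a", "b", "br", "p"})
DEMO_ATTRIBUTES = {"a": {"href", "title"}, "p": {"align"}}
VOID_TAGS = frozenset({"br"})                 # tags without a closing tag
DROP_CONTENT = frozenset({"script", "style"})  # drop these including content


class AllowlistFilter(HTMLParser):
    """Re-emit only allowed tags and attributes; keep text of other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
            return
        if self._skip or tag not in DEMO_TAGS:
            return  # drop the tag itself, its text content is kept
        allowed = DEMO_ATTRIBUTES.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(k, escape(v or "", quote=True))
            for k, v in attrs if k in allowed)
        end = "/>" if tag in VOID_TAGS else ">"
        self.out.append("<{}{}{}".format(tag, kept, end))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip = max(0, self._skip - 1)
            return
        if not self._skip and tag in DEMO_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def demo_sanitize(html):
    """Return html with everything outside the demo allowlists removed."""
    parser = AllowlistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(demo_sanitize('<p onclick="kill()">Second paragraph</p>'))
print(demo_sanitize('<font color="red">x</font><br>'))
```

The design choice is the same as with any allowlist: everything that is not explicitly permitted gets dropped, so a newly invented tag or attribute is harmless by default.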
Examples¶
Here are some tests to verify whether bleaching does what we want.
Which models have bleachable fields?
>>> checker = checkdata.Checkers.get_by_value('system.BleachChecker')
>>> lst = [str(m) for m in checker.get_checkable_models()]
>>> print('\n'.join(sorted(lst)))
<class 'lino_xl.lib.cal.models.Calendar'>
<class 'lino_xl.lib.cal.models.Event'>
<class 'lino_xl.lib.cal.models.EventType'>
<class 'lino_xl.lib.cal.models.RecurrentEvent'>
<class 'lino_xl.lib.cal.models.Room'>
<class 'lino_xl.lib.cal.models.Task'>
>>> from lino.utils.soup import sanitize
>>> print(sanitize(""))
Lino bleaches only content that starts with a “<”, not e.g. reStructuredText:
>>> print(sanitize("A *greatly* **formatted** text: \n\n- one \n\n -two"))
...
A *greatly* **formatted** text:
- one
-two
Bleaching “normalizes” the html content:
>>> print(sanitize("<p>One paragraph<p>Another paragraph"))
<p>One paragraph</p><p>Another paragraph</p>
>>> print(sanitize("<pre>"))
<pre></pre>
>>> print(sanitize("<pre>\n</pre>"))
<pre>
</pre>
When content is wrapped into a single <p> tag, bleaching unwraps it:
>>> print(sanitize("<p>One line<br>Another line"))
One line<br/>Another line
>>> print(sanitize('<p align="center">One<br>two'))
<p align="center">One<br/>two</p>
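The unwrapping behaviour shown in these two examples can be approximated with a small sketch. It is an assumption about the rule, derived only from the examples in this document, and not the actual implementation: unwrap_single_paragraph() is a hypothetical helper that works on already normalized HTML and leaves a paragraph alone when it carries attributes or when there is more than one paragraph.

```python
import re

# Matches content that is entirely wrapped in one attribute-free <p> tag.
_SINGLE_P = re.compile(r"\A<p>(.*)</p>\Z", re.DOTALL)


def unwrap_single_paragraph(html):
    """Strip an attribute-free <p> wrapper if it is the only paragraph."""
    m = _SINGLE_P.match(html)
    if m is None:
        return html  # not wrapped, or the <p> has attributes
    inner = m.group(1)
    if re.search(r"</?p\b", inner):
        return html  # more than one paragraph: leave as is
    return inner


print(unwrap_single_paragraph("<p>One line<br/>Another line</p>"))
print(unwrap_single_paragraph('<p align="center">One<br/>two</p>'))
```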
More examples:
>>> print(sanitize("<pre></pre>"))
<pre></pre>
>>> print(sanitize("<p>Foo</p>"))
Foo
>>> print(sanitize("One<br>two"))
One<br/>two
>>> print(sanitize("One<br>two</p>"))
One<br/>two
>>> content = """
... No tag at beginning of text.
... bla bLTaQSTyI80t2t8l
... foo bar.
... And here is some <b>bold</b> text.
...
... """
>>> print(sanitize(content))
No tag at beginning of text.
bla bLTaQSTyI80t2t8l
foo bar.
And here is some <b>bold</b> text.
>>> content = """
... <p align="right">First paragraph</p>
... <p onclick="kill()">Second paragraph</p>
... """
>>> print(sanitize(content))
<p align="right">First paragraph</p>
<p>Second paragraph</p>
>>> content = """
... <!DOCTYPE html>
... <html>
... <head>
... <meta http-equiv="content-type" content="text/html; charset=UTF-8">
... <title>Baby</title>
... </head>
... <body>
... This is a descriptive text with <b>some</b> formatting.<br>
... <br>
... Here is a second paragraph.<br>
... <br>
... </body>
... </html>
... """
>>> print(sanitize(content))
This is a descriptive text with <b>some</b> formatting.<br/>
<br/>
Here is a second paragraph.<br/>
<br/>
Historical notes¶
Until 20170225, bleach required html5lib version 0.9999999 (7*"9") while the current version was 0.999999999 (9*"9"). This means that you might have inadvertently broken bleach when you asked to update html5lib:
$ pip install -U html5lib
...
Successfully installed html5lib-0.999999999
$ python -m bleach
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 163, in _run_module_as_main
    mod_name, _Error)
  File "/usr/lib/python2.7/runpy.py", line 111, in _get_module_details
    __import__(mod_name)  # Do not catch exceptions initializing package
  File "/site-packages/bleach/__init__.py", line 14, in <module>
    from html5lib.sanitizer import HTMLSanitizer
ImportError: No module named sanitizer