Sanitizing HTML text

This document digs deeper into the sanitize_html() function.

This page contains code snippets (lines starting with >>>), which are tested as part of our development workflow. The following snippet initializes the demo project used throughout this page.

>>> from lino_book.projects.min1.startup import *

Constants

The lino.utils.soup module defines two constants, ALLOWED_TAGS and ALLOWED_ATTRIBUTES.

lino.utils.soup.ALLOWED_TAGS

A set of tag names that remain in sanitized HTML.

>>> from lino.utils.soup import ALLOWED_TAGS
>>> from pprint import pprint
>>> pprint(ALLOWED_TAGS)
frozenset({'a', 'b', 'br', 'code', 'col', 'colgroup', 'def', 'div', 'em', 'h1', 'h2',
'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'h9', 'i', 'img', 'li', 'ol', 'p', 'pre',
'span', 'strong', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'ul'})

lino.utils.soup.ALLOWED_ATTRIBUTES

A dictionary mapping each tag name to the set of attribute names that remain in sanitized HTML.

>>> from lino.utils.soup import ALLOWED_ATTRIBUTES
>>> pprint(ALLOWED_ATTRIBUTES, sort_dicts=True)
...
{'a': {'href', 'title'},
 'abbr': {'title'},
 'acronym': {'title'},
 'p': {'align'},
 'span': {'class',
          'contenteditable',
          'data-denotation-char',
          'data-index',
          'data-link',
          'data-title',
          'data-value'}}

The above snippet is skipped because pprint() displays the content of sets in arbitrary ordering even when sort_dicts is set to True.
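
To illustrate how two allow-lists of this shape can be applied, here is a minimal sketch that uses only the standard library. It is not the implementation of sanitize_html(); the names DEMO_TAGS, DEMO_ATTRIBUTES, AllowListFilter and filter_html are hypothetical and exist only for this example, and the demo allow-lists are much smaller than the real ones.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists for this sketch only; they are much smaller
# than the real ALLOWED_TAGS and ALLOWED_ATTRIBUTES.
DEMO_TAGS = frozenset({"p", "b", "i", "a"})
DEMO_ATTRIBUTES = {"a": {"href", "title"}, "p": {"align"}}


class AllowListFilter(HTMLParser):
    """Re-emit only allowed tags and attributes; keep all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in DEMO_TAGS:
            return  # drop the tag itself, but keep the text it contains
        allowed = DEMO_ATTRIBUTES.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in DEMO_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # The parser has unescaped the text, so escape it again on output.
        self.parts.append(escape(data, quote=False))


def filter_html(text):
    """Return text with disallowed tags and attributes removed."""
    parser = AllowListFilter()
    parser.feed(text)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(filter_html('<p align="right" onclick="kill()">Hi <u>there</u></p>'))
```

The sketch does not close unclosed tags and does not wrap plain text into a paragraph, which sanitize_html() does according to the examples on this page.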

Examples

Here are some tests to verify whether sanitize_html() does what we want.

>>> from lino.utils.soup import sanitize_html

Sanitizing “normalizes” the HTML content, for example by closing tags that were left open:

>>> print(sanitize_html("<p>One paragraph<p>Another paragraph"))
<p>One paragraph</p><p>Another paragraph</p>
>>> print(sanitize_html("<pre>"))
<pre></pre>
>>> print(sanitize_html("<pre>\n</pre>"))
<pre>
</pre>

When the content is a single <p> element, sanitizing no longer unwraps it:

>>> print(sanitize_html("<p>One line<br>Another line"))
<p>One line<br>Another line</p>
>>> print(sanitize_html('<p align="center">One<br>two'))
<p align="center">One<br>two</p>

Plain text becomes a single paragraph and gets wrapped into a <p> tag:

>>> print(sanitize_html("Foo"))
<p>Foo</p>

Characters with special meaning get escaped:

>>> print(sanitize_html("Foo & Bar, Inc."))
<p>Foo & Bar, Inc.</p>
>>> print(sanitize_html("When a < b and b < c then a < c"))
<p>When a &lt; b and b &lt; c then a &lt; c</p>

But valid formatting tags are recognized and preserved:

>>> print(sanitize_html("When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>."))
<p>When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>.</p>

Here is a surprising behaviour, which shows that you should still escape special characters yourself when you want them to appear as text:

>>> print(sanitize_html("About the <p> tag"))
<p>About the </p><p> tag</p>
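
One way to escape such text yourself is the standard library function html.escape(). The following sketch shows only the escaping step; the helper name as_literal_text is hypothetical and is not part of Lino.

```python
from html import escape


def as_literal_text(text):
    """Escape markup characters so that they are treated as text."""
    # quote=False leaves quotation marks alone; only &, < and > are escaped.
    return escape(text, quote=False)


print(as_literal_text("About the <p> tag"))
# prints: About the &lt;p&gt; tag
```

A string escaped this way no longer contains a literal <p> tag, so passing it to sanitize_html() should not split it into two paragraphs.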

Umlauts and accented characters in the input do not need to be escaped. In the output they appear as named HTML entities:

>>> print(sanitize_html("Ein süßes Kätzchen"))
<p>Ein s&uuml;&szlig;es K&auml;tzchen</p>
>>> print(sanitize_html("Monsieur l'Évêque loge à l'hôtel"))
<p>Monsieur l'&Eacute;v&ecirc;que loge &agrave; l'h&ocirc;tel</p>

Umlauts that are already escaped in the input give the same result:

>>> print(sanitize_html("Ein s&uuml;&szlig;es K&auml;tzchen"))
<p>Ein s&uuml;&szlig;es K&auml;tzchen</p>
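
The two notations seen in these examples, named entities and plain Unicode characters, can be converted into each other with the standard library. This is a sketch that is independent of sanitize_html(); the helper name to_named_entities is hypothetical.

```python
from html import unescape
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace non-ASCII characters by named entities where one exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append("&{};".format(name))
        else:
            out.append(ch)  # ASCII, or no named entity available
    return "".join(out)


escaped = to_named_entities("Ein süßes Kätzchen")
print(escaped)
# prints: Ein s&uuml;&szlig;es K&auml;tzchen
print(unescape(escaped))
# prints: Ein süßes Kätzchen
```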

An empty string remains an empty string:

>>> sanitize_html("")
''

More examples

>>> print(sanitize_html("<pre></pre>"))
<pre></pre>
>>> print(sanitize_html("<p>Foo</p>"))
<p>Foo</p>
>>> print(sanitize_html("One<br>two"))
<p>One<br>two</p>
>>> print(sanitize_html("One<br>two</p>"))
<p>One<br>two</p>
>>> print(sanitize_html("<p></p>"))
<p></p>
>>> print(sanitize_html(""))
<BLANKLINE>
>>> content = """
... No tag at beginning of text.
... bla bLTaQSTyI80t2t8l
... foo bar.
... And here is some <b>bold</b> text.
...
... """
>>> print(sanitize_html(content))
<p>No tag at beginning of text.
bla bLTaQSTyI80t2t8l
foo bar.
And here is some <b>bold</b> text.</p>
>>> content = """
... <p align="right">First paragraph</p>
... <p onclick="kill()">Second paragraph</p>
... """
>>> print(sanitize_html(content))
<p align="right">First paragraph</p>
<p>Second paragraph</p>
>>> content = """
... <p>Here is a code example:</p>
... <p class="ql-code-block">int i = 2;</p>
... <p>More explanations</p>
... """
>>> print(sanitize_html(content))
<p>Here is a code example:</p>
<p class="ql-code-block">int i = 2;</p>
<p>More explanations</p>
>>> content = """
... <!DOCTYPE html>
... <html>
...   <head>
...     <meta http-equiv="content-type" content="text/html; charset=UTF-8">
...     <title>Baby</title>
...   </head>
...   <body>
...     This is a descriptive text with <b>some</b> formatting.<br>
...     <br>
...     Here is a second paragraph.<br>
...     <br>
...   </body>
... </html>
... """
>>> print(sanitize_html(content))
This is a descriptive text with <b>some</b> formatting.<br>
<br>
    Here is a second paragraph.<br>
<br>

Automatically remove background color and other unwanted formatting

The lino.core.site.Site.unwrap_span_tags setting is a possible answer to #6381 (Copying text from Quill causes the text to have grey background color). When it is set to True, every <span> tag in the HTML content is unwrapped: the tag itself is removed while its content is kept. This setting obviously breaks features like inserting images or using suggesters.

>>> content = """
... <p>
... <span style="background:grey">Here is some text with grey background</span>
... </p>
... """
>>> settings.SITE.unwrap_span_tags
False
>>> print(sanitize_html(content))
<p>
<span style="background:grey">Here is some text with grey background</span>
</p>
>>> settings.SITE.unwrap_span_tags = True
>>> print(sanitize_html(content))
<p>
Here is some text with grey background
</p>

Don’t forget to restore the default value:

>>> settings.SITE.unwrap_span_tags = False
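
Restoring a setting by hand is easy to forget. A context manager can do it automatically, even when the code inside raises an exception. The following is a minimal sketch with a stand-in object instead of settings.SITE; the override helper is hypothetical and not part of Lino.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override(obj, name, value):
    """Temporarily set obj.name to value and restore the old value on exit."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and also when an exception was raised.
        setattr(obj, name, old)


# Stand-in for settings.SITE, used only in this sketch.
site = SimpleNamespace(unwrap_span_tags=False)

with override(site, "unwrap_span_tags", True):
    print(site.unwrap_span_tags)  # True inside the block
print(site.unwrap_span_tags)  # False again after the block
```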