Sanitizing HTML text
This document digs deeper into the sanitize_html() function.
This page contains code snippets (lines starting with >>>), which are
being tested during our development workflow. The following
snippet initializes the demo project used throughout this page.
>>> from lino_book.projects.min1.startup import *
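How the project actually runs these snippets is not described on this page. As a rough illustration only, the standard library's doctest module can check a page fragment like this; the `page` string below is a made-up stand-in, not taken from the real test suite:

```python
import doctest

# Hypothetical stand-in for a documentation page with one tested snippet.
page = '''
>>> "Foo".lower()
'foo'
'''

# Parse the text into a DocTest object and run its examples.
parser = doctest.DocTestParser()
test = parser.get_doctest(page, globs={}, name="page", filename=None, lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# results.failed counts mismatching examples, results.attempted counts all examples.
print(results.failed, results.attempted)
```

Each line starting with >>> is executed, and the text that follows it is compared with the real output.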
Constants
The module defines two constants,
ALLOWED_TAGS and ALLOWED_ATTRIBUTES.
- lino.utils.soup.ALLOWED_TAGS
A set of tag names that are to remain in sanitized HTML.
>>> from lino.utils.soup import ALLOWED_TAGS
>>> from pprint import pprint
>>> pprint(ALLOWED_TAGS)
frozenset({'a', 'b', 'br', 'code', 'col', 'colgroup', 'def', 'div', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'h8', 'h9', 'i', 'img', 'li', 'ol', 'p', 'pre', 'span', 'strong', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'ul'})
- lino.utils.soup.ALLOWED_ATTRIBUTES
A dictionary mapping tag names to the set of attribute names that are to remain in sanitized HTML.
>>> from lino.utils.soup import ALLOWED_ATTRIBUTES
>>> pprint(ALLOWED_ATTRIBUTES, sort_dicts=True)
...
{'a': {'href', 'title'}, 'abbr': {'title'}, 'acronym': {'title'}, 'p': {'align'}, 'span': {'class', 'contenteditable', 'data-denotation-char', 'data-index', 'data-link', 'data-title', 'data-value'}}
The above snippet is skipped because
pprint() displays the content of sets in arbitrary order, even when sort_dicts is set to True.
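One possible way to get a reproducible display is to convert each set into a sorted list before printing. This is only a sketch: `attrs` is a small made-up mapping shaped like ALLOWED_ATTRIBUTES, and `stable_view` is a hypothetical helper that does not exist in lino.utils.soup.

```python
from pprint import pformat

# Made-up mapping shaped like ALLOWED_ATTRIBUTES (tag name -> set of attribute names).
attrs = {"span": {"data-index", "class"}, "a": {"title", "href"}}


def stable_view(mapping):
    """Return a copy with sorted keys and every set turned into a sorted list."""
    return {key: sorted(values) for key, values in sorted(mapping.items())}


print(pformat(stable_view(attrs)))
# {'a': ['href', 'title'], 'span': ['class', 'data-index']}
```

Lists keep their order when printed, so the output no longer depends on how the strings happen to be hashed.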
Examples
Here are some tests to verify whether sanitize_html() does what we want.
>>> from lino.utils.soup import sanitize_html
Sanitizing “normalizes” the HTML content:
>>> print(sanitize_html("<p>One paragraph<p>Another paragraph"))
<p>One paragraph</p><p>Another paragraph</p>
>>> print(sanitize_html("<pre>"))
<pre></pre>
>>> print(sanitize_html("<pre>\n</pre>"))
<pre>
</pre>
When content is a single <p> tag, sanitizing NO LONGER unwraps it:
>>> print(sanitize_html("<p>One line<br>Another line"))
<p>One line<br>Another line</p>
>>> print(sanitize_html('<p align="center">One<br>two'))
<p align="center">One<br>two</p>
Plain text becomes a single paragraph and gets wrapped into a <p> tag:
>>> print(sanitize_html("Foo"))
<p>Foo</p>
Characters with special meaning get escaped:
>>> print(sanitize_html("Foo & Bar, Inc."))
<p>Foo &amp; Bar, Inc.</p>
>>> print(sanitize_html("When a < b and b < c then a < c"))
<p>When a &lt; b and b &lt; c then a &lt; c</p>
But valid formatting tags are recognized and preserved:
>>> print(sanitize_html("When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>."))
<p>When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>.</p>
Here is a surprising behaviour, which shows that you should still escape literal markup yourself:
>>> print(sanitize_html("About the <p> tag"))
<p>About the </p><p> tag</p>
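Escaping the text yourself can be done with html.escape() from the standard library. The sketch below only shows the escaping step. That the escaped string then passes through sanitize_html() as a single paragraph is an assumption that was not verified here.

```python
from html import escape

text = "About the <p> tag"

# Replace <, > and & (and quotes) by character references so that
# the angle brackets are no longer read as markup.
escaped = escape(text)
print(escaped)  # About the &lt;p&gt; tag
```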
The output is UTF-8 encoded, so we don’t need to escape umlauts and accents.
>>> print(sanitize_html("Ein süßes Kätzchen"))
<p>Ein süßes Kätzchen</p>
>>> print(sanitize_html("Monsieur l'Évêque loge à l'hôtel"))
<p>Monsieur l'Évêque loge à l'hôtel</p>
Even if you escape umlauts, sanitizing will render them as UTF-8. We are in the 21st century after all:
>>> print(sanitize_html("Ein s&uuml;&szlig;es K&auml;tzchen"))
<p>Ein süßes Kätzchen</p>
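For comparison, the standard library decodes named character references to the same characters. This sketch uses html.unescape() and says nothing about how sanitize_html() does it internally:

```python
from html import unescape

# Named character references for ü, ß and ä
decoded = unescape("Ein s&uuml;&szlig;es K&auml;tzchen")
print(decoded)  # Ein süßes Kätzchen
```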
An empty string remains an empty string:
>>> sanitize_html("")
''
More examples
>>> print(sanitize_html("<pre></pre>"))
<pre></pre>
>>> print(sanitize_html("<p>Foo</p>"))
<p>Foo</p>
>>> print(sanitize_html("One<br>two"))
<p>One<br>two</p>
>>> print(sanitize_html("One<br>two</p>"))
<p>One<br>two</p>
>>> print(sanitize_html("<p></p>"))
<p></p>
>>> print(sanitize_html(""))
>>> content = """
... No tag at beginning of text.
... bla bLTaQSTyI80t2t8l
... foo bar.
... And here is some <b>bold</b> text.
...
... """
>>> print(sanitize_html(content))
<p>No tag at beginning of text.
bla bLTaQSTyI80t2t8l
foo bar.
And here is some <b>bold</b> text.</p>
>>> content = """
... <p align="right">First paragraph</p>
... <p onclick="kill()">Second paragraph</p>
... """
>>> print(sanitize_html(content))
<p align="right">First paragraph</p>
<p>Second paragraph</p>
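The onclick attribute disappears because it is not on the allow-list for <p>. The idea can be sketched with the standard library's html.parser. This is a simplified illustration, not the implementation used by lino.utils.soup; `ALLOWED`, `AttributeFilter` and `filter_attributes` are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list, shaped like ALLOWED_ATTRIBUTES.
ALLOWED = {"p": {"align"}, "a": {"href", "title"}}


class AttributeFilter(HTMLParser):
    """Re-emit the parsed HTML, keeping only allow-listed attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep an attribute only if it is allowed for this tag.
        kept = [(k, v) for k, v in attrs if k in ALLOWED.get(tag, ())]
        rendered = "".join(f' {k}="{escape(v or "")}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # Pass named character references through unchanged.
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        # Pass numeric character references through unchanged.
        self.out.append(f"&#{name};")


def filter_attributes(html):
    parser = AttributeFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(filter_attributes('<p align="right">First</p><p onclick="kill()">Second</p>'))
# <p align="right">First</p><p>Second</p>
```

The sketch filters attributes only. It does not remove disallowed tags and does not repair unclosed tags, both of which the examples above show sanitize_html() doing.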
>>> content = """
... <p>Here is a code example:</p>
... <p class="ql-code-block">int i = 2;</p>
... <p>More explanations</p>
... """
>>> print(sanitize_html(content))
<p>Here is a code example:</p>
<p class="ql-code-block">int i = 2;</p>
<p>More explanations</p>
>>> content = """
... <!DOCTYPE html>
... <html>
... <head>
... <meta http-equiv="content-type" content="text/html; charset=UTF-8">
... <title>Baby</title>
... </head>
... <body>
... This is a descriptive text with <b>some</b> formatting.<br>
... <br>
... Here is a second paragraph.<br>
... <br>
... </body>
... </html>
... """
>>> print(sanitize_html(content))
This is a descriptive text with <b>some</b> formatting.<br>
<br>
Here is a second paragraph.<br>
<br>
Automatically remove background color and other unwanted formatting
The lino.core.site.Site.unwrap_span_tags setting is a possible answer to
#6381 (Copying text from Quill causes the text to have grey background
color). When it is set to True, every <span> tag in HTML content is unwrapped:
the tag itself is removed while its content is kept. This setting
obviously breaks features like inserting images or using suggesters.
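Unwrapping means dropping the opening and closing tag while keeping what is between them. A rough approximation with a regular expression is shown below. It is only meant to illustrate the effect: regular expressions are fragile on real-world HTML, and `unwrap_spans` is a hypothetical helper, not part of Lino.

```python
import re

# Matches an opening or closing span tag, including any attributes.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)


def unwrap_spans(html):
    """Remove span tags but keep the text they enclose."""
    return SPAN_TAG.sub("", html)


print(unwrap_spans('<p><span style="background:grey">Some text</span></p>'))
# <p>Some text</p>
```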
>>> content = """
... <p>
... <span style="background:grey">Here is some text with grey background</span>
... </p>
... """
>>> settings.SITE.unwrap_span_tags
False
>>> print(sanitize_html(content))
<p>
<span style="background:grey">Here is some text with grey background</span>
</p>
>>> settings.SITE.unwrap_span_tags = True
>>> print(sanitize_html(content))
<p>
Here is some text with grey background
</p>
Don’t forget to restore the default value:
>>> settings.SITE.unwrap_span_tags = False
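To make restoring the value harder to forget, the temporary change can be wrapped in a context manager. This is a minimal sketch: `temporary_setting` is a hypothetical helper, and `site` is a stand-in object used so that the example runs without a Lino project.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(obj, name, value):
    """Set obj.<name> to value inside the with block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs even if the block raises, so the old value always comes back.
        setattr(obj, name, old)


# Stand-in for settings.SITE (an assumption made for this sketch).
site = SimpleNamespace(unwrap_span_tags=False)

with temporary_setting(site, "unwrap_span_tags", True):
    print(site.unwrap_span_tags)  # True inside the block
print(site.unwrap_span_tags)  # False again afterwards
```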