lino.utils.soup

The lino.utils.soup module defines two functions, sanitize_html() and truncate_comment(), both of which are implemented with the help of BeautifulSoup.

This page contains code snippets (lines starting with >>>), which are tested as part of our development workflow. The following snippet initializes the demo project used throughout this page.

>>> from lino_book.projects.min1.startup import *

The sanitize_html() and truncate_comment() functions

The lino.utils.soup module defines two functions, sanitize_html() and truncate_comment().

lino.utils.soup.sanitize_html()

Parse the given HTML markup html and return a sanitized version of it.

See Sanitizing HTML text

lino.utils.soup.truncate_comment(htmlstr, max_length=300)

Parse the given HTML markup, sanitize it and then return a single paragraph with a maximum number of visible characters.

See Truncating HTML texts
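
To illustrate what "visible characters" means in the description above, here is a minimal sketch that counts only the text a reader would see and ignores the markup. It is not Lino's implementation: the names VisibleText, visible_text and truncate_visible are hypothetical, it uses only the standard library instead of BeautifulSoup, and it returns plain text rather than a sanitized HTML paragraph.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect the text nodes of a document and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def visible_text(htmlstr):
    """Return the text content of htmlstr with all markup removed."""
    parser = VisibleText()
    parser.feed(htmlstr)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def truncate_visible(htmlstr, max_length=300):
    """Hypothetical helper: limit the visible text to max_length characters."""
    text = visible_text(htmlstr)
    if len(text) <= max_length:
        return text
    return text[:max_length].rstrip() + "..."
```

The point of the sketch is that the limit applies to the rendered text, so tags such as <b> or <p> do not count towards max_length.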

Parsers and formats

The two parsers known to be in use are “html.parser” and “lxml”. The results of sanitize_html() and truncate_comment() may vary depending on the parser and on the formatter used to decode the soup tree back to a string.

The parser type is controlled by the USE_LXML attribute of the lino.core.constants module.
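
As a rough sketch of how a boolean flag can select the parser, consider the following. The module-level USE_LXML here is a local stand-in for the attribute in lino.core.constants, and parser_name and make_soup are hypothetical helpers. How Lino itself reads the flag is not shown on this page.

```python
# Local stand-in for lino.core.constants.USE_LXML (assumption for this sketch).
USE_LXML = False


def parser_name(use_lxml=USE_LXML):
    """Map the boolean flag to the parser name BeautifulSoup expects."""
    return "lxml" if use_lxml else "html.parser"


def make_soup(htmlstr, use_lxml=USE_LXML):
    """Hypothetical helper: build a soup tree with the selected parser."""
    # Imported lazily so that parser_name() works without bs4 installed.
    from bs4 import BeautifulSoup

    return BeautifulSoup(htmlstr, features=parser_name(use_lxml))
```

Note that “lxml” is a separate package that must be installed, whereas “html.parser” ships with Python.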

The test cases below demonstrate the differences observed with different parsers and formatters:

>>> from bs4 import BeautifulSoup
>>> html = "<p>One paragraph<br><p>Another paragraph"
>>> soup = BeautifulSoup(html, features="html.parser")
>>> print(soup.decode())
<p>One paragraph<br/><p>Another paragraph</p></p>
>>> soup = BeautifulSoup(html, features="lxml")
>>> print(soup.decode())
<html><body><p>One paragraph<br/></p><p>Another paragraph</p></body></html>
>>> soup = BeautifulSoup(html, features="html.parser")
>>> print(soup.decode(formatter="html5"))
<p>One paragraph<br><p>Another paragraph</p></p>
>>> soup = BeautifulSoup(html, features="lxml")
>>> print(soup.decode(formatter="html5"))
<html><body><p>One paragraph<br></p><p>Another paragraph</p></body></html>

Note that in the last two cases above, which use the “html5” formatter, the <br> tag is rendered without a trailing slash.

The React Quill editor renders self-closing tags without trailing slashes. To stay consistent with it, we use the “html5” formatter in our soup utils.

Further reading

See also Processing embedded images and Bleaching.