lino.utils.soup¶
The lino.utils.soup module defines two functions, sanitize_html() and
truncate_comment(), both of which are implemented with the help of BeautifulSoup.
This page contains code snippets (lines starting with >>>), which are
being tested during our development workflow. The following
snippet initializes the demo project used throughout this page.
>>> from lino_book.projects.min1.startup import *
The sanitize_html() and truncate_comment() functions¶
The lino.utils.soup module defines two functions, sanitize_html() and
truncate_comment(). Sanitizing is less than.
- lino.utils.soup.sanitize_html()¶
Parse the given HTML markup html and return a sanitized version of it.
- lino.utils.soup.truncate_comment(htmlstr, max_length=300)¶
Parse the given HTML markup, sanitize it and then return a single paragraph with a maximum number of visible characters.
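The general idea behind both functions can be illustrated with a minimal sketch. This is not Lino's implementation: the names sketch_sanitize() and sketch_truncate(), the allowlists, and the decision to drop all inline markup when truncating are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlists for illustration; Lino's actual rules may differ.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "a", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}


def sketch_sanitize(html):
    """Remove script/style elements, unwrap unknown tags, drop unknown attributes."""
    soup = BeautifulSoup(html, features="html.parser")
    for tag in soup.find_all(True):
        if tag.name in ("script", "style"):
            tag.extract()  # remove the element together with its content
        elif tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # keep the children, drop the tag itself
        else:
            allowed = ALLOWED_ATTRS.get(tag.name, set())
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
    return soup.decode(formatter="html5")


def sketch_truncate(html, max_length=300):
    """Sanitize, then return one <p> with at most max_length visible characters."""
    soup = BeautifulSoup(sketch_sanitize(html), features="html.parser")
    # Assumption: inline markup is discarded and text nodes are joined by spaces.
    text = soup.get_text(" ", strip=True)[:max_length]
    p = soup.new_tag("p")
    p.string = text  # special characters are escaped again when decoding
    return p.decode(formatter="html5")
```

The sketch counts only visible text against max_length, which mirrors the "visible characters" wording above. A real implementation would likely preserve some inline markup instead of flattening everything to plain text.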
Parsers and formats¶
Two commonly used parsers are “html.parser” and “lxml”. The results of
sanitize_html() and truncate_comment() may vary depending on the parser and on
the formatter used to decode the soup tree back to a string.
The parser type is controlled by the USE_LXML
attribute of the lino.core.constants module.
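As a rough sketch of how such a switch might select the parser: the local USE_LXML variable below is a stand-in for the constant in lino.core.constants, and make_soup() is a hypothetical helper, not part of Lino.

```python
from bs4 import BeautifulSoup

# Stand-in for lino.core.constants.USE_LXML (assumption for this sketch).
USE_LXML = False


def make_soup(html, use_lxml=USE_LXML):
    """Parse html with lxml when requested, else with the built-in html.parser."""
    features = "lxml" if use_lxml else "html.parser"
    return BeautifulSoup(html, features=features)


soup = make_soup("<p>One paragraph<br><p>Another paragraph")
print(soup.decode(formatter="html5"))
# prints: <p>One paragraph<br><p>Another paragraph</p></p>
```

Passing use_lxml=True requires the lxml package to be installed and yields a complete document wrapped in html and body tags.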
The test cases below demonstrate the differences observed with different parsers and formatters:
>>> from bs4 import BeautifulSoup
>>> html = "<p>One paragraph<br><p>Another paragraph"
>>> soup = BeautifulSoup(html, features="html.parser")
>>> print(soup.decode())
<p>One paragraph<br/><p>Another paragraph</p></p>
>>> soup = BeautifulSoup(html, features="lxml")
>>> print(soup.decode())
<html><body><p>One paragraph<br/></p><p>Another paragraph</p></body></html>
>>> soup = BeautifulSoup(html, features="html.parser")
>>> print(soup.decode(formatter="html5"))
<p>One paragraph<br><p>Another paragraph</p></p>
>>> soup = BeautifulSoup(html, features="lxml")
>>> print(soup.decode(formatter="html5"))
<html><body><p>One paragraph<br></p><p>Another paragraph</p></body></html>
Note that with the “html5” formatter (the last two cases above), the <br> tag
is rendered without a trailing slash.
The React Quill editor also renders self-closing tags without trailing slashes.
For consistency, we use the “html5” formatter in our soup utils.
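The consistency argument can be checked with a small sketch: whichever way a <br> tag is written in the input, decoding with the “html5” formatter produces the same string. The normalize() helper is hypothetical and exists only for this illustration.

```python
from bs4 import BeautifulSoup


def normalize(html):
    """Parse with html.parser and serialize with the html5 formatter."""
    soup = BeautifulSoup(html, features="html.parser")
    return soup.decode(formatter="html5")


# Both spellings of the void element end up without a trailing slash.
with_slash = normalize("<p>a<br/>b</p>")
without_slash = normalize("<p>a<br>b</p>")
print(with_slash == without_slash)
```

A round trip through normalize() is also stable: normalizing the output a second time returns the same string, so stored markup does not drift between saves.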
Further reading¶
See also Processing embedded images and Bleaching.