`lino.utils.soup`

The lino.utils.soup module defines two functions, sanitize() and truncate_comment(), both of which are implemented with the help of BeautifulSoup.

This page contains code snippets (lines starting with >>>), which are tested as part of our development workflow. The following snippet initializes the demo project used throughout this page.

The `sanitize()` function

lino.utils.soup.sanitize(): Parse the given HTML markup html and return a sanitized version of it.

lino.utils.soup.ALLOWED_TAGS

The set of tag names that are to remain in sanitized HTML.

>>> from lino.utils.soup import ALLOWED_TAGS
>>> from pprint import pprint
>>> pprint(ALLOWED_TAGS)
frozenset({'a', 'b', 'br', 'def', 'div', 'em', 'i', 'img', 'li', 'ol', 'p',
'pre', 'span', 'strong', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead',
'tr', 'ul'})

lino.utils.soup.ALLOWED_ATTRIBUTES

A dictionary mapping tag names to the set of attribute names that are to remain in sanitized HTML.

>>> from lino.utils.soup import ALLOWED_ATTRIBUTES
>>> pprint(ALLOWED_ATTRIBUTES, sort_dicts=True)
...
{'a': {'href', 'title'},
 'abbr': {'title'},
 'acronym': {'title'},
 'p': {'align'},
 'span': {'class',
          'contenteditable',
          'data-denotation-char',
          'data-index',
          'data-link',
          'data-title',
          'data-value'}}

The above snippet is skipped during testing because pprint() may display the contents of a set in arbitrary order, even when sort_dicts is set to True.
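If a reproducible rendering is wanted, one option is to convert every set to a sorted list before printing. The following is a minimal sketch of that idea; the `attributes` dictionary is a small hypothetical stand-in, not the real ALLOWED_ATTRIBUTES.

```python
from pprint import pformat

# Hypothetical stand-in for a mapping of tag names to sets of attribute names.
attributes = {"p": {"align"}, "a": {"title", "href"}}

# A sorted list has a stable order, unlike the repr() of a small set.
stable = {tag: sorted(names) for tag, names in attributes.items()}

text = pformat(stable, sort_dicts=True)
print(text)  # {'a': ['href', 'title'], 'p': ['align']}
```

The trade-off is that the printed value no longer shows that the originals are sets.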

Examples

Here are some tests to verify whether sanitize() does what we want.

>>> from lino.utils.soup import sanitize

Sanitizing “normalizes” the HTML content:

>>> print(sanitize("<p>One paragraph<p>Another paragraph"))
<p>One paragraph</p><p>Another paragraph</p>

>>> print(sanitize("<pre>"))
<pre></pre>

>>> print(sanitize("<pre>\n</pre>"))
<pre>
</pre>

When the content is a single <p> element, sanitizing no longer unwraps it:

>>> print(sanitize("<p>One line<br>Another line"))
<p>One line<br/>Another line</p>

>>> print(sanitize('<p align="center">One<br>two'))
<p align="center">One<br/>two</p>

Plain text becomes a single paragraph and gets wrapped into a <p> tag:

>>> print(sanitize("Foo"))
<p>Foo</p>

Characters with special meaning get escaped:

>>> print(sanitize("Foo & Bar, Inc."))
<p>Foo &amp; Bar, Inc.</p>

>>> print(sanitize("When a < b and b < c then a < c"))
<p>When a &lt; b and b &lt; c then a &lt; c</p>

But valid formatting tags are recognized and preserved:

>>> print(sanitize("When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>."))
<p>When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>.</p>

Here is a surprising behaviour, which shows that you should still escape literal text yourself:

>>> print(sanitize("About the <p> tag"))
<p>About the </p><p> tag</p>
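Escaping literal text before it is handed to sanitize() can be done with the standard library. The sketch below only illustrates the escaping step with html.escape(); it does not call sanitize() and makes no claim about what sanitize() returns for the escaped string.

```python
from html import escape

raw = "About the <p> tag"

# escape() replaces &, < and > with entities, so "<p>" becomes literal text
# instead of being parsed as an opening tag.
escaped = escape(raw)
print(escaped)  # About the &lt;p&gt; tag
```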

The output is UTF-8 encoded, so we don’t need to escape umlauts and accents.

>>> print(sanitize("Ein süßes Kätzchen"))
<p>Ein süßes Kätzchen</p>

>>> print(sanitize("Monsieur l'Évêque loge à l'hôtel"))
<p>Monsieur l'Évêque loge à l'hôtel</p>

Even if you escape umlauts, sanitizing will render them as UTF-8. We are in the 21st century after all:

>>> print(sanitize("Ein s&uuml;&szlig;es K&auml;tzchen"))
<p>Ein süßes Kätzchen</p>
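The mapping from named entities to characters can be reproduced with the standard library's html.unescape(). This is only an illustration of the entity decoding, not a description of how sanitize() works internally.

```python
from html import unescape

# Named character references are replaced by the characters they stand for.
decoded = unescape("Ein s&uuml;&szlig;es K&auml;tzchen")
print(decoded)  # Ein süßes Kätzchen
```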

An empty string remains an empty string:

>>> sanitize("")
''

More examples

>>> print(sanitize("<pre></pre>"))
<pre></pre>

>>> print(sanitize("<p>Foo</p>"))
<p>Foo</p>

>>> print(sanitize("One<br>two"))
<p>One<br/>two</p>

>>> print(sanitize("One<br>two</p>"))
<p>One<br/>two</p>

>>> print(sanitize("<p></p>"))
<p></p>

>>> print(sanitize(""))

>>> content = """
... No tag at beginning of text.
... bla bLTaQSTyI80t2t8l
... foo bar.
... And here is some <b>bold</b> text.
...
... """
>>> print(sanitize(content))
<p>No tag at beginning of text.
bla bLTaQSTyI80t2t8l
foo bar.
And here is some <b>bold</b> text.</p>

>>> content = """
... <p align="right">First paragraph</p>
... <p onclick="kill()">Second paragraph</p>
... """
>>> print(sanitize(content))
<p align="right">First paragraph</p>
<p>Second paragraph</p>
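The idea of an allow-list for tags and attributes can be sketched with the standard library alone. The code below is not Lino's implementation (which relies on BeautifulSoup); the names `SKETCH_TAGS`, `SKETCH_ATTRS`, `AllowListFilter` and `filter_markup` are hypothetical and exist only for this illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists for this sketch only (not Lino's real settings).
SKETCH_TAGS = {"p", "b", "i"}
SKETCH_ATTRS = {"p": {"align"}}


class AllowListFilter(HTMLParser):
    """Re-emit markup, dropping tags and attributes that are not allowed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in SKETCH_TAGS:
            return  # drop the tag itself; its text content is still emitted
        allowed = SKETCH_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKETCH_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is re-escaped so that & and < cannot turn into markup.
        self.out.append(escape(data, quote=False))


def filter_markup(markup):
    parser = AllowListFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


result = filter_markup('<p align="right">One</p><p onclick="kill()">Two</p>')
print(result)  # <p align="right">One</p><p>Two</p>
```

Unlike sanitize(), this sketch does not repair unclosed tags or wrap plain text into a paragraph, and it keeps the text content of dropped tags, so it should not be used as a security filter.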

>>> content = """
... <p>Here is a code example:</p>
... <p class="ql-code-block">int i = 2;</p>
... <p>More explanations</p>
... """
>>> print(sanitize(content))
<p>Here is a code example:</p>
<p class="ql-code-block">int i = 2;</p>
<p>More explanations</p>

>>> content = """
... <!DOCTYPE html>
... <html>
...   <head>
...     <meta http-equiv="content-type" content="text/html; charset=UTF-8">
...     <title>Baby</title>
...   </head>
...   <body>
...     This is a descriptive text with <b>some</b> formatting.<br>
...     <br>
...     Here is a second paragraph.<br>
...     <br>
...   </body>
... </html>
... """

>>> print(sanitize(content))
This is a descriptive text with <b>some</b> formatting.<br/>
<br/>
    Here is a second paragraph.<br/>
<br/>

The `truncate_comment()` function

lino.utils.soup.truncate_comment(htmlstr, max_length=300): Return a single paragraph with a maximum number of visible characters.

>>> from lino.utils.soup import truncate_comment as tc

Examples

>>> pasted = """<h1 style="color: #5e9ca0;">Styled comment
... <span style="color: #2b2301;">pasted from word!</span> </h1>"""

>>> print(tc(pasted))
...
<span>Styled comment
<span style="color: #2b2301;">pasted from word!</span> </span>

>>> print(tc(pasted, 17))
...
<span>Styled comment
<span style="color: #2b2301;">pa...</span> </span>

The first image remains but we enforce our style.

>>> from lino.utils.soup import SHORT_PREVIEW_IMAGE_HEIGHT
>>> SHORT_PREVIEW_IMAGE_HEIGHT
'8em'

>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>

>>> print(tc('<IMG SRC="foo" ALT="bar"/>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>

>>> two_images = """<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>"""
>>> tc(two_images)
'First <img src="a.jpg" style="float:right;height:8em"/> and ⌧.'

Paragraph tags are replaced by whitespace, while inline tags remain:

>>> print(tc("Try<pre>rm -r /</pre>and you might regret."))
Try rm -r / and you might regret.

>>> print(tc("Try <i>rm -r /</i>and you might regret."))
Try <i>rm -r /</i>and you might regret.

Unknown tags get sanitized into a <span>:

>>> print(tc("Try <bad>rm -r /</bad>and you might regret."))
Try <span>rm -r /</span>and you might regret.

>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph first second

>>> print(tc('Some plain text.'))
Some plain text.

>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.


Here is the second paragraph.

Truncation

>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'

>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.

>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>

>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>

The following two examples are cut at exactly the same place, because only visible characters count towards the limit:

>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...

>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...

>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
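The cut position in the examples above depends only on visible characters, not on markup. A minimal sketch of that counting rule, written with the standard library, might look as follows. It is not the code behind truncate_comment(): the names `VisibleTruncator` and `truncate_visible` are hypothetical, attributes are ignored, and only `<p>` is treated as a paragraph tag.

```python
from html import escape
from html.parser import HTMLParser


class VisibleTruncator(HTMLParser):
    """Copy inline markup until max_length visible characters are emitted."""

    def __init__(self, max_length):
        super().__init__(convert_charrefs=True)
        self.remaining = max_length
        self.open_tags = []
        self.out = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag == "p":  # paragraph tags are not copied
            return
        self.open_tags.append(tag)
        self.out.append("<{}>".format(tag))  # attributes ignored in this sketch

    def handle_endtag(self, tag):
        if self.done or tag == "p":
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.remaining -= len(data)
            self.out.append(escape(data, quote=False))
            return
        # Budget exhausted: cut the text, add an ellipsis, close open tags.
        self.out.append(escape(data[: self.remaining], quote=False) + "...")
        self.out.extend("</{}>".format(t) for t in reversed(self.open_tags))
        self.done = True


def truncate_visible(markup, max_length):
    parser = VisibleTruncator(max_length)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


bold = "<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>"
print(truncate_visible(bold, 30))  # Lorem <b>ipsum</b> dolor sit amet, co...
print(truncate_visible(bold, 10))  # Lorem <b>ipsu...</b>
```

Because tags never reduce the remaining budget, the plain and the bold variant of the same sentence end after the same visible text.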

>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet cons...


>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'

Multiple paragraphs are summarized:

>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa. bbbb. cccc. ddd...'

TODO: In the above result there is one “d” too many at the end. Why?

>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'

Exploring BeautifulSoup

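>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> def walk(ch, indent=0):
...     """Print the type of each node in the tree, indented by depth."""
...     prefix = " " * indent
...     if isinstance(ch, Tag):
...         # Tags (including the BeautifulSoup root) have a name and children.
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent + 2)
...     else:
...         # Leaf nodes such as NavigableString only carry text.
...         print(prefix + str(type(ch)) + " " + repr(ch.string))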
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'

>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
