Welcome | Get started | Dive | Contribute | Topics | Reference | Changes | More
lino.utils.soup
¶
The lino.utils.soup
module defines two functions sanitize()
and
truncate_comments()
, which are both done with the help of BeautifulSoup.
This page contains code snippets (lines starting with >>>
), which are
being tested during our development workflow. The following
snippet initializes the demo project used throughout this page.
The sanitize()
function¶
- lino.utils.soup.sanitize()¶
Parse the given HTML markup html and return a sanitized version of if.
- lino.utils.soup.ALLOWED_TAGS¶
A list of tag names that are to remain in sanitized HTML.
>>> from lino.utils.soup import ALLOWED_TAGS >>> from pprint import pprint >>> pprint(ALLOWED_TAGS) frozenset({'a', 'b', 'br', 'def', 'div', 'em', 'i', 'img', 'li', 'ol', 'p', 'pre', 'span', 'strong', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'ul'})
- lino.utils.soup.ALLOWED_ATTRIBUTES¶
A dictionary mapping tagnames to a list of attribute names that are to remain in sanitized HTML .
>>> from lino.utils.soup import ALLOWED_ATTRIBUTES >>> pprint(ALLOWED_ATTRIBUTES, sort_dicts=True) ... {'a': {'href', 'title'}, 'abbr': {'title'}, 'acronym': {'title'}, 'p': {'align'}, 'span': {'class', 'contenteditable', 'data-denotation-char', 'data-index', 'data-link', 'data-title', 'data-value'}}
The above snippet is skipped because
pprint()
displays the content of sets in arbitrary ordering even when sort_dicts is set to True.
Examples¶
Here are some tests to verify whether sanitize()
does what we want.
>>> from lino.utils.soup import sanitize
Sanitizing “normalizes” the html content:
>>> print(sanitize("<p>One paragraph<p>Another paragraph"))
<p>One paragraph</p><p>Another paragraph</p>
>>> print(sanitize("<pre>"))
<pre></pre>
>>> print(sanitize("<pre>\n</pre>"))
<pre>
</pre>
When content is a single <p>
tag, sanitizing NO LONGER unwraps it:
>>> print(sanitize("<p>One line<br>Another line"))
<p>One line<br/>Another line</p>
>>> print(sanitize('<p align="center">One<br>two'))
<p align="center">One<br/>two</p>
Plain text becomes a single paragraph and gets wrapped into a <p>
tag:
>>> print(sanitize("Foo"))
<p>Foo</p>
Characters with special meaning get escaped:
>>> print(sanitize("Foo & Bar, Inc."))
<p>Foo & Bar, Inc.</p>
>>> print(sanitize("When a < b and b < c then a < c"))
<p>When a < b and b < c then a < c</p>
But valid formatting tags are recognized and preserved:
>>> print(sanitize("When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>."))
<p>When <i>a</i> <b>and</b> <i>b</i> then <i>c</i>.</p>
Here is a surprising behaviour, which shows that you still should better escape yourself
>>> print(sanitize("About the <p> tag"))
<p>About the </p><p> tag</p>
The output is UTF-8 encoded, so we don’t need to escape umlauts and accents.
>>> print(sanitize("Ein süßes Kätzchen"))
<p>Ein süßes Kätzchen</p>
>>> print(sanitize("Monsieur l'Évêque loge à l'hôtel"))
<p>Monsieur l'Évêque loge à l'hôtel</p>
Even if you escape umlauts, sanitizing will render them as UTF-8. We are in the 21st century after all:
>>> print(sanitize("Ein süßes Kätzchen"))
<p>Ein süßes Kätzchen</p>
An empty string remains an empty string:
>>> sanitize("")
''
More examples¶
>>> print(sanitize("<pre></pre>"))
<pre></pre>
>>> print(sanitize("<p>Foo</p>"))
<p>Foo</p>
>>> print(sanitize("One<br>two"))
<p>One<br/>two</p>
>>> print(sanitize("One<br>two</p>"))
<p>One<br/>two</p>
>>> print(sanitize("<p></p>"))
<p></p>
>>> print(sanitize(""))
>>> content = """
... No tag at beginning of text.
... bla bLTaQSTyI80t2t8l
... foo bar.
... And here is some <b>bold</b> text.
...
... """
>>> print(sanitize(content))
<p>No tag at beginning of text.
bla bLTaQSTyI80t2t8l
foo bar.
And here is some <b>bold</b> text.</p>
>>> content = """
... <p align="right">First paragraph</p>
... <p onclick="kill()">Second paragraph</p>
... """
>>> print(sanitize(content))
<p align="right">First paragraph</p>
<p>Second paragraph</p>
>>> content = """
... <p>Here is a code example:</p>
... <p class="ql-code-block">int i = 2;</p>
... <p>More explanations</p>
... """
>>> print(sanitize(content))
<p>Here is a code example:</p>
<p class="ql-code-block">int i = 2;</p>
<p>More explanations</p>
>>> content = """
... <!DOCTYPE html>
... <html>
... <head>
... <meta http-equiv="content-type" content="text/html; charset=UTF-8">
... <title>Baby</title>
... </head>
... <body>
... This is a descriptive text with <b>some</b> formatting.<br>
... <br>
... Here is a second paragraph.<br>
... <br>
... </body>
... </html>
... """
>>> print(sanitize(content))
This is a descriptive text with <b>some</b> formatting.<br/>
<br/>
Here is a second paragraph.<br/>
<br/>
The truncate_comment()
function¶
- lino.utils.soup.truncate_comment(htmlstr, max_length=300)¶
Return a single paragraph with a maximum number of visible chars.
>>> from lino.utils.soup import truncate_comment as tc
Examples¶
>>> pasted = """<h1 style="color: #5e9ca0;">Styled comment
... <span style="color: #2b2301;">pasted from word!</span> </h1>"""
>>> print(tc(pasted))
...
<span>Styled comment
<span style="color: #2b2301;">pasted from word!</span> </span>
>>> print(tc(pasted, 17))
...
<span>Styled comment
<span style="color: #2b2301;">pa...</span> </span>
The first image remains but we enforce our style.
>>> from lino.utils.soup import SHORT_PREVIEW_IMAGE_HEIGHT
>>> SHORT_PREVIEW_IMAGE_HEIGHT
'8em'
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> print(tc('<IMG SRC="foo" ALT="bar"/>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> two_images = """<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>"""
>>> tc(two_images)
'First <img src="a.jpg" style="float:right;height:8em"/> and ⌧.'
Paragraph tags are replaced by a whitespace while inline tags remain:
>>> print(tc("Try<pre>rm -r /</pre>and you might regret."))
Try rm -r / and you might regret.
>>> print(tc("Try <i>rm -r /</i>and you might regret."))
Try <i>rm -r /</i>and you might regret.
Unknown tags get sanitized into a <span>:
>>> print(tc("Try <bad>rm -r /</bad>and you might regret."))
Try <span>rm -r /</span>and you might regret.
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph first second
>>> print(tc('Some plain text.'))
Some plain text.
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.
Here is the second paragraph.
Truncation¶
>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
The two following examples are cut at exactly the same place:
>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet cons...
Lorem ipsum dolor sit amet <BLANKLINE> cons…
>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'
Multiple paragraphs are summarized:
>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa. bbbb. cccc. ddd...'
TODO: In above result there is one “d” too much at the end. Why?
>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'
Exploring BeautifulSoup¶
>>> from bs4 import BeautifulSoup
>>> def walk(ch, indent=0):
... prefix = " " * indent
... if hasattr(ch, 'tag'):
... print(prefix + str(type(ch)) + " " + ch.name + ":")
... for c in ch.children:
... walk(c, indent+2)
... else:
... print(prefix + str(type(ch)) + " " + repr(ch.string))
... # print(prefix+repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
<class 'bs4.element.Tag'> p:
<class 'bs4.element.NavigableString'> 'A '
<class 'bs4.element.Tag'> b:
<class 'bs4.element.NavigableString'> 'bold'
<class 'bs4.element.NavigableString'> ' and '
<class 'bs4.element.Tag'> i:
<class 'bs4.element.NavigableString'> 'italic'
<class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
<class 'bs4.element.Tag'> p:
<class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
Further reading¶
See also Truncating HTML texts and Processing embedded images and Bleaching.