Truncating HTML texts

This document is about the truncate_comment() function, which summarizes an HTML text into a single paragraph with the help of BeautifulSoup.

The function was reimplemented in July 2023, triggered by #5039 (Comment with a <base> tag caused Jane to break). In February 2025 we had #5916 (truncate_comment truncates in the middle of a html tag).

This page contains code snippets (lines starting with >>>), which are tested as part of our development workflow. The following snippet imports the functions used throughout this page.

>>> from lino.utils.soup import truncate_comment as tc
>>> from lino.utils.soup import sanitize

Examples

>>> pasted = """<h1 style="color: #5e9ca0;">Styled comment
... <span style="color: #2b2301;">pasted from word!</span> </h1>"""
>>> print(tc(pasted))
...
Styled comment
<span style="color: #2b2301;">pasted from word!</span>
>>> print(tc(pasted, 17))
...
Styled comment
<span style="color: #2b2301;">pa...</span>

The first image remains, but we enforce our own style on it. Any further image is replaced by a ⌧ placeholder.

>>> from lino.utils.soup import SHORT_PREVIEW_IMAGE_HEIGHT
>>> SHORT_PREVIEW_IMAGE_HEIGHT
'8em'
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> print(tc('<IMG SRC="foo" ALT="bar"/>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> two_images = """<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>"""
>>> tc(two_images)
'First <img src="a.jpg" style="float:right;height:8em"/> and ⌧.'
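
The rule illustrated above (keep the first image with an enforced style, replace later ones) can be sketched with the standard library alone. This is a minimal illustration, not the code used by truncate_comment(), which relies on BeautifulSoup. The names ImageLimiter, limit_images and IMAGE_STYLE are hypothetical, and unlike truncate_comment() the sketch keeps paragraph tags and does not reorder attributes.

```python
import html
from html.parser import HTMLParser

# Style copied from the expected output above (assumption: fixed value).
IMAGE_STYLE = "float:right;height:8em"


class ImageLimiter(HTMLParser):
    """Keep the first <img> with an enforced style, replace later ones by ⌧."""

    def __init__(self):
        # convert_charrefs=False passes entities through unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.seen_image = False

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            self.parts.append(self.get_starttag_text())
        elif self.seen_image:
            self.parts.append("⌧")
        else:
            self.seen_image = True
            # The enforced style overrides any style given in the source.
            merged = dict(attrs, style=IMAGE_STYLE)
            rendered = " ".join(
                '{}="{}"'.format(name, html.escape(value or ""))
                for name, value in merged.items())
            self.parts.append("<img {}/>".format(rendered))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img/> arrive here; reuse the logic above.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))


def limit_images(source):
    parser = ImageLimiter()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


print(limit_images('<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>'))
```

Keeping a single flag (seen_image) is enough here because the rule only distinguishes the first image from all later ones.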

Paragraph tags are replaced by whitespace while inline tags remain:

>>> print(tc("Try<pre>rm -r /</pre>and you might regret."))
Try rm -r / and you might regret.
>>> print(tc("Try<i>rm -r /</i>and you might regret."))
Try<i>rm -r /</i>and you might regret.
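
The difference between paragraph-level and inline tags can be sketched as follows. This is only an illustration built on the standard library, not the actual implementation. The set BLOCK_TAGS and the names Flattener and flatten are assumptions made for this example; the list of tags that Lino treats as paragraph-level may differ.

```python
from html.parser import HTMLParser

# Hypothetical set of paragraph-level tags; the real list may differ.
BLOCK_TAGS = {"p", "div", "pre", "ul", "ol", "li", "h1", "h2", "h3"}


class Flattener(HTMLParser):
    """Replace paragraph-level tags by a space and keep all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the source.
        self.parts.append(
            " " if tag in BLOCK_TAGS else self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(" " if tag in BLOCK_TAGS else "</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))


def flatten(source):
    parser = Flattener()
    parser.feed(source)
    parser.close()
    # Collapse the whitespace runs left behind by the removed tags.
    return " ".join("".join(parser.parts).split())


print(flatten("Try<pre>rm -r /</pre>and you might regret."))
print(flatten("Try<i>rm -r /</i>and you might regret."))
```

Note that the final whitespace collapsing also affects whitespace inside attribute values, which a real implementation would probably want to avoid.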

Unknown tags also remain:

>>> print(tc("Try<bad>rm -r /</bad>and you might regret."))
Try<bad>rm -r /</bad>and you might regret.
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph first second
>>> print(tc('Some plain text.'))
Some plain text.

Truncation

>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
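
The examples above suggest that only text characters count towards max_length and that tags still open at the cut are closed afterwards. The following sketch shows one way to get this behaviour with the standard library. It is not the BeautifulSoup-based code of truncate_comment(); Truncator and truncate are hypothetical names, and the sketch assumes well-formed input without void elements such as <br>.

```python
import html
from html.parser import HTMLParser


class Truncator(HTMLParser):
    """Copy HTML until max_length text characters have been emitted."""

    def __init__(self, max_length):
        super().__init__(convert_charrefs=True)
        self.remaining = max_length  # text characters we may still emit
        self.open_tags = []          # stack of currently open tag names
        self.parts = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if not self.done:
            self.open_tags.append(tag)
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.remaining -= len(data)
            self.parts.append(html.escape(data, quote=False))
            return
        # Cut inside this text node, then close every tag that is still open.
        cut = data[:self.remaining]
        self.parts.append(html.escape(cut, quote=False) + "...")
        while self.open_tags:
            self.parts.append("</{}>".format(self.open_tags.pop()))
        self.done = True


def truncate(source, max_length=300):
    parser = Truncator(max_length)
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


print(truncate("A <b>bold</b> and <i>italic</i> thing.", 5))
print(truncate("A <b>bold</b> and <i>italic</i> thing.", 14))
```

Because the cut happens only inside text nodes, the result can never end in the middle of a tag.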

The following two examples are cut at exactly the same place:

>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet cons...

>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'

Multiple paragraphs are summarized:

>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa. bbbb. cccc. ddd...'

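TODO: The above result has one “d” too many at the end. Why? The two-paragraph Lorem ipsum example above shows the same off-by-one: its truncated text has 31 characters instead of 30.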

>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'

Longer examples

The default max_length is 300. In Lino you can override this default value in memo.short_preview_length.

>>> from lino_book import DEMO_DATA
>>> html = (DEMO_DATA / "html" / "wikipedia.html").read_text()
>>> html = sanitize(html)
>>> print(tc(html, 10))
<a class="mw-jump-link" href="#bodyContent">Jump to co...</a>

Even when truncated, the HTML is very long because it contains tags without textual content but with long class, style, title and src attributes. So, according to our rules, even the short_preview of a Wikipedia page takes up quite a lot of space:

>>> len(tc(html, 100))
71381
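
To see how much of such a result is markup rather than text, one can compare the length of the HTML string with the number of text characters it contains. The helper below is a hypothetical sketch for this purpose (text_length is not part of lino.utils.soup).

```python
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count the characters of all text nodes, ignoring tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.count = 0

    def handle_data(self, data):
        self.count += len(data)


def text_length(source):
    parser = TextCounter()
    parser.feed(source)
    parser.close()
    return parser.count


snippet = '<a class="mw-jump-link" href="#bodyContent">Jump to co...</a>'
print(len(snippet), text_length(snippet))
```

For the snippet above the text part is 13 characters (10 characters plus the ellipsis), while the string itself is several times longer.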

TODO: The truncated HTML still contains more than one image (because TextCollector doesn’t descend into the children of <span> tags).

>>> print(tc(html, 100)[:1000])
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>

<span>
<div class="vector-header-start">
<span>
<div class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right" title="Main menu">
<span></span>
<span><span class="vector-icon mw-ui-icon-menu mw-ui-icon-wikimedia-menu"></span>
<span class="vector-dropdown-label-text">Main menu</span>
</span>
<div class="vector-dropdown-content">
<div class="vector-unpinned-container">
</div>
</div>
</div>
</span>
<a class="mw-logo" href="/wiki/Main_Page">
<img alt="" class="mw-logo-icon" src="/static/images/icons/wikipedia.png"/>
<span class="mw-logo-container skin-invert">
<img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>
<img alt="The Free Encyclopedia" class="mw-logo-tagline" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;"/>
</span>
</a>
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.


Here is the second paragraph.

BeautifulSoup

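>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> def walk(ch, indent=0):
...     """Print the type of a node and, recursively, of its children."""
...     prefix = " " * indent
...     if isinstance(ch, Tag):
...         # A tag (or the soup itself): print its name, then its children.
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent + 2)
...     else:
...         # A text node: print its content.
...         print(prefix + str(type(ch)) + " " + repr(ch.string))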
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'

Sanitizing

The truncate_comment function also does basic sanitizing.

>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
...
foo

baz

Let’s try to truncate a whole HTML page:

>>> html_str = """
... <!doctype html><html lang="en">
... <head><title>Bad Request (400)</title></head>
... <body>
... <h1>Bad Request (400)</h1>
... <p></p>
... </body>
... </html>"""
>>> print(tc(html_str))
html
<title>Bad Request (400)</title>

Bad Request (400)


Verifying #5916 (truncate_comment truncates in the middle of a html tag):

Simplified case:

>>> body = """<span class="a"><span class="b">1234</span></span> 5678 90"""
>>> print(tc(body, max_length=7))
<span class="a"><span class="b">1234</span></span> 56...
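
For comparison, here is what happens when the limit is applied to the raw string, counting the characters inside tags as well. This only illustrates the class of problem described in #5916; it is not a reproduction of the former implementation.

```python
body = '<span class="a"><span class="b">1234</span></span> 5678 90'

# Naive approach: slice the raw HTML, so characters inside tags count too.
naive = body[:7] + "..."
print(naive)  # prints: <span c...
```

The slice ends inside the first opening tag, so the result is no longer valid HTML.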

Full case:

>>> body = """After talking about it with <span class="mention"
... data-denotation-char="@"
... data-index="0" data-link="javascript:window.App.runAction({\'actorId\':
... \'users.AllUsers\', \'an\': \'detail\', \'rp\': null, \'status\':
... {\'record_id\': 347}})" data-title="Sharif Mehedi"
... data-value="8lurry">\ufeff<span
... contenteditable="false">@8lurry</span>\ufeff</span> : Yes, let\'s replace the
... card_layout by as_card(). And when working on this, also think about how to
... configure the width of the cards (as mentioned in #5385, BUT let\'s wait with
... actually doing this until we have a concrete use case for cards. Right now they
... are just a kind of nice gimmick.'"""
>>> print(tc(body))
After talking about it with <span class="mention" data-denotation-char="@"
data-index="0" data-link="javascript:window.App.runAction({'actorId':
'users.AllUsers', 'an': 'detail', 'rp': null, 'status': {'record_id': 347}})"
data-title="Sharif Mehedi" data-value="8lurry"><span
contenteditable="false">@8lurry</span></span> : Yes, let's replace the
card_layout by as_card(). And when working on this, also think about how to
configure the width of the cards (as mentioned in #5385, BUT let's wait with
actually doing this until we have a concrete use case for cards. Right now they
are just a ki...

Fixed bugs

Until 20250606, the truncated text did not escape “<” and “>”, causing #6142. Now it works:

>>> print(tc(r"Let's replace [url] memo commands by &lt;a href&gt; tags."))
Let's replace [url] memo commands by &lt;a href&gt; tags.
>>> print(tc("100 < 500"))
100 &lt; 500
>>> print(tc(">>> print('Hello, world!')"))
&gt;&gt;&gt; print('Hello, world!')
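
The expected results above can be reproduced for plain text with Python's html module. The sketch below is an assumption about one possible approach, not a description of what truncate_comment() does internally; escape_text is a hypothetical helper. It unescapes first so that text which is already escaped is not escaped a second time.

```python
import html


def escape_text(text):
    """Escape "<", ">" and "&" without double-escaping existing entities."""
    # quote=False leaves single and double quotes untouched.
    return html.escape(html.unescape(text), quote=False)


print(escape_text("100 < 500"))
print(escape_text("memo commands by &lt;a href&gt; tags."))

# Without the unescape step, an existing entity gets escaped again:
print(html.escape("&gt;", quote=False))  # prints: &amp;gt;
```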

The following snippet shows that #5039 is fixed after 20250606. HTML tags that are escaped in the source text must remain escaped in the result.

>>> escaped_html = """
... <p>For example
... &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;
... &lt;body&gt;
... </p>"""
>>> print(tc(escaped_html))
...
For example
&lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;
&lt;body&gt;

TODO

Some surprising results:

>>> print(tc("<cool>"))
<cool></cool>
>>> print(tc("<<<cool>>>"))
&lt;&lt;<cool>&amp;gt;&amp;gt;</cool>