
Truncating HTML texts

This document describes the truncate_comment function, which summarizes an HTML text into a single paragraph with the help of BeautifulSoup.

The function was reimplemented in July 2023, triggered by #5039 (Comment with a <base> tag caused Jane to break). In February 2025 we had #5916 (truncate_comment truncates in the middle of a html tag).

Note

Code snippets in this document (lines starting with >>>) get tested as part of our development workflow. The following initialization snippet tells you which demo project is being used.

>>> from lino.utils.soup import truncate_comment as tc
>>> from lino.utils.soup import sanitize

Examples

>>> pasted = """<h1 style="color: #5e9ca0;">Styled comment
... <span style="color: #2b2301;">pasted from word!</span> </h1>"""
>>> print(tc(pasted))
...
Styled comment
<span style="color: #2b2301;">pasted from word!</span>
>>> print(tc(pasted, 17))
...
Styled comment
<span style="color: #2b2301;">pa...</span>

Styled comment pasted from word!

>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> print(tc('<IMG SRC="foo" ALT="bar"/>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> from lino.utils.soup import SHORT_PREVIEW_IMAGE_HEIGHT
>>> SHORT_PREVIEW_IMAGE_HEIGHT
'8em'
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph first second
>>> print(tc('Some plain text.'))
Some plain text.

Truncation

>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>

The following two examples are cut at exactly the same place in the text, although the second one contains a <b> tag:

>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet cons...

Lorem ipsum dolor sit amet <BLANKLINE> cons…

>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'
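
The examples above show that max_length counts text characters only and that open tags are closed after the cut. The following is a minimal sketch of that idea, assuming nothing about how truncate_comment is actually implemented: it uses the standard library html.parser instead of BeautifulSoup, and TruncatingParser and truncate_html are hypothetical names that do not exist in lino.utils.soup.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "base"}


class TruncatingParser(HTMLParser):
    """Hypothetical helper for illustration; not part of lino.utils.soup."""

    def __init__(self, max_length):
        super().__init__(convert_charrefs=True)
        self.remaining = max_length  # text characters still allowed
        self.open_tags = []          # stack of currently open elements
        self.chunks = []             # pieces of the output markup
        self.done = False            # True once the text has been cut

    def close_open_tags(self):
        while self.open_tags:
            self.chunks.append("</{}>".format(self.open_tags.pop()))

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs)
        self.chunks.append("<{}{}>".format(tag, rendered))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching element.
        while True:
            closed = self.open_tags.pop()
            self.chunks.append("</{}>".format(closed))
            if closed == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            # Cut inside this text node, then close every open element.
            cut = escape(data[:self.remaining], quote=False)
            self.chunks.append(cut + "...")
            self.done = True
            self.close_open_tags()
        else:
            self.chunks.append(escape(data, quote=False))
            self.remaining -= len(data)


def truncate_html(html, max_length=300):
    parser = TruncatingParser(max_length)
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    parser.close_open_tags()
    return "".join(parser.chunks)
```

Unlike truncate_comment, this sketch keeps the surrounding <p> element and does not merge paragraphs, so truncate_html("<p>A <b>bold</b> and <i>italic</i> thing.", 5) gives '<p>A <b>bol...</b></p>'.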

Multiple paragraphs are summarized:

>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa. bbbb. cccc. ddd...'

TODO: In the above result there is one “d” too many at the end (21 characters of text instead of 20). Why?

>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'
>>> two_images = """<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>"""
>>> tc(two_images)
'First <img src="a.jpg" style="float:right;height:8em"/> and ⌧.'
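
The last result suggests that only the first image is kept and every further one is replaced by a ⌧ placeholder. As a rough illustration of that rule only, here is a regex-based sketch. The name keep_first_image is hypothetical, the sketch does not add the style attribute, and a regular expression is not a robust way to parse HTML, which is why the real function works on a BeautifulSoup tree.

```python
import re

# Matches a single <img ...> or <img .../> tag, case-insensitively.
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def keep_first_image(html, placeholder="⌧"):
    """Hypothetical helper: keep the first <img>, replace the others."""
    seen = 0

    def repl(match):
        nonlocal seen
        seen += 1
        return match.group(0) if seen == 1 else placeholder

    return IMG_RE.sub(repl, html)
```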

Longer examples

The default max_length is 300. In Lino you can override this default value in memo.short_preview_length.

>>> from lino_book import DEMO_DATA
>>> html = (DEMO_DATA / "html" / "wikipedia.html").read_text()
>>> html = sanitize(html)
>>> print(tc(html, 10))
<a class="mw-jump-link" href="#bodyContent">Jump to co...</a>

Even when truncated, the HTML is very long because it contains tags without textual content but with long class, style, title and src attributes. So, according to our rules, even the short_preview of a Wikipedia page will take quite a lot of space:

>>> len(tc(html, 100))
71310
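
To see why the markup can be so much longer than max_length, it helps to compare the length of the markup with the length of the text it contains. The following sketch measures the latter with the standard library html.parser; TextLength and text_length are hypothetical names used for illustration only, and the sample string is made up in the style of the Wikipedia page.

```python
from html.parser import HTMLParser


class TextLength(HTMLParser):
    """Hypothetical helper: sum up the length of all text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.length = 0

    def handle_data(self, data):
        self.length += len(data)


def text_length(html):
    parser = TextLength()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.length


snippet = ('<span class="vector-icon mw-ui-icon-menu"></span>'
           '<a class="mw-jump-link" href="#bodyContent">Jump</a>')
print(len(snippet), text_length(snippet))
```

Only the four characters of "Jump" count as text here; everything else is markup that a length limit on the text does not see.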

TODO: The truncated HTML still contains more than one image, because TextCollector doesn’t descend into the children of <span> tags.

>>> print(tc(html, 100)[:1000])
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>

<span>
<div class="vector-header-start">
<span>
<div class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right" title="Main menu">
<span></span>
<span><span class="vector-icon mw-ui-icon-menu mw-ui-icon-wikimedia-menu"></span>
<span class="vector-dropdown-label-text">Main menu</span>
</span>
<div class="vector-dropdown-content">
<div class="vector-unpinned-container">
</div>
</div>
</div>
</span>
<a class="mw-logo" href="/wiki/Main_Page">
<img alt="" class="mw-logo-icon" src="/static/images/icons/wikipedia.png"/>
<span class="mw-logo-container skin-invert">
<img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>
<img alt="The Free Encyclopedia" class="mw-logo-tagline" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;"/>
</span>
</a>
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.


Here is the second paragraph.

BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> from bs4.element import Tag
>>> def walk(ch, indent=0):
...     """Print the node tree, one line per node, indented by depth."""
...     prefix = " " * indent
...     if isinstance(ch, Tag):  # BeautifulSoup itself is a subclass of Tag
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent + 2)
...     else:  # NavigableString and its subclasses
...         print(prefix + str(type(ch)) + " " + repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'

Sanitizing

The truncate_comment function also does basic sanitizing.

>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
...
foo

baz
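
One common way to do this kind of sanitizing is to keep only a whitelist of tags and attributes and to drop every other tag while keeping its text. The sketch below illustrates that approach with the standard library html.parser. It is an assumption for illustration, not the rule set used by lino.utils.soup: ALLOWED_TAGS, ALLOWED_ATTRS, WhitelistSanitizer and drop_unknown_tags are all hypothetical names.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists; the real rules may differ.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "span", "ul", "ol", "li", "a", "br"}
ALLOWED_ATTRS = {"href", "title", "class"}


class WhitelistSanitizer(HTMLParser):
    """Hypothetical helper: drop tags and attributes not in the whitelists."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep whatever text follows
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs if name in ALLOWED_ATTRS)
        self.chunks.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.chunks.append("</{}>".format(tag))

    def handle_data(self, data):
        self.chunks.append(escape(data, quote=False))


def drop_unknown_tags(html):
    sanitizer = WhitelistSanitizer()
    sanitizer.feed(html)
    sanitizer.close()  # flush any text the parser is still buffering
    return "".join(sanitizer.chunks)
```

With these whitelists, the <html>, <head>, <base> and <body> tags of the example above disappear, and the two paragraphs survive.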

Let’s try to truncate a whole HTML page:

>>> html_str = """
... <!doctype html><html lang="en">
... <head><title>Bad Request (400)</title></head>
... <body>
... <h1>Bad Request (400)</h1>
... <p></p>
... </body>
... </html>"""
>>> print(tc(html_str))
html
<title>Bad Request (400)</title>

Bad Request (400)


Verifying #5916 (truncate_comment truncates in the middle of a html tag):

Simplified case:

>>> body = """<span class="a"><span class="b">1234</span></span> 5678 90"""
>>> print(tc(body, max_length=7))
<span class="a"><span class="b">1234</span></span> 56...

Full case:

>>> body = """After talking about it with <span class="mention"
... data-denotation-char="@"
... data-index="0" data-link="javascript:window.App.runAction({\'actorId\':
... \'users.AllUsers\', \'an\': \'detail\', \'rp\': null, \'status\':
... {\'record_id\': 347}})" data-title="Sharif Mehedi"
... data-value="8lurry">\ufeff<span
... contenteditable="false">@8lurry</span>\ufeff</span> : Yes, let\'s replace the
... card_layout by as_card(). And when working on this, also think about how to
... configure the width of the cards (as mentioned in #5385, BUT let\'s wait with
... actually doing this until we have a concrete use case for cards. Right now they
... are just a kind of nice gimmick.'"""
>>> print(tc(body))
After talking about it with <span class="mention" data-denotation-char="@"
data-index="0" data-link="javascript:window.App.runAction({'actorId':
'users.AllUsers', 'an': 'detail', 'rp': null, 'status': {'record_id': 347}})"
data-title="Sharif Mehedi" data-value="8lurry"><span
contenteditable="false">@8lurry</span></span> : Yes, let's replace the
card_layout by as_card(). And when working on this, also think about how to
configure the width of the cards (as mentioned in #5385, BUT let's wait with
actually doing this until we have a concrete use case for cards. Right now they
are just a ki...

TODO

TODO: the following snippet is skipped because #5039 is not yet fixed. HTML tags that are escaped in the source text must remain escaped in the result.

>>> escaped_html = """
... <p>For example
... &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;
... &lt;body&gt;
... </p>
... """
>>> print(tc(escaped_html))
...
For example
&lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;&lt;body&gt;
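
The requirement can be illustrated without Lino. An HTML parser hands out text nodes with their entities already decoded, so whatever writes the text back must escape it again. This is a minimal sketch using only the standard library html module, and it says nothing about where the fix belongs in truncate_comment.

```python
from html import escape, unescape

source = '&lt;base href="bar" target="_blank"&gt;'

# What a parser typically hands out as the content of the text node.
decoded = unescape(source)

# Writing the text back without escaping would turn it into a real tag;
# escaping it again restores the original source.
reescaped = escape(decoded, quote=False)

print(decoded)
print(reescaped)
```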