Truncating HTML texts¶
This document digs deeper into the truncate_comment() function.
The function was reimplemented in July 2023, triggered by #5039
(Comment with a <base> tag caused Jane to break).
The truncate_comment() function also sanitizes the content, which is done
by lino.utils.soup.sanitize().
This page contains code snippets (lines starting with >>>), which are
being tested during our development workflow. The following
snippet initializes the demo project used throughout this page.
>>> from lino_book.projects.min1.startup import *
>>> from lino.utils.soup import sanitized_soup
>>> from lino.utils.soup import truncate_comment as tc
Examples¶
>>> pasted = """<h1 style="color: #5e9ca0;">Styled comment
... <span style="color: #2b2301;">pasted from word!</span> </h1>"""
>>> print(tc(pasted))
...
Styled comment
<span style="color: #2b2301;">pasted from word!</span>
>>> print(tc(pasted, 17))
...
Styled comment
<span style="color: #2b2301;">pa...</span>
The first image remains but we enforce our style.
>>> from lino.utils.soup import SHORT_PREVIEW_IMAGE_HEIGHT
>>> SHORT_PREVIEW_IMAGE_HEIGHT
'8em'
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;width:auto;height:8em"/>
>>> print(tc('<IMG SRC="foo" ALT="bar"/>'))
<img alt="bar" src="foo" style="float:right;width:auto;height:8em"/>
>>> two_images = """<p>First <img src="a.jpg"/> and <img src="b.jpg"/>.</p>"""
>>> tc(two_images)
'First <img src="a.jpg" style="float:right;width:auto;height:8em"/> and ⌧.'
Paragraph tags are replaced by whitespace while inline tags remain:
>>> print(tc("Try<pre>rm -r /</pre>and you might regret."))
Try rm -r / and you might regret.
>>> print(tc("Try <i>rm -r /</i>and you might regret."))
Try <i>rm -r /</i>and you might regret.
Unknown tags get sanitized into a <span>:
>>> print(tc("Try <bad>rm -r /</bad>and you might regret."))
Try <span>rm -r /</span>and you might regret.
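The three rules shown above (block-level tags become whitespace, inline tags are kept, unknown tags become a <span>) can be sketched with the standard library alone. This is only an illustration, not the code in lino.utils.soup: the names BLOCK_TAGS, INLINE_TAGS, Flattener and flatten are hypothetical, the tag sets are assumptions, and attributes are dropped for brevity.

```python
from html import escape
from html.parser import HTMLParser

# Assumed, illustrative tag sets; the real lists in Lino may differ.
BLOCK_TAGS = {"p", "div", "pre", "h1", "h2", "h3", "ul", "ol", "li"}
INLINE_TAGS = {"b", "i", "em", "strong", "span", "a"}


class Flattener(HTMLParser):
    """Keep inline tags, turn block tags into a space, rename the rest."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _emit_tag(self, tag, closing=False):
        if tag in BLOCK_TAGS:
            # A block-level boundary becomes plain whitespace.
            self.parts.append(" ")
            return
        # Unknown tags are rendered as a neutral <span>.
        name = tag if tag in INLINE_TAGS else "span"
        self.parts.append("</%s>" % name if closing else "<%s>" % name)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag)

    def handle_endtag(self, tag):
        self._emit_tag(tag, closing=True)

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again before output.
        self.parts.append(escape(data, quote=False))


def flatten(html):
    parser = Flattener()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


print(flatten("Try<pre>rm -r /</pre>and you might regret."))
print(flatten("Try <bad>rm -r /</bad>and you might regret."))
```

For these inputs the sketch gives the same text as the examples above, but it makes no attempt to cover the other features of truncate_comment().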
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph first second
>>> print(tc('Some plain text.'))
Some plain text.
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.
Here is the second paragraph.
>>> body = "<p>A <strong><p></strong> tag is better than a <div> tag.</p>"
>>> print(sanitized_soup(body))
<p>A <strong><p></strong> tag is better than a <div> tag.</p>
>>> print(tc(body))
A <strong><p></strong> tag is better than a <div> tag.
Truncation¶
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
The two following examples are cut at exactly the same place:
>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
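The examples above suggest that only text characters count toward max_length and that any tag still open at the cut is closed after the ellipsis. The following is a minimal sketch of that idea using the standard library. It is not Lino's implementation: Truncator, truncate_html and VOID_TAGS are hypothetical names, and the sketch does no sanitizing and does not remove paragraph tags.

```python
from html import escape
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Truncator(HTMLParser):
    """Copy HTML until max_length text characters have been emitted."""

    def __init__(self, max_length):
        super().__init__()
        self.remaining = max_length
        self.parts = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Reuse the original tag text so that attributes are preserved.
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.remaining -= len(data)
            self.parts.append(escape(data, quote=False))
            return
        # Cut inside this text node, then close every tag still open.
        self.parts.append(escape(data[:self.remaining], quote=False) + "...")
        for tag in reversed(self.open_tags):
            self.parts.append("</%s>" % tag)
        self.done = True


def truncate_html(html, max_length):
    parser = Truncator(max_length)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(truncate_html("A <b>bold</b> and <i>italic</i> thing.", 14))
```

Because tags are copied without being counted, the plain and the marked-up Lorem ipsum strings are cut after the same 30 text characters.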
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet cons...
Lorem ipsum dolor sit amet <BLANKLINE> cons…
>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'
Multiple paragraphs are summarized:
>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa. bbbb. cccc. ddd...'
TODO: In the above result there is one “d” too many at the end. Why?
>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'
Longer examples¶
The default max_length of truncate_comment() is 300. In Lino we can
override this default value in lino.modlib.memo.short_preview_length.
One day we made a copy of the English Wikipedia start page and stored it in our demo data for testing purposes. It contains 121 KB of data.
>>> from lino_book import DEMO_DATA
>>> html = (DEMO_DATA / "html" / "wikipedia.html").read_text()
>>> len(html)
121429
Let’s truncate it:
>>> from lino.utils.soup import truncate_comment as tc
>>> print(tc(html, 10))
<a class="mw-jump-link" href="#bodyContent">Jump to co...</a>
Even when truncated, the HTML is very long because it contains tags that have no textual content but carry long class, style, title and src attributes. So, according to our rules, even the short_preview of a Wikipedia page takes quite a lot of space:
>>> len(tc(html, 100)) > 70000
True
TODO: The truncated HTML still contains more than one image (because TextCollector doesn’t descend into the children of <span> tags):
>>> print(tc(html, 100)[:1048])
...
<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<span>
<div class="vector-header-start">
<span>
<div class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right" id="vector-main-menu-dropdown" title="Main menu">
<span/>
<span><span class="vector-icon mw-ui-icon-menu mw-ui-icon-wikimedia-menu"></span>
<span class="vector-dropdown-label-text">Main menu</span>
</span>
<div class="vector-dropdown-content">
<div class="vector-unpinned-container" id="vector-main-menu-unpinned-container">
</div>
</div>
</div>
</span>
<a class="mw-logo" href="/wiki/Main_Page">
<img alt="" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>
<span class="mw-logo-container skin-invert">
<img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>
<img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style
>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
...
foo <span/>baz
Let’s try to truncate a whole HTML page:
>>> html_str = """
... <!doctype html><html lang="en">
... <head><title>Bad Request (400)</title></head>
... <body>
... <h1>Bad Request (400)</h1>
... <p></p>
... </body>
... </html>"""
>>> print(tc(html_str))
Bad Request (400)
Fixed bugs¶
Truncates in the middle of a html tag¶
This section verifies that #5916 (truncate_comment truncates in the middle of a html tag) is fixed.
Simplified case:
>>> body = """<span class="a"><span class="b">1234</span></span> 5678 90"""
>>> print(tc(body, max_length=7))
<span class="a"><span class="b">1234</span></span> 56...
Full case:
>>> body = """After talking about it with <span class="mention"
... data-denotation-char="@"
... data-index="0" data-link="javascript:window.App.runAction({\'actorId\':
... \'users.AllUsers\', \'an\': \'detail\', \'rp\': null, \'status\':
... {\'record_id\': 347}})" data-title="Sharif Mehedi"
... data-value="8lurry">\ufeff<span
... contenteditable="false">@8lurry</span>\ufeff</span> : Yes, let\'s replace the
... card_layout by as_card(). And when working on this, also think about how to
... configure the width of the cards (as mentioned in #5385, BUT let\'s wait with
... actually doing this until we have a concrete use case for cards. Right now they
... are just a kind of nice gimmick.'"""
>>> print(tc(body))
After talking about it with <span class="mention" data-denotation-char="@"
data-index="0" data-link="javascript:window.App.runAction({'actorId':
'users.AllUsers', 'an': 'detail', 'rp': null, 'status': {'record_id': 347}})"
data-title="Sharif Mehedi" data-value="8lurry"><span
contenteditable="false">@8lurry</span></span> : Yes, let's replace the
card_layout by as_card(). And when working on this, also think about how to
configure the width of the cards (as mentioned in #5385, BUT let's wait with
actually doing this until we have a concrete use case for cards. Right now they
are just a ki...
Short preview of a comment doesn’t escape ‘<’ and ‘>’¶
Until 20250606, the truncated text did not escape “<” and “>”, causing #6142. Now it works:
>>> print(tc(r"Let's replace [url] memo commands by <a href> tags."))
Let's replace [url] memo commands by <a href> tags.
>>> print(tc("100 < 500"))
100 < 500
>>> print(tc(">>> print('Hello, world!')"))
>>> print('Hello, world!')
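As a reminder of why escaping matters here: a text node containing “<” or “>” must be escaped when it is written back as HTML, otherwise a browser reads it as markup. The sketch below uses html.escape from the standard library. The helper name as_html_text is hypothetical and this is not necessarily how Lino does it.

```python
from html import escape, unescape


def as_html_text(text):
    """Escape the characters that would otherwise be read as markup."""
    # quote=False leaves quotes and apostrophes untouched, which is
    # sufficient for text nodes (as opposed to attribute values).
    return escape(text, quote=False)


for sample in ["100 < 500", ">>> print('Hello, world!')"]:
    escaped = as_html_text(sample)
    # Unescaping restores exactly the text a reader would see.
    print(escaped, "->", unescape(escaped))
```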
Screenshot doesn’t get scaled¶
The following snippet shows that #6786 (The screenshot in a comment about #6660 doesn’t get scaled) is fixed after 20260511.
>>> print(tc('<img src="1.png" height="auto" width="70%">'))
<img height="auto" src="1.png" width="70%"/>
If an <img> tag specifies both height and width as attributes,
and if both values are either “auto” or a relative size (ending with “%”), then Lino
doesn’t add a style attribute.
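The rule just described can be sketched as a small function over the attributes of an <img> tag. The names is_flexible and preview_img_attrs are hypothetical, and the style string is simply copied from the examples on this page. The actual implementation in lino.utils.soup may be organized differently.

```python
# Style string as shown in the examples on this page.
PREVIEW_STYLE = "float:right;width:auto;height:8em"


def is_flexible(value):
    """True for "auto" or a relative size such as "70%"."""
    return value is not None and (value == "auto" or value.endswith("%"))


def preview_img_attrs(attrs):
    """Return a copy of attrs, adding the preview style unless both
    height and width are already flexible."""
    result = dict(attrs)
    if is_flexible(attrs.get("height")) and is_flexible(attrs.get("width")):
        return result
    result["style"] = PREVIEW_STYLE
    return result


print(preview_img_attrs({"src": "1.png", "height": "auto", "width": "70%"}))
print(preview_img_attrs({"src": "foo", "alt": "bar"}))
```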
Some edge cases¶
>>> print(tc("<cool>"))
<span></span>
>>> print(tc("<p></p>"))
>>> print(tc(""))
>>> print(tc(" "))
A surprising result:
>>> print(tc("<<<cool>>>"))
<<<span>>></span>
Exploring BeautifulSoup¶
>>> from bs4 import BeautifulSoup
>>> def walk(ch, indent=0):
...     prefix = " " * indent
...     if hasattr(ch, 'tag'):
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent+2)
...     else:
...         print(prefix + str(type(ch)) + " " + repr(ch.string))
...         # print(prefix+repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
Comment with a <base> tag caused Jane to break¶
The following snippet shows that #5039 is fixed after 20250606. HTML tags that are escaped in the source text must remain escaped in the result.