Welcome | Get started | Dive into Lino | Contribute | Topics | Reference | More

Truncating HTML texts

This document is about the truncate_comment function, the purpose of which is to summarize a HTML text into a single paragraph.

The function was reimplemented in July 2023, triggered by #5039 (Comment with a <base> tag caused Jane to break).

Both the old and the new implementation use BeautifulSoup.

This page is a tested document and the following instructions are used for initialization:

>>> from lino import startup
>>> startup('lino_book.projects.noi1e.settings.demo')
>>> from lino.api.doctest import *
>>> from lino.modlib.memo.mixins import truncate_comment as tc

Examples

>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc('<h1 style="color: #5e9ca0;">Styled comment <span style="color: #2b2301;">pasted from word!</span> </h1>'))
... 
Styled comment pasted from word!
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> dd.plugins.memo.short_preview_image_height
'8em'
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph

first

second
>>> settings.SITE.plugins.memo.short_preview_length
300
>>> html = '<p>Ich habe Hirn, ich will hier raus! &ndash; Wie im Netz der Flachsinn regiert.</p>\n<ul>\n<li>Ver&ouml;ffentlicht:&nbsp;6. Mai 2017</li>\n<li>Vorgestellt in:&nbsp;<a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>,&nbsp;<a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job &amp; Karriere</span></a>,&nbsp;<a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing &amp; Verkauf</span></a>,&nbsp;<a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie &amp; Internet</span></a>,&nbsp;<a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lekt&uuml;re</span></a></li>\n</ul>\n<ul>\n<li><span><span>Gef&auml;llt mir</span></span><span>Ich habe Hirn, ich will hier raus! &ndash; Wie im Netz der Flachsinn regiert</span>\n<p>&nbsp;</p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck"><span>806</span></a></li>\n<li><span>Kommentar</span>\n<p>&nbsp;</p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck#comments"><span>42</span></a></li>\n<li><span>Teilen</span><span>Ich habe Hirn, ich will hier raus! &ndash; Wie im Netz der Flachsinn regiert teilen</span>\n<p>&nbsp;</p>\n<span>131</span></li>\n</ul>\n<p><a href="https://www.linkedin.com/in/gunterdueck"><span>Gunter Dueck</span></a> <span>Folgen</span><span>Gunter Dueck</span> Philosopher, Writer, Keynote Speaker</p>\n<p>Das Smartphone vibriert, klingelt oder surrt. Zing! Das ist der Messenger. Eine Melodie von eBay zeigt an, dass eine Auktion in den n&auml;chsten Minuten endet. Freunde schicken Fotos, News versprechen uns "Drei Minuten, nach denen du bestimmt lange weinen musst" oder "Wenn du dieses Bild siehst, wird sich dein Leben auf der Stelle f&uuml;r immer ver&auml;ndern".</p>\n<p>Politiker betreiben statt ihrer eigentlichen Arbeit nun simples Selbstmarketing und fordern uns auf, mal schnell unser Verhalten zu &auml;ndern &ndash; am besten nat&uuml;rlich "langfristig" und "nachhaltig". Manager fordern harsch immer mehr Extrameilen von uns ein, die alle ihre (!) Probleme beseitigen, und es gibt f&uuml;r jede Schieflage in unserem Leben Rat von allerlei Coaches und Therapeuten, es gibt Heilslehren und Globuli.</p>'
>>> print(tc(html))  
Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert.



Veröffentlicht: 6. Mai 2017


Vorgestellt in: <a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job &amp; Karriere</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing &amp; Verkauf</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie &amp; Internet</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lektüre</span></a>




Gefällt mir Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachs...
>>> print(tc('Some plain text.'))
Some plain text.
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.


Here is the second paragraph.

BeautifulSoup

>>> def walk(ch, indent=0):
...    prefix = " " * indent
...    if hasattr(ch, 'tag'):
...      print(prefix + str(type(ch)) + " " + ch.name + ":")
...      for c in ch.children:
...        walk(c, indent+2)
...    else:
...      print(prefix + str(type(ch)) + " " + repr(ch.string))
...      # print(prefix+repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
... 
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
... 
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'

Truncation

>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet

cons...
>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'

Multiple paragraphs are summarized:

>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa.\n\nbbbb.\n\ncccc.\n\nd...'
>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'

Sanitizing

The truncate_comment function also does basic sanitizing.

>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
... 
foo

baz

Let's try to truncate a whole HTML page:

>>> html_str = """
... <!doctype html><html lang="en">
... <head><title>Bad Request (400)</title></head>
... <body>
... <h1>Bad Request (400)</h1>
... <p></p>
... </body>
... </html>"""
>>> print(tc(html_str)) 
html
<title>Bad Request (400)</title>

Bad Request (400)






TODO

TODO: the following snippet is skipped because #5039 is not yet fixed. HTML tags that are escaped in the source text must remain escaped in the result.

>>> print(tc("""<p>foo &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;&lt;body&gt;</p><p>baz</p>"""))
... 
foo &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;&lt;body&gt;

baz