Truncating HTML texts
This document is about the lino.modlib.memo.truncate_comment() function, which summarizes an HTML text as a single paragraph.
The function was reimplemented in July 2023, triggered by #5039 (Comment with a <base> tag caused Jane to break).
Both the old and the new implementation use BeautifulSoup.
This is a tested document. The following instructions are used for initialization:
>>> from lino import startup
>>> startup('lino_book.projects.noi1e.settings.demo')
>>> from lino.api.doctest import *
>>> from lino.modlib.memo.mixins import truncate_comment as tc
Examples
>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc('<h1 style="color: #5e9ca0;">Styled comment <span style="color: #2b2301;">pasted from word!</span> </h1>'))
...
Styled comment pasted from word!
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo"/>
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph
first
second
>>> settings.SITE.plugins.memo.short_preview_length
300
>>> html = '<p>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert.</p>\n<ul>\n<li>Veröffentlicht: 6. Mai 2017</li>\n<li>Vorgestellt in: <a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job & Karriere</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing & Verkauf</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie & Internet</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lektüre</span></a></li>\n</ul>\n<ul>\n<li><span><span>Gefällt mir</span></span><span>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert</span>\n<p> </p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck"><span>806</span></a></li>\n<li><span>Kommentar</span>\n<p> </p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck#comments"><span>42</span></a></li>\n<li><span>Teilen</span><span>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert teilen</span>\n<p> </p>\n<span>131</span></li>\n</ul>\n<p><a href="https://www.linkedin.com/in/gunterdueck"><span>Gunter Dueck</span></a> <span>Folgen</span><span>Gunter Dueck</span> Philosopher, Writer, Keynote Speaker</p>\n<p>Das Smartphone vibriert, klingelt oder surrt. Zing! Das ist der Messenger. Eine Melodie von eBay zeigt an, dass eine Auktion in den nächsten Minuten endet. 
Freunde schicken Fotos, News versprechen uns "Drei Minuten, nach denen du bestimmt lange weinen musst" oder "Wenn du dieses Bild siehst, wird sich dein Leben auf der Stelle für immer verändern".</p>\n<p>Politiker betreiben statt ihrer eigentlichen Arbeit nun simples Selbstmarketing und fordern uns auf, mal schnell unser Verhalten zu ändern – am besten natürlich "langfristig" und "nachhaltig". Manager fordern harsch immer mehr Extrameilen von uns ein, die alle ihre (!) Probleme beseitigen, und es gibt für jede Schieflage in unserem Leben Rat von allerlei Coaches und Therapeuten, es gibt Heilslehren und Globuli.</p>'
>>> print(tc(html))
Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert.
Veröffentlicht: 6. Mai 2017
Vorgestellt in: <a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job & Karriere</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing & Verkauf</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie & Internet</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lektüre</span></a>
Gefällt mir Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachs...
>>> print(tc('Some plain text.'))
Some plain text.
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.
Here is the second paragraph.
Note that truncate_comment() does not sanitize its input. It does not try to remove dangerous HTML, because that must also be done for non-truncated HTML and is the job of bleach:
>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
foo <html><head><base href="bar" target="_blank"/></head><body></body></html>
baz
BeautifulSoup
>>> def walk(ch, indent=0):
...     prefix = " " * indent
...     if hasattr(ch, 'tag'):
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent+2)
...     else:
...         print(prefix + str(type(ch)) + " " + repr(ch.string))
...         # print(prefix+repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
Truncation
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet
cons...
>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'
Multiple paragraphs are summarized:
>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa.\n\nbbbb.\n\ncccc.\n\nd...'
>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'
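The examples above suggest that only text characters count towards the limit, and that tags still open at the cut-off point are closed after the ellipsis. The following is a minimal sketch of that idea, not the Lino implementation: it uses the standard library's html.parser instead of BeautifulSoup, and the names Truncator and truncate_html are hypothetical. As a simplification it drops <p> and <div> wrappers without inserting paragraph separators, so multi-paragraph results differ from those of truncate_comment().

```python
from html import escape
from html.parser import HTMLParser


class Truncator(HTMLParser):
    """Copy markup to the output until max_length text characters are used up."""

    SKIPPED = {"p", "div"}              # block wrappers dropped by this sketch
    VOID = {"img", "br", "hr", "base"}  # elements that never get a closing tag

    def __init__(self, max_length):
        super().__init__(convert_charrefs=True)
        self.remaining = max_length  # text characters we may still emit
        self.open_tags = []          # emitted tags that still need closing
        self.chunks = []             # pieces of the result
        self.done = False            # True once the text has been cut

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.SKIPPED:
            return
        self.chunks.append(self.get_starttag_text())
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only close a tag that is the innermost one we opened ourselves.
        if self.done or not self.open_tags or self.open_tags[-1] != tag:
            return
        self.open_tags.pop()
        self.chunks.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.done:
            return
        if len(data) <= self.remaining:
            self.remaining -= len(data)
            self.chunks.append(escape(data, quote=False))
            return
        # Cut the text, add the ellipsis, then close open tags innermost first.
        self.chunks.append(escape(data[:self.remaining], quote=False) + "...")
        for tag in reversed(self.open_tags):
            self.chunks.append("</{}>".format(tag))
        self.open_tags = []
        self.done = True


def truncate_html(html, max_length=300):
    """Hypothetical helper: return html cut after max_length text characters."""
    parser = Truncator(max_length)
    parser.feed(html)
    parser.close()  # flush text that is still buffered
    return "".join(parser.chunks)


print(truncate_html("<p>A <b>bold</b> and <i>italic</i> thing.", 14))
```

Text is re-escaped on output because the parser hands over decoded character references, so an `&amp;` in the source stays `&amp;` in the result.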
TODO
TODO: the following snippet is skipped because #5039 is not yet fixed. HTML tags that are escaped in the source text must remain escaped in the result.
>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
...
foo <html><head><base href="bar" target="_blank"></head><body>
baz
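To illustrate why escaped tags must remain escaped, here is a small sketch using only the standard library (TagCollector and tags_and_text are hypothetical names, and the sketch makes no claim about how truncate_comment() works internally). A parser hands over escaped markup as decoded text. If that text is written back without escaping it again, it turns into a real tag.

```python
from html import escape
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the start tags and the decoded text found in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tags_and_text(source):
    """Hypothetical helper: return (start tags, decoded text) of source."""
    parser = TagCollector()
    parser.feed(source)
    parser.close()
    return parser.tags, "".join(parser.text)


source = '<p>foo &lt;base href="bar"&gt;</p>'
tags, text = tags_and_text(source)
print(tags)  # the escaped <base> is not a tag here, only <p> is
print(text)  # the text comes back decoded

naive = "<p>" + text + "</p>"                      # decoded text written as-is
safe = "<p>" + escape(text, quote=False) + "</p>"  # text escaped again

print(tags_and_text(naive)[0])  # now contains a real <base> tag
print(tags_and_text(safe)[0])   # still only <p>
```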