Truncating HTML texts
This document describes the truncate_comment function, which summarizes
an HTML text into a single paragraph.
The function was reimplemented in July 2023, triggered by #5039
(Comment with a <base> tag caused Jane to break).
Both the old and the new implementation use BeautifulSoup.
Side note: Code snippets (lines starting with >>>) in this document get
tested as part of our development workflow. The following
initialization snippet tells you which demo project is being used in
this document.
>>> from lino import startup
>>> startup('lino_book.projects.noi1e.settings.demo')
>>> from lino.api.doctest import *
>>> from lino.modlib.memo.mixins import truncate_comment as tc
Examples
>>> bold_and_italic = "<p>A <b>bold</b> and <i>italic</i> thing."
>>> lorem_ipsum = '<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>'
>>> print(tc('<h1 style="color: #5e9ca0;">Styled comment <span style="color: #2b2301;">pasted from word!</span> </h1>'))
...
Styled comment pasted from word!
>>> print(tc('<img src="foo" alt="bar"/></p>'))
<img alt="bar" src="foo" style="float:right;height:8em"/>
>>> dd.plugins.memo.short_preview_image_height
'8em'
>>> print(tc('<p>A short paragraph</p><p><ul><li>first</li><li>second</li></ul></p>'))
A short paragraph
first
second
>>> settings.SITE.plugins.memo.short_preview_length
300
>>> html = '<p>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert.</p>\n<ul>\n<li>Veröffentlicht: 6. Mai 2017</li>\n<li>Vorgestellt in: <a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job & Karriere</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing & Verkauf</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie & Internet</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lektüre</span></a></li>\n</ul>\n<ul>\n<li><span><span>Gefällt mir</span></span><span>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert</span>\n<p> </p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck"><span>806</span></a></li>\n<li><span>Kommentar</span>\n<p> </p>\n<a href="https://www.linkedin.com/pulse/ich-habe-hirn-hier-raus-wie-im-netz-der-flachsinn-regiert-dueck#comments"><span>42</span></a></li>\n<li><span>Teilen</span><span>Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert teilen</span>\n<p> </p>\n<span>131</span></li>\n</ul>\n<p><a href="https://www.linkedin.com/in/gunterdueck"><span>Gunter Dueck</span></a> <span>Folgen</span><span>Gunter Dueck</span> Philosopher, Writer, Keynote Speaker</p>\n<p>Das Smartphone vibriert, klingelt oder surrt. Zing! Das ist der Messenger. Eine Melodie von eBay zeigt an, dass eine Auktion in den nächsten Minuten endet. 
Freunde schicken Fotos, News versprechen uns "Drei Minuten, nach denen du bestimmt lange weinen musst" oder "Wenn du dieses Bild siehst, wird sich dein Leben auf der Stelle für immer verändern".</p>\n<p>Politiker betreiben statt ihrer eigentlichen Arbeit nun simples Selbstmarketing und fordern uns auf, mal schnell unser Verhalten zu ändern – am besten natürlich "langfristig" und "nachhaltig". Manager fordern harsch immer mehr Extrameilen von uns ein, die alle ihre (!) Probleme beseitigen, und es gibt für jede Schieflage in unserem Leben Rat von allerlei Coaches und Therapeuten, es gibt Heilslehren und Globuli.</p>'
>>> print(tc(html))
Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachsinn regiert.
Veröffentlicht: 6. Mai 2017
Vorgestellt in: <a href="https://www.linkedin.com/pulse/feed/channel/deutsch"><span>Favoriten der Redaktion</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/jobs"><span>Job & Karriere</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/verkauf"><span>Marketing & Verkauf</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/technologie"><span>Technologie & Internet</span></a>, <a href="https://www.linkedin.com/pulse/feed/channel/wochenendLekture"><span>Wochenend-Lektüre</span></a>
Gefällt mir Ich habe Hirn, ich will hier raus! – Wie im Netz der Flachs...
>>> print(tc('Some plain text.'))
Some plain text.
>>> print(tc('Two paragraphs of plain text.\n\n\nHere is the second paragraph.'))
Two paragraphs of plain text.
Here is the second paragraph.
BeautifulSoup
>>> def walk(ch, indent=0):
...     prefix = " " * indent
...     if hasattr(ch, 'tag'):
...         print(prefix + str(type(ch)) + " " + ch.name + ":")
...         for c in ch.children:
...             walk(c, indent+2)
...     else:
...         print(prefix + str(type(ch)) + " " + repr(ch.string))
...         # print(prefix+repr(ch.string))
>>> soup = BeautifulSoup(bold_and_italic, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'A '
    <class 'bs4.element.Tag'> b:
      <class 'bs4.element.NavigableString'> 'bold'
    <class 'bs4.element.NavigableString'> ' and '
    <class 'bs4.element.Tag'> i:
      <class 'bs4.element.NavigableString'> 'italic'
    <class 'bs4.element.NavigableString'> ' thing.'
>>> soup = BeautifulSoup(lorem_ipsum, "html.parser")
>>> walk(soup)
...
<class 'bs4.BeautifulSoup'> [document]:
  <class 'bs4.element.Tag'> p:
    <class 'bs4.element.NavigableString'> 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
Truncation
>>> print(tc(bold_and_italic))
A <b>bold</b> and <i>italic</i> thing.
>>> print(tc(bold_and_italic, 5))
A <b>bol...</b>
>>> print(tc(bold_and_italic, 14))
A <b>bold</b> and <i>ita...</i>
>>> print(tc(lorem_ipsum, 30))
Lorem ipsum dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 30))
Lorem <b>ipsum</b> dolor sit amet, co...
>>> print(tc('<p>Lorem <b>ipsum</b> dolor sit amet, consectetur adipiscing elit.</p>', 10))
Lorem <b>ipsu...</b>
>>> print(tc('<p>Lorem ipsum dolor sit amet</p><p>consectetur adipiscing elit.</p>', 30))
Lorem ipsum dolor sit amet
cons...
>>> tc("<p>A plain paragraph with more than 20 characters.</p>", 20)
'A plain paragraph wi...'
Multiple paragraphs are summarized:
>>> tc("<p>aaaa.</p><p>bbbb.</p><p>cccc.</p><p>dddd.</p><p>eeee.</p>", 20)
'aaaa.\n\nbbbb.\n\ncccc.\n\nd...'
>>> tc("<div>{}</div>".format(lorem_ipsum), 20)
'Lorem ipsum dolor si...'
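The character counting in the examples above (only text counts, tags are free, and every tag that is still open gets closed after the ellipsis) can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not the implementation behind truncate_comment, which uses BeautifulSoup and additionally removes paragraph tags. The names truncate_html, TruncatingParser and VOID_TAGS are hypothetical and exist only for this illustration.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "base"}


class TruncatingParser(HTMLParser):
    """Copy markup to a buffer until `max_chars` text characters are used up."""

    def __init__(self, max_chars):
        super().__init__(convert_charrefs=True)
        self.remaining = max_chars
        self.open_tags = []  # names of the tags that are currently open
        self.out = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs)
        self.out.append("<{}{}>".format(tag, attr_text))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only well-nested closing tags are copied; stray ones are ignored.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.done:
            return
        if len(data) > self.remaining:
            # Cut the text, add the ellipsis and ignore everything after it.
            self.out.append(escape(data[:self.remaining], quote=False) + "...")
            self.done = True
        else:
            self.out.append(escape(data, quote=False))
            self.remaining -= len(data)


def truncate_html(html, max_chars):
    parser = TruncatingParser(max_chars)
    parser.feed(html)
    parser.close()  # flush text that is still buffered
    # Close whatever is still open, innermost first.
    closing = "".join("</{}>".format(t) for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("A <b>bold</b> and <i>italic</i> thing.", 5))
print(truncate_html("A <b>bold</b> and <i>italic</i> thing.", 14))
```

This prints `A <b>bol...</b>` and `A <b>bold</b> and <i>ita...</i>`, the same results as the corresponding tc calls above. Unlike tc, the sketch keeps a surrounding <p> tag and does not merge several paragraphs.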
Sanitizing
The truncate_comment function also does basic sanitizing.
>>> print(tc("""<p>foo <html><head><base href="bar" target="_blank"></head><body></p><p>baz</p>"""))
...
foo
baz
Let’s try to truncate a whole HTML page:
>>> html_str = """
... <!doctype html><html lang="en">
... <head><title>Bad Request (400)</title></head>
... <body>
... <h1>Bad Request (400)</h1>
... <p></p>
... </body>
... </html>"""
>>> print(tc(html_str))
html
<title>Bad Request (400)</title>
Bad Request (400)
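To illustrate the idea behind the first sanitizing example (structural tags like <html>, <head>, <base> and <body> disappear while the text survives), here is a minimal allowlist-based sketch using only the standard library. It is an assumption about the general technique, not the code used by truncate_comment. The names sanitize, Sanitizer, INLINE_ALLOWED and BLOCK_TAGS are hypothetical, and the sketch is not a complete defence against malicious markup (it neither checks link targets nor balances stray closing tags).

```python
from html import escape
from html.parser import HTMLParser

INLINE_ALLOWED = {"a", "b", "i", "em", "strong"}  # hypothetical allowlist
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}  # rendered as line breaks


class Sanitizer(HTMLParser):
    """Keep text and a few inline tags; drop every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_ALLOWED:
            return  # drops <html>, <head>, <base>, <body> and the like
        # Keep only the href of links; all other attributes are dropped.
        href = dict(attrs).get("href") if tag == "a" else None
        if href is None:
            self.out.append("<{}>".format(tag))
        else:
            self.out.append('<a href="{}">'.format(escape(href, quote=True)))

    def handle_endtag(self, tag):
        if tag in INLINE_ALLOWED:
            self.out.append("</{}>".format(tag))
        elif tag in BLOCK_TAGS:
            self.out.append("\n")  # the end of a block starts a new line

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()  # flush text that is still buffered
    lines = (line.strip() for line in "".join(parser.out).splitlines())
    return "\n".join(line for line in lines if line)


print(sanitize('<p>foo <html><head><base href="bar" target="_blank">'
               '</head><body></p><p>baz</p>'))
```

The call prints `foo` and `baz` on two lines, like the first example in this section. An allowlist is used here because it fails safe: a tag nobody thought of is dropped rather than passed through.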
TODO
TODO: the following snippet is skipped because #5039 is not yet fixed. HTML tags that are escaped in the source text must remain escaped in the result.
>>> print(tc("""<p>foo &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;&lt;body&gt;</p><p>baz</p>"""))
...
foo &lt;html&gt;&lt;head&gt;&lt;base href="bar" target="_blank"&gt;&lt;/head&gt;&lt;body&gt;
baz