{html2text:html}

Description

Converts an HTML fragment to plain text, the way AA builds the text alternative of an HTML e-mail. It decodes HTML entities (nbsp becomes an ordinary space), turns each link into its visible text followed by the URL in curly braces, maps block tags to line breaks (br, closing p, closing h1-h9, closing tr), starts a bulleted line at each li, draws hr as a row of dashes, removes script, style and head blocks entirely, then strips every remaining tag and tidies whitespace.

One important trap: the argument is read only up to the first unescaped colon, so any colon in the HTML (in a URL, a style value or a time) must be written as #: to survive; otherwise the text is silently cut at the colon. The result is never cached and the input is not trimmed.
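
The conversion steps described above can be approximated in a few lines of Python. This is only a rough sketch, not AA's actual implementation: the function name html2text_sketch, the regular expressions, the order of the steps and the 20-character width of the hr line are all assumptions made for illustration. It also does not model the colon handling, which happens before the HTML reaches the converter.

```python
import html
import re


def html2text_sketch(fragment: str) -> str:
    """Hypothetical approximation of the described steps (not AA's code)."""
    # Decode entities first, so &lt;clip&gt; becomes a tag that is stripped later.
    text = html.unescape(fragment)
    # nbsp, ensp, emsp and thinsp all become an ordinary space.
    text = re.sub("[\u00a0\u2002\u2003\u2009]", " ", text)
    # Remove script, style and head blocks, content and all.
    text = re.sub(r"(?is)<(script|style|head)\b.*?</\1\s*>", "", text)

    # Links: visible text, then the URL in curly braces
    # (only the URL when the text is identical to the href).
    def link(match: re.Match) -> str:
        url, label = match.group(1), match.group(2)
        return url if label == url else f"{label} {{{url}}}"

    text = re.sub(r'(?is)<a\b[^>]*?href="([^"]*)"[^>]*>(.*?)</a\s*>', link, text)
    # Block tags become line breaks.
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    text = re.sub(r"(?i)</(?:p|h[1-9])\s*>", "\n\n", text)
    text = re.sub(r"(?i)</tr\s*>", "\n", text)
    # Each li opens a bulleted line; hr is drawn as dashes (width assumed).
    text = re.sub(r"(?i)<li\b[^>]*>", "\n * ", text)
    text = re.sub(r"(?i)<hr\b[^>]*>", "\n" + "-" * 20 + "\n", text)
    # Strip every remaining tag, then tidy whitespace.
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


print(html2text_sketch('<a href="https://apc.org">Our site</a>'))
print(html2text_sketch("<ul><li>Milk<li>Bread</ul>"))
```

The sketch leaves leading and trailing whitespace in place, in line with the note that the input is not trimmed; whether AA trims the output in every case is not stated here, so treat that detail as a guess.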

Parameters

html required

The HTML fragment to convert to plain text. Tags become line breaks or are stripped, entities are decoded, and each link is rewritten as its text plus the URL in curly braces. The argument is read only up to the first unescaped colon, so write every literal colon as #: (for example, the start of a URL as https#://) to let the whole fragment reach the converter.
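
When the HTML is generated by a program, the colons can be escaped before the fragment is placed in the alias. The helper below is a minimal sketch; escape_colons is a hypothetical name, and it assumes that #: is the only escaping the argument needs.

```python
import re


def escape_colons(fragment: str) -> str:
    """Write every colon as '#:' unless it is already escaped (hypothetical helper)."""
    # The lookbehind skips colons that already follow a '#'.
    return re.sub(r"(?<!#):", "#:", fragment)


html_fragment = '<a href="https://apc.org">Our site</a>'
print("{html2text:" + escape_colons(html_fragment) + "}")
```

Because of the lookbehind, running the helper twice on the same text does not escape it a second time.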

Examples

virtual: {html2text:<h1>News</h1><p>First.</p><p>Second.</p>}
Expected: (News, blank line, First., blank line, Second.)
Actual: News First. Second.
A closing heading or paragraph becomes a blank-line break (br becomes a single break). The output spans several lines, so it is shown for illustration rather than asserted as a single-line test.
test: {html2text:Tom &amp; Jerry &lt;clip&gt;}
Expected: Tom & Jerry
Actual: Tom & Jerry
Named and numeric HTML entities are decoded: amp becomes the ampersand, and lt and gt become angle brackets. The resulting <clip> is then treated as a tag and removed.
test: {html2text:Closing time 17#:00 sharp}
Expected: Closing time 17:00 sharp
Actual: Closing time 17:00 sharp
Writing the colon as #: lets the whole string reach the converter.
test: {html2text:<a href="https#://apc.org">Our site</a>}
Expected: Our site {https://apc.org}
Actual: Our site {https://apc.org}
A link is rewritten as its visible text, then the URL in curly braces. The colon in the URL is written as #: so it is not cut at the first colon.
test: {html2text:<a href="https#://apc.org">https#://apc.org</a>}
Expected: https://apc.org
Actual: https://apc.org
When the link text is identical to the href, only the URL is kept; it is not repeated in curly braces.
virtual: {html2text:<ul><li>Milk<li>Bread</ul>}
Expected: (newline, then " * Milk", newline, then " * Bread")
Actual: * Milk * Bread
Each li opens a new bulleted line (" * "). The output spans several lines, so it is shown for illustration rather than asserted as a single-line test.
test: {html2text:price&nbsp;100}
Expected: price 100
Actual: price 100
The space entities nbsp, ensp, emsp and thinsp are each turned into an ordinary space.
test: {html2text:Plain text stays as is}
Expected: Plain text stays as is
Actual: Plain text stays as is
Text with no HTML markup passes through unchanged.
test: {html2text:before<script>alert(1)</script>after}
Expected: beforeafter
Actual: beforeafter
Whole script, style and head blocks are removed, content and all.
test: {html2text:<strong>Bold</strong> and <em>italic</em> words}
Expected: Bold and italic words
Actual: Bold and italic words
Inline formatting tags are removed; their text content remains.
test: {html2text:Closing time 17:00 sharp}
Expected: Closing time 17
Actual: Closing time 17
The argument is read only up to the first unescaped colon, so everything from the colon on is dropped. Compare with the earlier example, where the colon is written as #: and the whole string survives.
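
The difference between the escaped and unescaped forms can be modelled with a small sketch of how the argument is read. The function read_argument is hypothetical and only mirrors the behaviour described on this page: cut at the first colon not preceded by #, then turn each #: back into a plain colon. It is not taken from AA's parser.

```python
import re


def read_argument(raw: str) -> str:
    """Hypothetical model of reading the argument up to the first unescaped colon."""
    # Split once at the first ':' that is not preceded by '#'.
    head = re.split(r"(?<!#):", raw, maxsplit=1)[0]
    # Escaped colons reach the converter as literal colons.
    return head.replace("#:", ":")


print(read_argument("Closing time 17:00 sharp"))
print(read_argument("Closing time 17#:00 sharp"))
```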