Converting XHTML To Text-Only Version Using ColdFusion And XSLT

Posted May 20, 2009 at 7:13 PM by Ben Nadel

Tags: ColdFusion

The other day, I was having a discussion about sending emails using ColdFusion. At one point, the conversation turned to email format. To me, in this day and age, it seems silly to even worry about text-only versions of emails. I mean really - are there even any clients anymore that can't handle HTML formatting? I think even BlackBerrys can handle HTML formatted emails. As such, I generally have no problem building apps that only send out HTML versions.

But, I did think it would be a fun exercise to come up with a way to take XHTML content for emails and automatically convert it into a text-only version. I really love writing and working with XML and it just seemed that XML Transformations using XSLT would be the right tool for the job. The following demo is what I came up with after a little bit of trial and error. I'm no XSLT expert (far from it), so it's not perfect. But, considering that this is automatically created, "just in case" content, I think it's pretty good:

<!--- Save HTML content. --->
<cfsavecontent variable="strHTML">

    <h1>
        Thank you for your purchase!
    </h1>

    <p>
        Invoice number: <strong>12345</strong><br />
        Price: <strong>$19.95</strong>
    </p>

    <hr />

    <h2>
        Purchased Products
    </h2>

    <table cellspacing="5" border="1">
    <tr>
        <td>
            Muscle Girls Gone Wild
        </td>
        <td>
            $10.95
        </td>
    </tr>
    <tr>
        <td>
            Female Muscle - The Definitive Guide
        </td>
        <td>
            $9.00
        </td>
    </tr>
    </table>

    <hr />

    <p>
        If you have any questions about your order please
        contact us at
        <a href="mailto:orders@amazon.com">orders@amazon.com</a>.
    </p>

</cfsavecontent>


<!--- Define the XSLT --->
<cfsavecontent variable="strXSLT">

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:transform
        version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <!--- Store variable for new line. --->
        <xsl:variable
            name="new-line"
            select="'&#10;'"
            />

        <!--- Store variable for double-new line. --->
        <xsl:variable
            name="new-lines"
            select="concat( $new-line, $new-line )"
            />


        <!---
            Match the root node plus any nodes that are not
            matched specifically by the templates defined
            below.
        --->
        <xsl:template match="*">
            <xsl:apply-templates select="text()|*" />
        </xsl:template>

        <!--- For all text nodes, output trimmed value. --->
        <xsl:template match="text()">
            <xsl:value-of select="normalize-space( . )" />
        </xsl:template>

        <!--- Denote primary header with hrule. --->
        <xsl:template match="h1">
            <xsl:apply-templates select="text()|*" />
            <xsl:value-of select="$new-line" />
            <xsl:text>---------------------------------</xsl:text>
            <xsl:value-of select="$new-lines" />
        </xsl:template>

        <!--- Denote secondary headers with hash marks. --->
        <xsl:template match="h2|h3|h4|h5">
            <xsl:text>## </xsl:text>
            <xsl:apply-templates select="text()|*" />
            <xsl:value-of select="$new-lines" />
        </xsl:template>

        <!--- Turn block level elements into text-only. --->
        <xsl:template match="p|blockquote|li">
            <xsl:apply-templates select="text()|*" />
            <xsl:value-of select="$new-lines" />
        </xsl:template>

        <!--- Add new line after table. --->
        <xsl:template match="table">
            <xsl:apply-templates select="*" />
            <xsl:value-of select="$new-line" />
        </xsl:template>

        <!--- Turn table rows into bracketed values. --->
        <xsl:template match="tr">
            <xsl:apply-templates select="*" />
            <xsl:value-of select="$new-line" />
        </xsl:template>

        <!--- Bracket table values. --->
        <xsl:template match="td">
            <xsl:value-of select="'[ '" />
            <xsl:apply-templates select="text()|*" />
            <xsl:value-of select="' ]'" />
        </xsl:template>

        <!---
            Strip out any inline tags (and start them off with
            an initial space so that nested and sibling tags don't
            get concatenated text).
        --->
        <xsl:template match="strong|em|span|a">
            <xsl:text> </xsl:text>
            <xsl:value-of select="text()" />
        </xsl:template>

        <!---
            Replace hrule with a manual dotted line.
            NOTE: template also named for manual execution.
        --->
        <xsl:template match="hr" name="hr">
            <xsl:text>. . . . . . . . . . . . . . . . .</xsl:text>
            <xsl:value-of select="$new-lines" />
        </xsl:template>

        <!--- Replace break tag with new line. --->
        <xsl:template match="br">
            <xsl:value-of select="$new-line" />
        </xsl:template>

    </xsl:transform>

</cfsavecontent>


<!---
    Convert the HTML to text-only. As we are doing this,
    we need to wrap the HTML in a root node so that the XML
    document we parse is well formed.
--->
<cfset strTextOnly = XmlTransform(
    ("<data>" & strHTML & "</data>"),
    Trim( strXSLT )
    ) />

<!---
    Strip out the XML declaration that the transform adds
    to the top of the output (the first tag-like match only).
--->
<cfset strTextOnly = Trim(
    REReplace(
        strTextOnly,
        "<[^>]*>",
        "",
        "one"
        )
    ) />


<!--- Output the text-only version. --->
<cfset WriteOutput( strTextOnly ) />

As you can see, the HTML would need to be stored in some sort of content buffer and it would have to be XHTML compliant such that it could be parsed using XmlParse(). My HTML content doesn't happen to have any special characters (ex: ampersand); but, if it did, I assume they would have to be escaped prior to XML parsing. Once the XHTML is parsed, I then use ColdFusion's XmlTransform() and the given XSLT document to create the following output (copied from rendered page source):

Thank you for your purchase!
---------------------------------

Invoice number: 12345
Price: $19.95

. . . . . . . . . . . . . . . . .

## Purchased Products

[ Muscle Girls Gone Wild ][ $10.95 ]
[ Female Muscle - The Definitive Guide ][ $9.00 ]

. . . . . . . . . . . . . . . . .

If you have any questions about your order please contact us at orders@amazon.com.
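
The sample content above has no special characters, so the escaping step mentioned earlier never comes into play. As a rough sketch of what it might look like, the following escapes any bare ampersand before the content is wrapped and parsed. The sample string and the variable names `strSafeHTML` and `xmlDoc` are made up for illustration, and the pattern assumes that only numeric references and the five entities that XML predefines should be left alone:

```cfml
<!--- Hypothetical content with a bare ampersand and a numeric reference. --->
<cfset strHTML = "<p>Shipping & handling &##8212; <strong>$4.95</strong></p>" />

<!---
    Escape every ampersand that does not already start a numeric
    reference (ex: &##8212;) or one of the five entities that XML
    predefines (amp, lt, gt, quot, apos). The doubled pound signs
    are ColdFusion string escapes for a single pound sign.
--->
<cfset strSafeHTML = REReplace(
    strHTML,
    "&(?!(##[0-9]+|##x[0-9a-fA-F]+|amp|lt|gt|quot|apos);)",
    "&amp;",
    "all"
    ) />

<!--- The wrapped content can now be parsed as XML. --->
<cfset xmlDoc = XmlParse( "<data>" & strSafeHTML & "</data>" ) />
```

One trade-off with this pattern: a named HTML entity such as &rsquo; is treated as a bare ampersand, so it would come through as literal text rather than as the character it stands for. Also, under the default XML output method, an ampersand in the content is written back out as &amp; in the transformed result, so the stylesheet would need the text output method (or the result would need one more replace) to get a plain "&" in the email.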

For an automated process, I think that's pretty cool! I'm not sure I would even bother putting this into an application; but, if I needed to, it's nice to see that automatically converting HTML email content into text-only content is a rather straightforward task.
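
The REReplace() call in the demo is only there because the transform, by default, serializes its result as XML and puts an XML declaration at the top. A possible alternative, sketched here under the assumption that XmlTransform() honors the standard xsl:output element, is to ask the stylesheet for plain text directly. The cut-down stylesheet and sample content are hypothetical and only cover enough tags to show the idea:

```cfml
<!--- Hypothetical sample content with an escaped ampersand. --->
<cfset strHTML = "<p>Shipping &amp; handling: <strong>$4.95</strong></p>" />

<cfsavecontent variable="strXSLT">
<xsl:transform
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!--- Ask for plain text: no XML declaration and no escaping. --->
    <xsl:output method="text" />

    <!--- For all text nodes, output trimmed value. --->
    <xsl:template match="text()">
        <xsl:value-of select="normalize-space( . )" />
    </xsl:template>

    <!--- Output inline content with a leading space. --->
    <xsl:template match="strong">
        <xsl:text> </xsl:text>
        <xsl:value-of select="." />
    </xsl:template>

</xsl:transform>
</cfsavecontent>

<!--- No follow-up REReplace() is needed to remove a declaration. --->
<cfset strTextOnly = Trim(
    XmlTransform(
        ("<data>" & strHTML & "</data>"),
        Trim( strXSLT )
        )
    ) />

<cfset WriteOutput( strTextOnly ) />
```

Elements without a template of their own (here, data and p) fall through to XSLT's built-in rules, which simply process their children. With the text output method, characters like "&" should also come out unescaped, which is what a text-only email wants.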



Reader Comments

May 21, 2009 at 11:40 AM
40 Comments

This will be useful for sending email newsletters. Thanks Ben.


May 21, 2009 at 11:49 AM
11,314 Comments

@Brian,

My thoughts exactly.


May 21, 2009 at 1:11 PM
23 Comments

Interesting post. Speaking of which, when you have a free moment, will you be so kind as to answer the email I just sent you?

Thank you kindly.

Have a great day.


May 21, 2009 at 5:32 PM
3 Comments

For a comprehensive XSL template to convert XHTML to text, you can use the screen reader template that comes with XStandard as a good starting point.

Email newsletters - I use XSLT to transform XHTML to plain text for our newsletter module. It works nicely.


May 22, 2009 at 12:17 PM
23 Comments

Hey Ben-

I know you are a way busy guy, but do you think maybe you can answer my question as soon as you get a chance? I don't mean to be a pain but it's really important.

Thanks, and sorry to bother.

Have a happy holiday weekend.


May 25, 2009 at 10:29 AM
11,314 Comments

@Johans,

I didn't know that XStandard came with a screen reader XSLT. Awesome!


May 26, 2009 at 8:34 AM
56 Comments

Ben!

The problem isn't so much if e-mail clients can or can't read HTML-based messages, it's the darn rendering engines running within the email clients or webmail clients that make it a horrendous effort to create equal looking e-mails ;-)

Luckily there's http://www.email-standards.org/ ;-)


May 27, 2009 at 8:45 AM
11,314 Comments

@Sebastiaan,

Email standards is good, but depressing :)


Jun 10, 2009 at 10:23 AM
23 Comments

Hey Ben-

Just wanted to say thanks for answering my questions. Good answer. As always, it was really great talking to you.

Hope to see you again real soon ;)!


Aug 8, 2009 at 9:26 AM
1 Comments

mein gott! you forget that many people prefer to turn off html email rendering, not merely to avoid their bandwidth sucking acquaintances spamming them with twinkling eFun, but so that slimy requests for evil embedded links can't be used by noxious, pestilent spammers to verify their manky catalogues of addresses and domains.

gah.


May 9, 2011 at 5:11 PM
1 Comments

First, Ben I can't tell you how much time you've saved me over the years. Thanks so much!

Your article on Converting XHTML To Text-Only... really solved a problem I had today.

Just one more question on this. My HTML has entities like this: &rsquo; but the SAX xmltransform engine doesn't like this.

I can replace the entities I know about in any given HTML, but is there a more universal solution?


May 9, 2011 at 10:54 PM
11,314 Comments

@Clarke,

I'm glad to know that this has helped. Your question about the & is a most excellent one. In fact, I think I know where the answer is; but I have never looked into it before. On a different post, Eric Stevens said something about creating an XML entity to work with these kinds of values:

Also, &reg; isn't self referential in our code like I said in my last paragraph, it's actually this:
<!ENTITY reg "<sup>&#174;</sup>">

Now, granted, I don't know what that means exactly, but I think it means that you can define the special meaning of "&" in XML to stand for HTML elements.

Let me look into this in the AM - it is something I am very curious about as well. Thanks for reminding me about this.


May 10, 2011 at 9:56 AM
11,314 Comments

@Clarke,

Take a look at this:

http://www.bennadel.com/blog/2191-My-First-Look-At-The-XML-ENTITY-Tag-In-ColdFusion-XML-Documents.htm

I did some light exploration of the XML ENTITY tag for declaring these kinds of named-HTML elements.
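
As a minimal sketch of that ENTITY approach (not a definitive implementation), the named entity can be declared in an internal DTD subset that sits in front of the wrapper node. This assumes the XML parser behind XmlParse() accepts an inline DOCTYPE, which newer or locked-down servers may restrict. The sample content and the variable names `strXML` and `xmlDoc` are made up for illustration:

```cfml
<!--- Hypothetical content that uses a named HTML entity. --->
<cfset strHTML = "<p>It&rsquo;s on its way.</p>" />

<!---
    Declare the entity in an internal DTD subset. The DOCTYPE has
    to come before the root node, so it goes outside of the <data>
    wrapper. The doubled pound sign is a ColdFusion string escape
    for a single pound sign.
--->
<cfset strXML = (
    '<!DOCTYPE data [ <!ENTITY rsquo "&##8217;"> ]>' &
    "<data>" & strHTML & "</data>"
    ) />

<!--- The parser can now resolve &rsquo; instead of raising an error. --->
<cfset xmlDoc = XmlParse( strXML ) />
```

Each additional named entity in the content would need its own ENTITY declaration inside the square brackets.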


May 10, 2011 at 2:06 PM
1 Comments

Thanks for the great tips, Ben.

With your help, the fix I found was to add

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

to the top of the HTML file. I tried this before I wrote my previous comment, but I forgot that the actual HTML content has to be wrapped in <data> tags. I ended up wrapping the DOCTYPE inside the <data> tags which, of course, won't work.

One more thing, I think I found a bug in this code:

<xsl:template match="strong|em|span|a">
<xsl:text> </xsl:text>
<xsl:value-of select="text()" />
</xsl:template>

The 3rd line should be:
<xsl:value-of select="text()|*" />

The |* was missing.


