Ask Ben: Closing XHTML Tags In A Truncated Message

By Ben Nadel

Published 2007-10-01 in Ask Ben, ColdFusion — Comments (10)

Hey Ben, I have an internal developers forum, and when folks reply to messages I quote their original text (which contains html codes). If this text hits a certain threshold I chop the text at that point using this CF code/regex so I don't chop in the middle of a word: [code]

Before I run that bit of code I manually ReReplace (to remove) all of the possible tags folks can enter in a bunch of regex statements. The main problem was that if someone bolds, italicizes, underlines, or even changes a font color, and that tag were left open, it would colorize the rest of the messages in the forum following the originally cut message.

I've always wanted to do this in one regular expression. Perhaps I could back reference all of the previous found tags. Then close them in reverse order of where they were found after the chopping point, but I'm unsure this is possible? What do you think?

I don't think that this can be done with a single regular expression. And, since I have been doing a lot of work with the Java Pattern / Matcher lately, I am sure I am seeing it as a solution in search of a problem, and am therefore trying to fit it into more places than it should go. That being said, I think the Java Pattern / Matcher is going to be the most straightforward way of finding out which tags have been left open.

The idea here is that we are going to loop over the message and push each open tag onto a tag stack. Whenever we find a closing tag, we pop one tag off of that stack, so that once we are done looping, the only tags left in the stack are the ones that were never closed. Self-closing tags can be ignored, as they close themselves and cannot be left open.
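Before the ColdFusion version, here is a minimal JavaScript sketch of the same stack idea. The function names (findOpenTags, closeOpenTags) are hypothetical, and the pattern is an assumption rather than the one used later in this post: it uses a lazy [^>]*? so that the optional self-closing slash can be captured, and it allows digits in tag names. It assumes lowercase, properly nested tags.

```javascript
// Hypothetical helper: returns the names of the tags left open in `html`.
// Assumes well-nested markup and lowercase tag names.
function findOpenTags(html) {
    // Group 1: closing slash, group 2: tag name, group 3: self-closing slash.
    // The lazy [^>]*? leaves a trailing slash available for group 3.
    var tagPattern = /<(\/)?([a-z][a-z0-9]*)[^>]*?(\/)?>/g;
    var openTags = [];
    var match;

    while ((match = tagPattern.exec(html)) !== null) {
        if (match[3]) {
            // Self-closing tag: nothing to track.
            continue;
        }
        if (match[1]) {
            // Closing tag: pop the most recently opened tag.
            openTags.pop();
        } else {
            // Opening tag: push its name onto the stack.
            openTags.push(match[2]);
        }
    }

    return openTags;
}

// Hypothetical helper: appends the missing closing tags in reverse order.
function closeOpenTags(html) {
    var openTags = findOpenTags(html);
    var output = html.trim();

    for (var i = openTags.length - 1; i >= 0; i--) {
        output += "</" + openTags[i] + ">";
    }

    return output;
}
```

Calling closeOpenTags( "<p>a <em>b</em> c <br /> <strong><em>d" ) should leave the inner EM pair and the BR alone and append the closing EM, STRONG, and P tags, in that order.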

To start off, let's simulate a forum message that contains unclosed tags:

<!---
	Save text that contains unclosed HTML tags. In this case,
	we are leaving all three tags (P, STRONG, EM) open.
--->
<cfsavecontent variable="strMessage">

	<p>
		Cassandra, I think this Hoops guys sounds like he's
		really into you. Sure, maybe he <em>lied</em> to you
		about being a basketball player, but he's got that
		goofy charm I just <strong><em>know you love

</cfsavecontent>

Notice here that we are leaving the P, STRONG, and EM tags open. These are the three tags that we hope to collect in our pattern matching and then close at the end. Notice also that we are assuming the only problem is truncation: every unclosed tag is missing its close because the message was cut off, not because someone closed the paragraph tag but left the EM tag inside it open. If that assumption does not hold, the algorithm will still run, but it will not produce valid XHTML.

Ok, let's take a look at the code:

<!---
	Create an array to grab all of the open tags that
	need to be closed.
--->
<cfset arrOpenTags = ArrayNew( 1 ) />

<!---
	Create a pattern to match HTML tags. NOTE: This is not
	a complete HTML tag matching regular expression (it does
	not take into account attribute values with greater-than
	signs), but for our purposes it will do. We are going
	to capture the closing slash and self-closing slash so
	that we can easily tell what kind of tag we have. The
	attribute portion uses a lazy quantifier ([^>]*?) so that
	a trailing self-closing slash is left for the third group
	rather than being consumed as part of the attributes.
--->
<cfset objPattern = CreateObject(
	"java",
	"java.util.regex.Pattern"
	).Compile(
		JavaCast( "string", "<(/)?([a-z]+)[^>]*?(/)?>" )
		)
	/>

<!--- Grab the pattern matcher for our target text. --->
<cfset objMatcher = objPattern.Matcher(
	JavaCast( "string", strMessage )
	) />

<!---
	Now, we want to loop over the message collecting tags. For
	each tag that we encounter, if it's a self-closing tag we
	want to ignore it. If it's an opening tag, we want to add
	it to the stack, and if it's a closing tag, we want to pop
	one tag off of the stack. Assuming valid XHTML, each close
	tag should correspond to the TOP tag on the stack.
--->
<cfloop condition="objMatcher.Find()">

	<!--- Grab the close slash. --->
	<cfset REQUEST.Close = objMatcher.Group(
		JavaCast( "int", 1 )
		) />

	<!--- Grab the tag name. --->
	<cfset REQUEST.Tag = objMatcher.Group(
		JavaCast( "int", 2 )
		) />

	<!--- Grab the self-close slash. --->
	<cfset REQUEST.SelfClose = objMatcher.Group(
		JavaCast( "int", 3 )
		) />


	<!---
		Since the two slashes are optional groups, they might
		not exist. Therefore, we need to check to see if their
		NULLness destroyed the variable in order to check for
		matching.
	--->
	<cfif StructKeyExists( REQUEST, "SelfClose" )>

		<!---
			Self-closing tags close themselves, so we don't
			need to worry about them.
		--->

	<cfelseif StructKeyExists( REQUEST, "Close" )>

		<!---
			This is a closing tag that, given properly
			nested and valid XHTML, should correspond to the
			tag on the top of the stack (bottom of our array).
			Therefore, pop the tag off of the bottom. If the
			stack is empty (a stray closing tag), there is
			nothing to pop.
		--->
		<cfif ArrayLen( arrOpenTags )>
			<cfset ArrayDeleteAt(
				arrOpenTags,
				ArrayLen( arrOpenTags )
				) />
		</cfif>

	<cfelse>

		<!---
			This is an open tag, so push it onto the top of
			the stack (bottom of our array).
		--->
		<cfset ArrayAppend(
			arrOpenTags,
			REQUEST.Tag
			) />

	</cfif>

</cfloop>


<!---
	At this point, we have collected all the unclosed
	tags in our stack. Now, all we have to do is loop over
	the array (backwards) and close the tags in that order.
--->
<cfloop
	index="intTagIndex"
	from="#ArrayLen( arrOpenTags )#"
	to="1"
	step="-1">

	<!--- Add the closing tag to the message. --->
	<cfset strMessage = (
		Trim( strMessage ) &
		"</" &
		arrOpenTags[ intTagIndex ] &
		">"
		) />

</cfloop>


<!--- Output updated message. --->
#strMessage#

Running the above code, the new message XHTML contains this:

<p>
	Cassandra, I think this Hoops guys sounds like he's
	really into you. Sure, maybe he <em>lied</em> to you
	about being a basketball player, but he's got that
	goofy charm I just <strong><em>know you love</em></strong></p>

Notice that the EM, STRONG, and P tags were closed in the reverse order in which they were found.

I know that this solution is probably a lot more involved and complicated than you were hoping it would be. And, this is a common problem, so it's entirely possible that there is a much shorter, sexier solution out there. But, if nothing else, hopefully this can point you in a good direction.

Short link: https://bennadel.com/982

Reader Comments

Jeff Oct 1, 2007 at 3:37 PM

5 Comments

This is a very cool solution to a problem I've been wrestling with lately as well. I came up with something that ended up in basically the same place, only not as nicely and I'm sure not as efficiently.

So the problems that remain for me are:
1. A tinyMCE plugin is inserting invalid XHTML img tags (no end slash). I'll probably just have to hack the plugin to fix this.

2. What if the truncated text ends in the middle of a tag? For example, in the text I'm chopping, there may be an img tag and they end up being long enough that chances are pretty good I'm going to end up somewhere in the middle of that tag. That means I now have to figure out if there is an unclosed tag ending the snippet and, if so, remove everything after the "<".

3. If I cut the text at, say, 100 characters and there are more than a couple of html tags in there, I end up with very little text and my snippet ends up being only a word or two. I need a way to take the first n characters which are not part of an html tag. Does that make sense?

This is indeed an icky problem that I'm sure has been solved many times over, but I'm just not finding that complete and elegant solution. Thanks for pointing me down at least a cleaner path.

Ben Nadel Oct 1, 2007 at 3:48 PM

16,238 Comments

@Jeff,

I think these need to be handled separately.

1: The CFLoop can be updated to treat IMG tags as the same as a self-closing tag.

2: The truncation can be updated so that it doesn't go mid tag... or after the truncation takes place, we can clean out any non-closing tags.

3: Not sure what is the best solution for this. Hmmm.
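As a rough illustration of points 1 and 2, here is a small JavaScript sketch. Both helper names are hypothetical, and the list of tags treated as self-closing (img, br, hr) is an assumption that would need to match whatever your forum allows.

```javascript
// Assumed list of tags that never take a closing tag; extend as needed.
var voidTagPattern = /^(?:img|br|hr)$/i;

// Hypothetical helper: true if the stack logic should skip this tag, either
// because it carries a self-closing slash or because it is a void tag.
function isSelfClosing(tagName, selfCloseSlash) {
    return Boolean(selfCloseSlash) || voidTagPattern.test(tagName);
}

// Hypothetical helper: removes a tag that truncation cut off mid-way,
// that is, a trailing "<" with no ">" after it.
function stripPartialTag(html) {
    return html.replace(/<[^>]*$/, "");
}
```

One caveat with stripPartialTag: a literal, unescaped "<" near the end of the text (as in "a < b") would be removed along with everything after it, so this only makes sense when such characters are already escaped as entities.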

Jason Oct 1, 2007 at 5:29 PM

142 Comments

@Jeff,

Just a note, not certain of this, but if I recall correctly the latest version of tinyMCE has a parameter which can be configured to "turn on" XHTML adherence in its content. May solve your first problem.

-J-

Steven Levithan Oct 1, 2007 at 6:20 PM

172 Comments

@Ben, a couple suggestions:

- You can use a negative lookbehind at the end of the tag to skip over self-closed elements, so you don't have to deal with them at all. The same goes for singleton elements like img, via a negative lookahead at the start of the tag.
- Since some HTML element names contain numbers (at least the h1 - h6 elements), the "Tag" group should probably allow for that.

So, the regex could end up as something like <(/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!/)> with the CASE_INSENSITIVE modifier.
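To see what that pattern does and does not match, here is a quick JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later); the "i" flag stands in for Java's CASE_INSENSITIVE, and listTags is a hypothetical helper written only for this demonstration.

```javascript
// Hypothetical helper: lists the tags the suggested pattern sees, with a
// leading "/" marking closing tags.
function listTags(html) {
    // The pattern from the comment above, as a JavaScript literal.
    var tagPattern = /<(\/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!\/)>/gi;
    var found = [];
    var match;

    while ((match = tagPattern.exec(html)) !== null) {
        found.push((match[1] ? "/" : "") + match[2]);
    }

    return found;
}

// The BR and IMG tags are skipped by the lookahead; the H2 pair and the
// uppercase B tag are reported.
var tags = listTags('<h2>Hi</h2><br /><img src="a.png"><B>x');
```

Because self-closed and singleton tags never match, the loop that consumes these matches only has to distinguish opening tags from closing ones.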

@Jeff, regarding item 2, since Java doesn't support regex conditionals (which would make this a little cleaner), you could use something like ^[\S\s]{1,100}(?:(?<=<[^>]{0,99})[^>]*>)? to get the first 100 characters, unless it ends in the middle of a tag, in which case it will grab up to the end of the tag. The {0,99} quantifier is used instead of * because Java doesn't support unbounded quantification in lookbehind.

Regarding item 3, while this wouldn't be perfect (it would treat HTML tags as one character instead of zero, as it should), you could use something simple like ^(?:[^<]|<[^>]+>){1,100} to improve the situation a bit.
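The two truncation patterns above can be tried out with a short JavaScript sketch. The helper names are hypothetical, the character limit is passed in rather than fixed at 100, and the first helper assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical helper for item 2: take the first `maxChars` characters and,
// if the cut lands inside a tag, extend the match to the end of that tag.
function truncateOutsideTag(html, maxChars) {
    var pattern = new RegExp(
        "^[\\S\\s]{1," + maxChars + "}" +
        "(?:(?<=<[^>]{0," + (maxChars - 1) + "})[^>]*>)?"
    );
    var match = pattern.exec(html);
    return match ? match[0] : "";
}

// Hypothetical helper for item 3: take up to `maxUnits` units, where a unit
// is either one character outside a tag or one whole tag.
function leadingUnits(html, maxUnits) {
    var pattern = new RegExp("^(?:[^<]|<[^>]+>){1," + maxUnits + "}");
    var match = pattern.exec(html);
    return match ? match[0] : "";
}
```

With a limit of 10, truncateOutsideTag( 'Hello <img src="a.png"> world', 10 ) runs on to the end of the IMG tag instead of stopping inside it, while leadingUnits( "<b>hi</b> there", 3 ) counts the opening B tag as a single unit and returns it along with the two characters that follow.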

Jeff Oct 1, 2007 at 8:01 PM

5 Comments

@Jason, thanks. I'll look into your tinyMCE tip. That'd be nifty.

@Steven, frankly, I'm kind of deer-in-headlights stunned at the beauty and complexity of your regex's. I had convinced myself that what you've got there was not possible last weekend. In fact, ironically, after a couple of hours of reading through stuff on your blog, I started an email to you to ask the same question, but by the end of the message I had, as I said, convinced myself it was not possible, so I trashed the email. Now, to see if I can actually work this into my CF code and make it work.

@Ben, thanks for your suggestions and for taking on this topic in the first place. Incredibly timely for me.

Steven Levithan Oct 2, 2007 at 1:10 PM

172 Comments

@Jeff, thanks!

BTW, here's a quick and simple but complete implementation in JavaScript which solves your issues numbered 1, 2, and 3 (it should be pretty straightforward to convert it to ColdFusion):

----------
var maxChrs = 100,
    re = new RegExp("[^<]{1," + maxChrs + "}|(<(/)?(\\w+)[^>]*(/)?>)", "g"),
    match,
    output = "",
    chrCount = 0,
    openTags = [],
    selfClosingTag = /^(?:img|[hb]r)$/;

while ((match = re.exec(str)) && (chrCount < maxChrs)) {
    // If this is an HTML tag
    if (match[1]) {
        output += match[0];
        // If this is not a self-closing tag
        if (!(match[4] || selfClosingTag.test(match[3]))) {
            // If this is a closing tag
            if (match[2]) {
                openTags.pop();
            } else {
                openTags.push(match[3]);
            }
        }
    } else {
        output += match[0].substring(0, maxChrs - chrCount);
        chrCount += match[0].length;
    }
}

for (var i = openTags.length - 1; i >= 0; i--) {
    output += "</" + openTags[i] + ">";
}
----------

The input string is expected to be named str. maxChrs indicates how many characters outside of HTML tags you want to return (while ensuring that the string never ends in the middle of a tag, and all tags get closed). The above intentionally avoids looping over the string one character at a time (while counting the characters outside of HTML tags), for efficiency reasons. If IE supported the indexOf method for arrays from JavaScript 1.6, I would make selfClosingTag an array of strings instead of a regex.

Steven Levithan Oct 3, 2007 at 12:27 AM

172 Comments

I just posted a modified version of the above (which also avoids ending the string in the middle of words) at http://blog.stevenlevithan.com/archives/get-html-summary

trifide Apr 9, 2009 at 3:43 AM

1 Comments

Is there a cfm version of getLeadingHtml javascript function by Steven?
<code>
function getLeadingHtml (input, maxChars) {
    // token matches a word, tag, or special character
    var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*(\/)?>|</g,
        selfClosingTag = /^(?:[hb]r|img)$/i,
        output = "",
        charCount = 0,
        openTags = [],
        match;

    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    maxChars = maxChars || 250;

    while ((charCount < maxChars) && (match = token.exec(input))) {
        // If this is an HTML tag
        if (match[2]) {
            output += match[0];
            // If this is not a self-closing tag
            if (!(match[3] || selfClosingTag.test(match[2]))) {
                // If this is a closing tag
                if (match[1]) openTags.pop();
                else openTags.push(match[2]);
            }
        } else {
            charCount += match[0].length;
            if (charCount <= maxChars) output += match[0];
        }
    }

    // Close any tags which were left open
    var i = openTags.length;
    while (i--) output += "</" + openTags[i] + ">";

    return output;
};
</code>

Max Paperno Jan 26, 2010 at 10:34 AM

1 Comments

I re-wrote Steve's JS code as a CFM function, with a minor improvement (better tag auto-closing logic). Many thanks to Steve! That regex solution is brilliant, really helped me out.

Just posted to cflib.org as "getLeadingHTML" -- should be up soon (tried pasting here but I'm guessing it was too long).

Cheers,

-Max

Ben Nadel Jan 28, 2010 at 10:42 PM

16,238 Comments

@Max,

Yeah, Steve is a beast :) If you haven't already, I would highly recommend his O'Reilly book - Regular Expression Cookbook.
