Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Ask Ben: Closing XHTML Tags In A Truncated Message

By Ben Nadel on

Hey Ben, I have an internal developers forum, and when folks reply to messages I quote their original text (which contains html codes). If this text hits a certain threshold I chop the text at that point using this CF code/regex so I don't chop in the middle of a word: [code]

Before I run that bit of code I manually ReReplace (to remove) all of the possible tags folks can enter in a bunch of regex statements. The main problem was that if someone bolds, italicizes, underlines, or even changes a font color, and that tag is left open, it colorizes the rest of the messages in the forum following the originally cut message.

I've always wanted to do this in one regular expression. Perhaps I could back reference all of the previous found tags. Then close them in reverse order of where they were found after the chopping point, but I'm unsure this is possible? What do you think?

I don't think that this can be done with a single regular expression. And, since I have been doing a lot of work with the Java Pattern / Matcher lately, I am sure I am seeing it as a solution in search of a problem, and am therefore trying to fit it into more places than it should go. That being said, I think the Java Pattern Matcher is going to be the most straightforward way of finding out which tags have been left open.

The idea here is that we are going to loop over the message and push each opening tag onto a tag stack. Each time we find a closing tag, we pop one tag off of that stack, so that after we are done looping, the only tags left in the stack are the ones that were never closed. Self-closing tags can be ignored, as they close themselves and cannot be left open.
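The stack idea described above is not specific to ColdFusion. As a rough illustration only, here is a minimal JavaScript sketch of the same walk. The function names `findUnclosedTags` and `closeOpenTags` are hypothetical, and the sketch assumes properly nested markup with no greater-than signs inside attribute values.

```javascript
// Hypothetical helper: returns the names of the tags that are still open,
// outermost first. Assumes well-nested markup (an assumption, not a check).
function findUnclosedTags(html) {
    // Group 1: closing slash, group 2: tag name, group 3: self-closing slash.
    // The attribute part is lazy so a trailing "/" is left for group 3.
    var tagPattern = /<(\/)?([a-z][a-z\d]*)[^>]*?(\/)?>/gi;
    var openTags = [];
    var match;

    while ((match = tagPattern.exec(html)) !== null) {
        if (match[3]) {
            // Self-closing tag: it closes itself, nothing to track.
            continue;
        }
        if (match[1]) {
            // Closing tag: pop the most recently opened tag.
            openTags.pop();
        } else {
            // Opening tag: push its name onto the stack.
            openTags.push(match[2]);
        }
    }

    return openTags;
}

// Hypothetical helper: appends closing tags in reverse order of opening.
function closeOpenTags(html) {
    var openTags = findUnclosedTags(html);
    var output = html.replace(/\s+$/, "");

    for (var i = openTags.length - 1; i >= 0; i--) {
        output += "</" + openTags[i] + ">";
    }

    return output;
}
```

Calling `closeOpenTags()` on a message that was cut off after `<p>`, `<strong>`, and `<em>` were opened appends `</em></strong></p>` to the trimmed text.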

To start off, let's simulate a forum message that contains unclosed tags:

    <!---
        Save text that contains unclosed HTML tags. In this case,
        we are leaving all three tags (P, STRONG, EM) open.
    --->
    <cfsavecontent variable="strMessage">

    <p>
    Cassandra, I think this Hoops guys sounds like he's
    really into you. Sure, maybe he lied to you about
    being a basketball player, but he's got that goofy
    charm I just <strong><em>know you love

    </cfsavecontent>

Notice here that we are leaving the P, STRONG, and EM tags open. These are the three tags that we hope to collect in our pattern matching and then close at the end. Notice also that our assumption is that the message was simply truncated, so the tags that are present are properly nested. That is, people didn't close the paragraph tag but leave the EM tag open. If this is not the case, the algorithm will still run, but it will not produce valid XHTML.

Ok, let's take a look at the code:

    <!---
        Create an array to grab all of the open tags that
        need to be closed.
    --->
    <cfset arrOpenTags = ArrayNew( 1 ) />

    <!---
        Create a pattern to match HTML tags. NOTE: This is not
        a complete HTML tag matching regular expression (it does
        not take into account attribute values with greater than
        signs), but for our purposes it will do. We are going
        to capture the closing slash and self-closing slash so
        that we can easily tell what kind of tag we have. The
        attribute portion is matched lazily ( [^>]*? ) so that a
        trailing slash is left over for the self-closing group.
    --->
    <cfset objPattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            JavaCast( "string", "<(/)?([a-z]+)[^>]*?(/)?>" )
            )
        />

    <!--- Grab the pattern matcher for our target text. --->
    <cfset objMatcher = objPattern.Matcher(
        JavaCast( "string", strMessage )
        ) />

    <!---
        Now, we want to loop over the message collecting tags. For
        each tag that we encounter, if it's a self-closing tag we
        want to ignore it. If it's an opening tag, we want to add
        it to the stack and if it's a closing tag, we want to pop
        one tag off of the stack - Assuming valid XHTML, each close
        tag should correspond to the TOP tag on the stack.
    --->
    <cfloop condition="objMatcher.Find()">

        <!--- Grab the close slash. --->
        <cfset REQUEST.Close = objMatcher.Group(
            JavaCast( "int", 1 )
            ) />

        <!--- Grab the tag name. --->
        <cfset REQUEST.Tag = objMatcher.Group(
            JavaCast( "int", 2 )
            ) />

        <!--- Grab the self-close slash. --->
        <cfset REQUEST.SelfClose = objMatcher.Group(
            JavaCast( "int", 3 )
            ) />


        <!---
            Since the two slashes are optional groups, they might
            not exist. Therefore, we need to check to see if their
            NULLness destroyed the variable in order to check for
            matching.
        --->
        <cfif StructKeyExists( REQUEST, "SelfClose" )>

            <!---
                Self-closing tags close themselves, so we don't
                need to worry about them.
            --->

        <cfelseif StructKeyExists( REQUEST, "Close" )>

            <!---
                This is a closing tag that, given properly
                nested and valid XHTML, should correspond to the
                tag on the top of the stack (bottom of our array).
                Therefore, pop the tag off of the bottom. Guard
                against a stray closing tag so that we never try
                to delete from an empty array.
            --->
            <cfif ArrayLen( arrOpenTags )>

                <cfset ArrayDeleteAt(
                    arrOpenTags,
                    ArrayLen( arrOpenTags )
                    ) />

            </cfif>

        <cfelse>

            <!---
                This is an open tag, so push it onto the top of
                the stack (bottom of our array).
            --->
            <cfset ArrayAppend(
                arrOpenTags,
                REQUEST.Tag
                ) />

        </cfif>

    </cfloop>


    <!---
        At this point, we have collected all the unclosed
        tags in our stack. Now, all we have to do is loop over
        the array (backwards) and close the tags in that order.
    --->
    <cfloop
        index="intTagIndex"
        from="#ArrayLen( arrOpenTags )#"
        to="1"
        step="-1">

        <!--- Add the closing tag to the message. --->
        <cfset strMessage = (
            Trim( strMessage ) &
            "</" &
            arrOpenTags[ intTagIndex ] &
            ">"
            ) />

    </cfloop>


    <!--- Output updated message. --->
    #strMessage#

Running the above code, the new message XHTML contains this:

    <p>
    Cassandra, I think this Hoops guys sounds like he's
    really into you. Sure, maybe he lied to you about
    being a basketball player, but he's got that goofy
    charm I just <strong><em>know you love</em></strong></p>

Notice that the EM, STRONG, and P tags were closed in the reverse order in which they were found.

I know that this solution is probably a lot more involved and complicated than you were hoping it would be. And, this is a common problem, so it's entirely possible that there is a much shorter, sexier solution out there. But, if nothing else, hopefully this can point you in a good direction.



Reader Comments

This is a very cool solution to a problem I've been wrestling with lately as well. I came up with something that ended up in basically the same place, only not as nicely and I'm sure not as efficiently.

So the problems that remain for me are:
1. A tinyMCE plugin is inserting invalid XHTML img tags (no end slash). I'll probably just have to hack the plugin to fix this.

2. What if the truncated text ends in the middle of a tag? For example, in the text I'm chopping, there may be an img tag and they end up being long enough that chances are pretty good I'm going to end up somewhere in the middle of that tag. That means I now have to figure out if there is an unclosed tag ending the snippet and, if so, remove everything after the "<".

3. If I cut the text at, say, 100 characters and there are more than a couple of html tags in there, I end up with very little text and my snippet ends up being only a word or two. I need a way to take the first n characters which are not part of an html tag. Does that make sense?

This is indeed an icky problem that I'm sure has been solved many times over, but I'm just not finding that complete and elegant solution. Thanks for pointing me down at least a cleaner path.


@Jeff,

I think these need to be handled separately.

1: The CFLoop can be updated to treat IMG tags the same as self-closing tags.

2: The truncation can be updated so that it doesn't stop mid-tag... or, after the truncation takes place, we can clean out any tag that was cut off before its closing bracket.

3: Not sure what is the best solution for this. Hmmm.
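To make points 1 and 2 a little more concrete, here is a minimal JavaScript sketch. Both helper names and the short list of elements are assumptions for illustration; a real forum would likely need a longer list.

```javascript
// Hypothetical helper for point 2: after truncation, drop a tag that was
// cut off before its closing ">" (a "<" with no ">" before the end).
function stripPartialTag(text) {
    return text.replace(/<[^>]*$/, "");
}

// Hypothetical helper for point 1: elements to treat as self-closing even
// when they are written without a trailing slash. The list is an assumption.
function isVoidElement(tagName) {
    return /^(?:img|br|hr)$/i.test(tagName);
}
```

For example, `stripPartialTag('Hello <img src="a')` returns `'Hello '`, while a string whose tags are all complete comes back unchanged.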


@Jeff,

Just a note, not certain of this, but if I recall correctly the latest version of tinyMCE has a parameter which can be configured to "turn on" XHTML adherence in its content. May solve your first problem.

-J-


@Ben, a couple suggestions:

- You can use a negative lookbehind at the end of the tag to skip over self-closed elements, so you don't have to deal with them at all. The same goes for singleton elements like img, via a negative lookahead at the start of the tag.
- Since some HTML element names contain numbers (at least the h1 - h6 elements), the "Tag" group should probably allow for that.

So, the regex could end up as something like <(/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!/)> with the CASE_INSENSITIVE modifier.
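As a rough illustration of what that pattern does and does not match, here is the same expression tried out in JavaScript. This is only a sketch: the function name `listTrackedTags` is hypothetical, the `i` flag stands in for Java's CASE_INSENSITIVE modifier, and lookbehind requires an ES2018-capable engine.

```javascript
// Hypothetical helper: lists the tags the pattern would hand to the stack
// logic, with a leading "/" marking closing tags.
function listTrackedTags(html) {
    // Same pattern as above, written as a JavaScript literal.
    var tagPattern = /<(\/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!\/)>/gi;
    var found = [];
    var match;

    while ((match = tagPattern.exec(html)) !== null) {
        found.push((match[1] ? "/" : "") + match[2]);
    }

    return found;
}
```

Given `'<p>Line<br />one <img src="a.gif"><h2>Title</h2>'`, the helper reports only `p`, `h2`, and `/h2`; the BR and IMG tags never reach the stack logic.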

@Jeff, regarding item 2, since Java doesn't support regex conditionals (which would make this a little cleaner), you could use something like ^[\S\s]{1,100}(?:(?<=<[^>]{0,99})[^>]*>)? to get the first 100 characters, unless it ends in the middle of a tag, in which case it will grab up to the end of the tag. The {0,99} quantifier is used instead of * because Java doesn't support unbounded quantification in lookbehind.
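A minimal sketch of that truncation pattern in JavaScript, with the limit passed in rather than fixed at 100. The function name `truncateOutsideTag` is hypothetical, and the sketch assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical helper: returns the first maxLen characters, extended to the
// end of the tag when the cut would otherwise land inside one.
function truncateOutsideTag(text, maxLen) {
    var pattern = new RegExp(
        "^[\\S\\s]{1," + maxLen + "}" +
        // Only when we are inside a tag, run on to its closing ">".
        "(?:(?<=<[^>]{0," + (maxLen - 1) + "})[^>]*>)?"
    );
    var match = pattern.exec(text);

    return match ? match[0] : "";
}
```

With a limit of 10, `'Hello <a href="x">world</a>'` is cut inside the A tag, so the result runs on to `'Hello <a href="x">'` instead of stopping at `'Hello <a h'`.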

Regarding item 3, while this wouldn't be perfect (it would treat HTML tags as one character instead of zero, as it should), you could use something simple like ^(?:[^<]|<[^>]+>){1,100} to improve the situation a bit.
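Here is a small JavaScript sketch of that second pattern, again with the limit passed in. The function name `leadingUnits` is hypothetical. Each plain character and each whole tag counts as one unit, so the result never ends in the middle of a tag.

```javascript
// Hypothetical helper: returns up to maxUnits leading units, where a unit is
// either one character outside of a tag or one complete tag.
function leadingUnits(text, maxUnits) {
    var pattern = new RegExp("^(?:[^<]|<[^>]+>){1," + maxUnits + "}");
    var match = pattern.exec(text);

    return match ? match[0] : "";
}
```

For `'<b>Hello</b> world'` and a limit of 5, the result is `'<b>Hell'`: one unit for the B tag and four for the letters. The open B tag would still need to be closed afterwards.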


@Jason, thanks. I'll look into your tinyMCE tip. That'd be nifty.

@Steven, frankly, I'm kind of deer-in-headlights stunned at the beauty and complexity of your regexes. Last weekend, I had convinced myself that what you've got there was not possible. In fact, ironically, after a couple of hours of reading through stuff on your blog, I started an email to you to ask the same question, but by the end of the message I had, as I said, convinced myself it was not possible, so I trashed the email. Now, to see if I can actually work this into my CF code and make it work.

@Ben, thanks for your suggestions and for taking on this topic in the first place. Incredibly timely for me.


@Jeff, thanks!

BTW, here's a quick and simple but complete implementation in JavaScript which solves your issues numbered 1, 2, and 3 (it should be pretty straightforward to convert it to ColdFusion):

----------
var maxChrs = 100,
    // [^>]*? is lazy so that a trailing "/" is left for the self-closing group
    re = new RegExp("[^<]{1," + maxChrs + "}|(<(/)?(\\w+)[^>]*?(/)?>)", "g"),
    match,
    output = "",
    chrCount = 0,
    openTags = [],
    selfClosingTag = /^(?:img|[hb]r)$/;

while ((match = re.exec(str)) && (chrCount < maxChrs)) {
    // If this is an HTML tag
    if (match[1]) {
        output += match[0];
        // If this is not a self-closing tag
        if (!(match[4] || selfClosingTag.test(match[3]))) {
            // If this is a closing tag
            if (match[2]) {
                openTags.pop();
            } else {
                openTags.push(match[3]);
            }
        }
    } else {
        output += match[0].substring(0, maxChrs - chrCount);
        chrCount += match[0].length;
    }
}

for (var i = openTags.length - 1; i >= 0; i--) {
    output += "</" + openTags[i] + ">";
}
----------

The input string is expected to be named str. maxChrs indicates how many characters outside of HTML tags you want to return (while ensuring that the string never ends in the middle of a tag, and all tags get closed). The above intentionally avoids looping over the string one character at a time (while counting the characters outside of HTML tags), for efficiency reasons. If IE supported the indexOf method for arrays from JavaScript 1.6, I would make selfClosingTag an array of strings instead of a regex.


Is there a cfm version of getLeadingHtml javascript function by Steven?
<code>
function getLeadingHtml (input, maxChars) {
    // token matches a word, tag, or special character
    // ([^>]*? is lazy so that a trailing "/" is left for the self-closing group)
    var token = /\w+|[^\w<]|<(\/)?(\w+)[^>]*?(\/)?>|</g,
        selfClosingTag = /^(?:[hb]r|img)$/i,
        output = "",
        charCount = 0,
        openTags = [],
        match;

    // Set the default for the max number of characters
    // (only counts characters outside of HTML tags)
    maxChars = maxChars || 250;

    while ((charCount < maxChars) && (match = token.exec(input))) {
        // If this is an HTML tag
        if (match[2]) {
            output += match[0];
            // If this is not a self-closing tag
            if (!(match[3] || selfClosingTag.test(match[2]))) {
                // If this is a closing tag
                if (match[1]) openTags.pop();
                else openTags.push(match[2]);
            }
        } else {
            charCount += match[0].length;
            if (charCount <= maxChars) output += match[0];
        }
    }

    // Close any tags which were left open
    var i = openTags.length;
    while (i--) output += "</" + openTags[i] + ">";

    return output;
}
</code>


I re-wrote Steve's JS code as a CFM function, with a minor improvement (better tag auto-closing logic). Many thanks to Steve! That regex solution is brilliant, really helped me out.

Just posted to cflib.org as "getLeadingHTML" -- should be up soon (tried pasting here but I'm guessing it was too long).

Cheers,

-Max


@Max,

Yeah, Steve is a beast :) If you haven't already, I would highly recommend his O'Reilly book - Regular Expressions Cookbook.

