Hey Ben, I have an internal developers forum, and when folks reply to messages I quote their original text (which contains html codes). If this text hits a certain threshold I chop the text at that point using this CF code/regex so I don't chop in the middle of a word: [code]
Before I run that bit of code, I manually ReReplace (to remove) all of the possible tags folks can enter, using a bunch of regex statements. The main problem was that if someone bolds, italicizes, underlines, or even changes a font color, and that tag were left open, it would restyle the rest of the messages in the forum following the cut message.
I've always wanted to do this in one regular expression. Perhaps I could back reference all of the previously found tags, then close them in reverse order of where they were found, after the chopping point. I'm unsure whether this is possible. What do you think?
I don't think that this can be done with a single regular expression. And, since I have been doing a lot of work with the Java Pattern / Matcher lately, I am sure I am seeing it as a solution in search of a problem, and am therefore trying to fit it into more places than it should go. That being said, I think the Java Pattern / Matcher is going to be the most straightforward way of finding out which tags have been left open.
The idea here is that we are going to loop over the message and push each opening tag onto a tag stack. Each time we find a closing tag, we pop one tag off of that stack, so that once we are done looping, the only tags left in the stack are the ones that were never closed. Self-closing tags can be ignored, as they close themselves and cannot be left open.
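To make the stack idea concrete, here is a minimal sketch in plain Java (the same platform the Pattern / Matcher comes from). It is not the code from this post: the class name `TagStackSketch` and the method `unclosedTags` are made up for illustration, and it assumes the tags have already been pulled out of the message and carry no attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class TagStackSketch {

    // Walks the tags in document order and returns the names of the tags
    // that are still open. The innermost (most recently opened) tag comes
    // first, which is the order in which they need to be closed.
    // Assumption: each entry is a bare tag such as "<p>", "</em>" or "<br />".
    static List<String> unclosedTags(List<String> tags) {
        Deque<String> stack = new ArrayDeque<>();
        for (String tag : tags) {
            if (tag.endsWith("/>")) {
                // Self-closing tag: it closes itself, nothing to track.
                continue;
            }
            if (tag.startsWith("</")) {
                // Closing tag: assume it matches the most recent open tag.
                if (!stack.isEmpty()) {
                    stack.pop();
                }
            } else {
                // Opening tag: strip "<" and ">" and remember the name.
                stack.push(tag.substring(1, tag.length() - 1));
            }
        }
        // ArrayDeque iterates from the top of the stack downward.
        return new ArrayList<>(stack);
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList(
            "<p>", "<em>", "</em>", "<br />", "<strong>", "<em>");
        System.out.println(unclosedTags(tags)); // [em, strong, p]
    }
}
```

The first EM is popped by its own closing tag and the BR never touches the stack, so only EM, STRONG, and P remain, already in closing order.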
To start off, let's simulate a forum message that contains unclosed tags:
<!---
	Save text that contains unclosed HTML tags. In this case, we
	are leaving three tags (P, STRONG, EM) open.
--->
<cfsavecontent variable="strMessage">
	<p>
		Cassandra, I think this Hoops guys sounds like he's
		really into you. Sure, maybe he <em>lied</em> to you
		about being a basketball player, but he's got that
		goofy charm I just <strong><em>know you love
</cfsavecontent>
Notice here that we are leaving the P, STRONG, and EM tags open. These are the three tags that we hope to collect in our pattern matching and then close at the end. Notice also that we are assuming the message was simply truncated, so that every tag still open is nested inside the tags opened before it. That is, people didn't close the paragraph tag while leaving the EM tag inside it open. If this is not the case, the algorithm will still run, but it will not produce valid XHTML.
Ok, let's take a look at the code:
<!--- Create an array to grab all of the open tags that need to be closed. --->
<cfset arrOpenTags = ArrayNew( 1 ) />

<!---
	Create a pattern to match HTML tags. NOTE: This is not a
	complete HTML tag matching regular expression (it does not
	take into account attribute values that contain greater-than
	signs), but for our purposes it will do. We are going to
	capture the closing slash and the self-closing slash so that
	we can easily tell what kind of tag we have. The attribute
	portion is reluctant ([^>]*?) so that it does not swallow
	the self-closing slash before group 3 can capture it. The
	(?i) flag and the [a-z0-9] class let the pattern handle
	upper case tags and tag names with digits (such as H1).
--->
<cfset objPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
	JavaCast( "string", "(?i)<(/)?([a-z][a-z0-9]*)[^>]*?(/)?>" )
	) />

<!--- Grab the pattern matcher for our target text. --->
<cfset objMatcher = objPattern.Matcher( JavaCast( "string", strMessage ) ) />

<!---
	Now, we want to loop over the message collecting tags. For
	each tag that we encounter, if it's a self-closing tag, we
	want to ignore it. If it's an opening tag, we want to add it
	to the stack, and if it's a closing tag, we want to pop one
	tag off of the stack. Assuming valid XHTML, each close tag
	should correspond to the TOP tag on the stack.
--->
<cfloop condition="objMatcher.Find()">

	<!--- Grab the close slash. --->
	<cfset REQUEST.Close = objMatcher.Group( JavaCast( "int", 1 ) ) />

	<!--- Grab the tag name. --->
	<cfset REQUEST.Tag = objMatcher.Group( JavaCast( "int", 2 ) ) />

	<!--- Grab the self-close slash. --->
	<cfset REQUEST.SelfClose = objMatcher.Group( JavaCast( "int", 3 ) ) />

	<!---
		Since the two slashes are optional groups, they might not
		exist. Therefore, we need to check to see if their NULLness
		destroyed the variable in order to check for matching.
	--->
	<cfif StructKeyExists( REQUEST, "SelfClose" )>

		<!---
			Self-closing tags close themselves, so we don't need
			to worry about them.
		--->

	<cfelseif StructKeyExists( REQUEST, "Close" )>

		<!---
			This is a closing tag that, given properly nested and
			valid XHTML, should correspond to the tag on the top of
			the stack (bottom of our array). Therefore, pop the tag
			off of the bottom. The length check keeps a stray
			closing tag from throwing an error on an empty array.
		--->
		<cfif ArrayLen( arrOpenTags )>
			<cfset ArrayDeleteAt( arrOpenTags, ArrayLen( arrOpenTags ) ) />
		</cfif>

	<cfelse>

		<!---
			This is an open tag, so push it on to the top of the
			stack (bottom of our array).
		--->
		<cfset ArrayAppend( arrOpenTags, REQUEST.Tag ) />

	</cfif>

</cfloop>

<!---
	At this point, we have collected all of the unclosed tags in
	our stack. Now, all we have to do is loop over the array
	(backwards) and close the tags in that order.
--->
<cfloop index="intTagIndex" from="#ArrayLen( arrOpenTags )#" to="1" step="-1">

	<!--- Add the closing tag to the message. --->
	<cfset strMessage = ( Trim( strMessage ) & "</" & arrOpenTags[ intTagIndex ] & ">" ) />

</cfloop>

<!--- Output updated message. --->
#strMessage#
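The optional groups are the subtle part, so here is a small sketch in plain Java of how they behave. A group that takes no part in the match comes back as null from `Matcher.group()`, which is the NULLness the ColdFusion code checks for. The sketch also compares a greedy `[^>]*` with a reluctant `[^>]*?` in a pattern of this shape: the greedy form takes the trailing slash of a self-closing tag for itself, so group 3 never gets it. The class name `TagGroupsSketch` and the `classify` helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagGroupsSketch {

    // Greedy attribute portion: [^>]* consumes the trailing slash itself.
    static final Pattern GREEDY =
        Pattern.compile("<(/)?([a-z]+)[^>]*(/)?>");

    // Reluctant attribute portion: [^>]*? leaves the slash for group 3.
    static final Pattern RELUCTANT =
        Pattern.compile("<(/)?([a-z]+)[^>]*?(/)?>");

    // Classifies the first tag found in the input using groups 1 to 3.
    // group(n) returns null when that optional group did not match.
    static String classify(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (!m.find()) {
            return "none";
        }
        if (m.group(3) != null) {
            return "self-close:" + m.group(2);
        }
        if (m.group(1) != null) {
            return "close:" + m.group(2);
        }
        return "open:" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(classify(RELUCTANT, "<strong>")); // open:strong
        System.out.println(classify(RELUCTANT, "</em>"));    // close:em
        System.out.println(classify(RELUCTANT, "<br />"));   // self-close:br
        System.out.println(classify(GREEDY, "<br />"));      // open:br
    }
}
```

With the greedy form, a BR tag would be pushed onto the stack as if it were left open, and would end up with a closing tag it doesn't need.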
Running the above code, the new message XHTML contains this:
<p> Cassandra, I think this Hoops guys sounds like he's really into you. Sure, maybe he <em>lied</em> to you about being a basketball player, but he's got that goofy charm I just <strong><em>know you love</em></strong></p>
Notice that the EM, STRONG, and P tags were closed in the reverse order in which they were found.
I know that this solution is probably a lot more involved and complicated than you were hoping it would be. And, this is a common problem, so it's entirely possible that there is a much shorter, sexier solution out there. But, if nothing else, hopefully this can point you in a good direction.