Hey Ben, I have an internal developers forum, and when folks reply to messages I quote their original text (which contains html codes). If this text hits a certain threshold I chop the text at that point using this CF code/regex so I don't chop in the middle of a word: [code]
Before I run that bit of code I manually ReReplace (to remove) all of the possible tags folks can enter in a bunch of regex statements. The main problem was that if someone bolds, italicizes, underlines or even changes a font color, and that tag were left open, it would colorize the rest of the messages in the forum following the originally cut message.
I've always wanted to do this in one regular expression. Perhaps I could back reference all of the previous found tags. Then close them in reverse order of where they were found after the chopping point, but I'm unsure this is possible? What do you think?
I don't think that this can be done with a single regular expression. And, since I have been doing a lot of work with the Java Pattern / Matcher lately, I am sure I am seeing it as a solution in search of a problem, and am therefore trying to fit it into more places than it should go. That being said, I think the Java Pattern Matcher is going to be the most straightforward way of finding out which tags have been left open.
The idea here is that we are going to loop over the message and copy all the open tags into a tag stack. As we find closing tags, we pop one off of that stack, so that after we are done looping, the only tags left in the stack are the ones that were never closed. Self-closing tags can be ignored, as they close themselves and cannot be left open.
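As a rough illustration of the stack idea, here is a minimal sketch in JavaScript rather than the ColdFusion / Java Pattern Matcher approach discussed in this post. The function name closeOpenTags and the tag pattern are assumptions made for the example, and it presumes the message does not end in the middle of a tag:

```javascript
// Hypothetical helper (not the post's ColdFusion code): append closing tags
// for every tag that is still open at the end of an HTML fragment.
function closeOpenTags(html) {
  // Group 1: "/" when this is a closing tag.
  // Group 2: the tag name.
  // Group 3: "/" when the tag is self-closing, e.g. <br />.
  var tagPattern = /<(\/)?([a-z][a-z0-9]*)\b[^>]*?(\/)?>/gi;
  var openTags = [];
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    if (match[3]) {
      continue;                // self-closing tags cannot be left open
    }
    if (match[1]) {
      openTags.pop();          // closing tag: pop the most recent open tag
    } else {
      openTags.push(match[2]); // opening tag: remember it
    }
  }

  // Whatever is left on the stack was never closed; close it in reverse order.
  var result = html;
  while (openTags.length > 0) {
    result += "</" + openTags.pop() + ">";
  }
  return result;
}

// Made-up sample input, for illustration only.
var closed = closeOpenTags("<p>Hello <strong>big <em>world");
```

Because the stack is popped from the end, the most recently opened tag is always closed first.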
To start off, let's simulate a forum message that contains unclosed tags:
[code]
Notice here that we are leaving the P, STRONG, and EM tags open. These are the three tags that we hope to collect in our pattern matching and then close at the end. Notice also that our assumption is that the message was simply truncated, so the open tags are properly nested. That is, people didn't CLOSE the paragraph tag but leave the EM tag open. If this is not the case, the algorithm will still run, but will not produce valid XHTML.
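To make the nesting caveat concrete, here is a small JavaScript sketch (again, not the ColdFusion code from this post). The helper name naiveClose and both sample strings are made up for illustration, and the sketch assumes the input contains no self-closing tags. It pops the stack on every closing tag without checking the tag name:

```javascript
// Hypothetical helper: close leftover tags by blindly popping the stack
// whenever any closing tag is seen. Assumes no self-closing tags in the input.
function naiveClose(html) {
  var tagPattern = /<(\/)?([a-z][a-z0-9]*)\b[^>]*>/gi;
  var stack = [];
  var match;

  while ((match = tagPattern.exec(html)) !== null) {
    if (match[1]) {
      stack.pop();          // closing tag: pop whatever is on top, name unchecked
    } else {
      stack.push(match[2]); // opening tag
    }
  }

  var out = html;
  while (stack.length > 0) {
    out += "</" + stack.pop() + ">";
  }
  return out;
}

// Properly nested, simply truncated input.
var wellNested = naiveClose("<p>One <em>two");

// Mis-nested input: </p> appears while <em> is still open.
var misNested = naiveClose("<p>One <em>two</p> three");
```

With the first input the EM and P tags are closed in reverse order. With the second, the stray closing P pops EM off the stack, so the result ends with a second closing P and the EM is never closed, which is why the output is not valid XHTML.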
Ok, let's take a look at the code:
[code]
Running the above code, the new message XHTML contains this:
[code]
Notice that the EM, STRONG, and P tags were closed in the reverse order in which they were found.
I know that this solution is probably a lot more involved and complicated than you were hoping it would be. And, this is a common problem, so it's entirely possible that there is a much shorter, sexier solution out there. But, if nothing else, hopefully this can point you in a good direction.
Comments (7)
This is a very cool solution to a problem I've been wrestling with lately as well. I came up with something that ended up in basically the same place, only not as nicely and I'm sure not as efficiently.
So the problems that remain for me are:
1. A tinyMCE plugin is inserting invalid XHTML img tags (no end slash). I'll probably just have to hack the plugin to fix this.
2. What if the truncated text ends in the middle of a tag? For example, in the text I'm chopping, there may be an img tag and they end up being long enough that chances are pretty good I'm going to end up somewhere in the middle of that tag. That means I now have to figure out if there is an unclosed tag ending the snippet and, if so, remove everything after the "<".
3. If I cut the text at, say, 100 characters and there are more than a couple of html tags in there, I end up with very little text and my snippet ends up being only a word or two. I need a way to take the first n characters which are not part of an html tag. Does that make sense?
This is indeed an icky problem that I'm sure has been solved many times over, but I'm just not finding that complete and elegant solution. Thanks for pointing me down at least a cleaner path.
Posted by Jeff on Oct 1, 2007 at 3:37 PM
@Jeff,
I think these need to be handled separately.
1: The CFLoop can be updated to treat IMG tags the same as self-closing tags.
2: The truncation can be updated so that it doesn't stop mid-tag... or, after the truncation takes place, we can clean out any tag that was cut off before its closing bracket.
3: Not sure what is the best solution for this. Hmmm.
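A rough JavaScript sketch of points 1 and 2, assuming the snippet has already been cut to length. The names stripPartialTag, voidTags, and isVoidTag are hypothetical, and the tag list is illustrative rather than complete:

```javascript
// Point 2: drop a tag that was cut off before its closing ">".
function stripPartialTag(snippet) {
  // A "<" followed only by non-">" characters up to the end of the string
  // is treated as an unterminated tag and removed.
  return snippet.replace(/<[^>]*$/, "");
}

// Point 1: treat tags such as <img> as self-closing even when the
// trailing slash is missing. Illustrative list only.
var voidTags = ["img", "br", "hr"];

function isVoidTag(tagName) {
  return voidTags.indexOf(tagName.toLowerCase()) !== -1;
}

// Made-up sample input: the cut landed inside an img tag.
var cleaned = stripPartialTag('Look <img src="a.gi');
```

One caveat: a literal "<" in the text with no ">" after it (as in "a < b") would also be treated as the start of a cut-off tag, so this only suits content where such characters are already escaped.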
Posted by Ben Nadel on Oct 1, 2007 at 3:48 PM
@Jeff,
Just a note, not certain of this, but if I recall correctly the latest version of tinyMCE has a parameter which can be configured to "turn on" XHTML adherence in its content. May solve your first problem.
-J-
Posted by Jason on Oct 1, 2007 at 5:29 PM
@Ben, a couple suggestions:
- You can use a negative lookbehind at the end of the tag to skip over self-closed elements, so you don't have to deal with them at all. The same goes for singleton elements like img, via a negative lookahead at the start of the tag.
- Since some HTML element names contain numbers (at least the h1 - h6 elements), the "Tag" group should probably allow for that.
So, the regex could end up as something like <(/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!/)> with the CASE_INSENSITIVE modifier.
@Jeff, regarding item 2, since Java doesn't support regex conditionals (which would make this a little cleaner), you could use something like ^[\S\s]{1,100}(?:(?<=<[^>]{0,99})[^>]*>)? to get the first 100 characters, unless it ends in the middle of a tag, in which case it will grab up to the end of the tag. The {0,99} quantifier is used instead of * because Java doesn't support unbounded quantification in lookbehind.
Regarding item 3, while this wouldn't be perfect (it would treat HTML tags as one character instead of zero, as it should), you could use something simple like ^(?:[^<]|<[^>]+>){1,100} to improve the situation a bit.
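For anyone who wants to try the three patterns above, here is a hedged sketch in JavaScript rather than Java. It assumes an engine with lookbehind support (ES2018 or later), and the function names are made up for the example. The length limit is a parameter so the behavior is easy to check on short strings:

```javascript
// Lists opening and closing tags, skipping self-closed tags (negative
// lookbehind before ">") and img/br/hr (negative lookahead after "<").
function openAndCloseTags(html) {
  var re = /<(\/)?(?!img|[hb]r)([a-z][a-z\d]*)[^>]*(?<!\/)>/gi;
  var found = [];
  var m;
  while ((m = re.exec(html)) !== null) {
    found.push((m[1] || "") + m[2]);
  }
  return found;
}

// Item 2: take the first n characters; if the cut lands inside a tag,
// extend the match to the end of that tag.
function cutWithoutSplittingTag(text, n) {
  var re = new RegExp(
    "^[\\S\\s]{1," + n + "}(?:(?<=<[^>]{0," + (n - 1) + "})[^>]*>)?"
  );
  var m = re.exec(text);
  return m ? m[0] : "";
}

// Item 3: take the first n units, where a whole tag counts as one unit.
function cutCountingTagsAsOne(text, n) {
  var re = new RegExp("^(?:[^<]|<[^>]+>){1," + n + "}");
  var m = re.exec(text);
  return m ? m[0] : "";
}

var tags = openAndCloseTags('<p>a<br>b<img src="x.gif"><hr />c</p><h2>');
var cutA = cutWithoutSplittingTag("Hi <strong>there</strong>", 6);
var cutB = cutCountingTagsAsOne("<b>Hi</b> you", 4);
```

In the second call the sixth character falls inside the STRONG tag, so the match is extended through the end of that tag instead of stopping mid-tag.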
Posted by Steven Levithan on Oct 1, 2007 at 6:20 PM
@Jason, thanks. I'll look into your tinyMCE tip. That'd be nifty.
@Steven, frankly, I'm kind of deer-in-headlights stunned at the beauty and complexity of your regexes. Last weekend I had convinced myself that what you've got there was not possible. In fact, ironically, after a couple of hours of reading through stuff on your blog, I started an email to you to ask the same question, but by the end of the message I had, as I said, convinced myself it was not possible, so I trashed the email. Now, to see if I can actually work this into my CF code and make it work.
@Ben, thanks for your suggestions and for taking on this topic in the first place. Incredibly timely for me.
Posted by Jeff on Oct 1, 2007 at 8:01 PM
@Jeff, thanks!
BTW, here's a quick and simple but complete implementation in JavaScript which solves your issues numbered 1, 2, and 3 (it should be pretty straightforward to convert it to ColdFusion):
----------
var maxChrs = 100,
    // Matches either a run of text or a whole tag. [^>]*? is lazy so that
    // a trailing "/" lands in group 4 instead of being swallowed.
    re = new RegExp("[^<]{1," + maxChrs + "}|(<(/)?(\\w+)[^>]*?(/)?>)", "g"),
    match,
    output = "",
    chrCount = 0,
    openTags = [],
    selfClosingTag = /^(?:img|[hb]r)$/i;

while ((match = re.exec(str)) && (chrCount < maxChrs)) {
    // If this is an HTML tag
    if (match[1]) {
        output += match[0];
        // If this is not a self-closing tag
        if (!(match[4] || selfClosingTag.test(match[3]))) {
            // If this is a closing tag
            if (match[2]) {
                openTags.pop();
            } else {
                openTags.push(match[3]);
            }
        }
    } else {
        // Plain text: copy only as much as the remaining budget allows
        output += match[0].substring(0, maxChrs - chrCount);
        chrCount += match[0].length;
    }
}

// Close any tags still open, most recently opened first
for (var i = openTags.length - 1; i >= 0; i--) {
    output += "</" + openTags[i] + ">";
}
----------
The input string is expected to be named str. maxChrs indicates how many characters outside of HTML tags you want to return (while ensuring that the string never ends in the middle of a tag, and all tags get closed). The above intentionally avoids looping over the string one character at a time (while counting the characters outside of HTML tags), for efficiency reasons. If IE supported the indexOf method for arrays from JavaScript 1.6, I would make selfClosingTag an array of strings instead of a regex.
Posted by Steven Levithan on Oct 2, 2007 at 1:10 PM
I just posted a modified version of the above (which also avoids ending the string in the middle of words) at http://blog.stevenlevithan.com/archives/get-html-summary
Posted by Steven Levithan on Oct 3, 2007 at 12:27 AM