Cleaning High Ascii Values For Web Safeness In ColdFusion

Posted February 14, 2008 at 8:18 AM

Tags: ColdFusion

On Tuesday, Ray Camden came to speak at the New York ColdFusion User Group about RSS feed parsing and creation. During the presentation, he talked about RSS breaking when ColdFusion hits a high ascii value that it doesn't recognize (it apparently mishandles or skips the character, which breaks the XML). This got me thinking about how to remove high ascii values from a string, or at least to clean them to be more web safe. I think we have all done this before, where we have to clean FORM submission data to make sure someone didn't copy and paste from Microsoft Word and get some crazy "smart" characters. In the past, to do this, I have done lots of Replace() calls on the FORM values.

But, Tuesday night, when Ray was speaking, I started, as I often do, to think about regular expressions. Regular expressions have brought so much joy and happiness into my life, I wondered if maybe they could help me with this problem as well. And so, I hopped over to the Java 2 Pattern class documentation to see what it could handle. Immediately, I was quite pleased to see that it had a way to match patterns based on the hexadecimal value of characters:

\xhh - The character with hexadecimal value 0xhh

Since we all know about the ASCII table and how to convert decimal values to hexadecimal values (or at least how to look it up), we can easily put together a regular expression pattern to match high ascii values. The only super, ultra safe characters are the first 128 characters (ascii values 0 - 127); these are HEX values 00 to 7F. Taking that information, we can now build a pattern that matches characters that are NOT in that ascii value range:

[^\x00-\x7F]

With that pattern, we are going to have access not only to the high ascii values we know exist (such as the Microsoft Smart Quotes), but also to all the unexpected high ascii values that people enter with their data. This means that we are not going to let anything slip through the cracks.
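To see the pattern on its own, here is a minimal sketch in plain Java, since the ColdFusion code in this post hands the work to the same java.util.regex classes. The class name HighAsciiPatternDemo and the helper countNonAscii() are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original post.
class HighAsciiPatternDemo {

    // Matches any single character outside the 7-bit ASCII range (0x00-0x7F).
    static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Counts how many characters in the text fall outside the ASCII range.
    static int countNonAscii(String text) {
        Matcher matcher = NON_ASCII.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The escapes below are e-circumflex (U+00EA) and e-grave (U+00E8).
        System.out.println(countNonAscii("Vous \u00EAtes tr\u00E8s mignon")); // prints 2
        System.out.println(countNonAscii("plain ascii"));                     // prints 0
    }
}
```

Note that the backslashes are doubled only because this is a Java string literal; the pattern handed to the regex engine is still [^\x00-\x7F].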

To encapsulate this functionality, I have created a ColdFusion user defined function, CleanHighAscii():


<cffunction
    name="CleanHighAscii"
    access="public"
    returntype="string"
    output="false"
    hint="Cleans extended ascii values to make them as web safe as possible.">

    <!--- Define arguments. --->
    <cfargument
        name="Text"
        type="string"
        required="true"
        hint="The string that we are going to be cleaning."
        />

    <!--- Set up local scope. --->
    <cfset var LOCAL = {} />

    <!---
        When cleaning the string, there are going to be ascii
        values that we want to target, but there are also going
        to be high ascii values that we don't expect. Therefore,
        we have to create a pattern that simply matches all non
        low-ASCII characters. This will find all characters that
        are NOT in the first 128 ascii values (0 - 127). To do
        this, we are using the 2-digit hex encoding of values.
    --->
    <cfset LOCAL.Pattern = CreateObject(
        "java",
        "java.util.regex.Pattern"
        ).Compile(
            JavaCast( "string", "[^\x00-\x7F]" )
            )
        />

    <!---
        Create the pattern matcher for our target text. The
        matcher will be able to loop through all the high
        ascii values found in the target string.
    --->
    <cfset LOCAL.Matcher = LOCAL.Pattern.Matcher(
        JavaCast( "string", ARGUMENTS.Text )
        ) />


    <!---
        As we clean the string, we are going to need to build
        a results string buffer into which the Matcher will
        be able to store the clean values.
    --->
    <cfset LOCAL.Buffer = CreateObject(
        "java",
        "java.lang.StringBuffer"
        ).Init() />


    <!--- Keep looping over high ascii values. --->
    <cfloop condition="LOCAL.Matcher.Find()">

        <!--- Get the matched high ascii value. --->
        <cfset LOCAL.Value = LOCAL.Matcher.Group() />

        <!--- Get the ascii value of our character. --->
        <cfset LOCAL.AsciiValue = Asc( LOCAL.Value ) />

        <!---
            Now that we have the high ascii value, we need to
            figure out what to do with it. There are explicit
            tests we can perform for our replacements. However,
            if we don't have a match, we need a default
            strategy and that will be to just store it as an
            escaped value.
        --->

        <!--- Check for Microsoft double smart quotes. --->
        <cfif (
            (LOCAL.AsciiValue EQ 8220) OR
            (LOCAL.AsciiValue EQ 8221)
            )>

            <!--- Use standard quote. --->
            <cfset LOCAL.Value = """" />

        <!--- Check for Microsoft single smart quotes. --->
        <cfelseif (
            (LOCAL.AsciiValue EQ 8216) OR
            (LOCAL.AsciiValue EQ 8217)
            )>

            <!--- Use standard quote. --->
            <cfset LOCAL.Value = "'" />

        <!--- Check for Microsoft ellipsis. --->
        <cfelseif (LOCAL.AsciiValue EQ 8230)>

            <!--- Use several periods. --->
            <cfset LOCAL.Value = "..." />

        <cfelse>

            <!---
                We didn't get any explicit matches on our
                character, so just store the escaped value.
            --->
            <cfset LOCAL.Value = "&###LOCAL.AsciiValue#;" />

        </cfif>


        <!---
            Add the cleaned high ascii character into the
            results buffer. Since we know we will only be
            working with extended values, we know that we don't
            have to worry about escaping any special characters
            in our target string.
        --->
        <cfset LOCAL.Matcher.AppendReplacement(
            LOCAL.Buffer,
            JavaCast( "string", LOCAL.Value )
            ) />

    </cfloop>

    <!---
        At this point there are no further high ascii values
        in the string. Add the rest of the target text to the
        results buffer.
    --->
    <cfset LOCAL.Matcher.AppendTail(
        LOCAL.Buffer
        ) />


    <!--- Return the resultant string. --->
    <cfreturn LOCAL.Buffer.ToString() />
</cffunction>

Here, we are checking for some very specific ascii values (all Microsoft characters), but if we cannot find an explicit match, we do our best to provide a web-safe character by returning the alternate escaped ascii value (&#ASCII;). Let's take a look at this in action:


<!---
    Set up text that has foreign characters. These foreign
    characters are in the "extended" ascii group.
--->
<cfsavecontent variable="strText">
    Bonjour. Vous êtes très mignon, et je voudrais vraiment
    votre prise en mains (ou, à s'emparer de vos fesses,
    même si je pense que peut-être trop en avant à ce moment).
</cfsavecontent>

<!--- Output the cleaned value. --->
<cfoutput>
    #CleanHighAscii( strText )#
</cfoutput>

We are passing our French text to the function and then outputting the result. Here is what the resultant HTML looks like:

Bonjour. Vous &#234;tes tr&#232;s mignon, et je voudrais vraiment votre prise en mains (ou, &#224; s'emparer de vos fesses, m&#234;me si je pense que peut-&#234;tre trop en avant &#224; ce moment).

Notice that the high ascii values (just the extended ones in our case) were replaced with their safer escaped value counterparts.
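For anyone who wants to poke at the same find / append-replacement loop outside of ColdFusion, here is a rough Java sketch of what CleanHighAscii() is doing. It is not a drop-in replacement for the UDF: the class name CleanHighAsciiSketch and the method clean() are invented for this example, and it makes two small changes that I am assuming are acceptable. It reads the full code point rather than a single char, and it runs the replacement through Matcher.quoteReplacement() so that a future replacement containing $ or \ would not be misread as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java rendering of the CleanHighAscii() idea; names are illustrative.
class CleanHighAsciiSketch {

    // Anything outside the first 128 ASCII values.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    static String clean(String text) {
        Matcher matcher = NON_ASCII.matcher(text);
        StringBuffer buffer = new StringBuffer();

        while (matcher.find()) {
            // Numeric value of the matched character (full code point).
            int codePoint = matcher.group().codePointAt(0);
            String replacement;

            switch (codePoint) {
                case 8220: case 8221:   // double smart quotes
                    replacement = "\"";
                    break;
                case 8216: case 8217:   // single smart quotes
                    replacement = "'";
                    break;
                case 8230:              // ellipsis
                    replacement = "...";
                    break;
                default:                // fall back to a numeric character reference
                    replacement = "&#" + codePoint + ";";
            }

            // quoteReplacement() keeps $ and \ in the replacement literal.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
        }

        // Copy whatever follows the last match.
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("Vous \u00EAtes tr\u00E8s mignon"));
        // prints: Vous &#234;tes tr&#232;s mignon
        System.out.println(clean("\u201CHi\u201D\u2026"));
        // prints: "Hi"...
    }
}
```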

As you find more special characters that you need to work with, you can, of course, update the CFIF / CFELSEIF statements in the function; but, until you do that, I think this provides a safer way to handle high ascii values on the web. At the very least, it's cool to see that regular expressions can make our lives better yet again.


Reader Comments

Very, very nice. Getting the range working was something I had issues with when I tried this last time - but seeing it now it looks so simple! :) I'm going to update toXML later today to include this code.

Posted by Raymond Camden on Feb 14, 2008 at 9:04 AM


@Ray,

Glad you like it :)

Posted by Ben Nadel on Feb 14, 2008 at 9:19 AM


I wrote these 2 functions for this similar problem. (not sure if they will paste correctly or not):

<samp>
<cffunction name="replaceNonAscii" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<cfreturn REReplace(arguments.argString,"[^\x00-\x7F]","","all") />
</cffunction>

<cffunction name="replaceDiacriticMarks" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<!--- Declare retString --->
<cfset var retString = arguments.argString />

<!--- Do Replaces --->
<cfset retString = REReplace(retString,"#chr(192)#|#chr(193)#|#chr(194)#|#chr(195)#|#chr(196)#|#chr(197)#|#chr(913)#|#chr(8704)#","A","all") />
<cfset retString = REReplace(retString,"#chr(198)#","AE","all") />
<cfset retString = REReplace(retString,"#chr(223)#|#chr(914)#|#chr(946)#","B","all") />
<cfset retString = REReplace(retString,"#chr(162)#|#chr(169)#|#chr(199)#|#chr(231)#|#chr(8834)#|#chr(8835)#|#chr(8836)#|#chr(8838)#|#chr(8839)#|#chr(962)#","C","all") />
<cfset retString = REReplace(retString,"#chr(208)#|#chr(272)#","D","all") />
<cfset retString = REReplace(retString,"#chr(200)#|#chr(201)#|#chr(202)#|#chr(203)#|#chr(8364)#|#chr(8707)#|#chr(8712)#|#chr(8713)#|#chr(8715)#|#chr(8721)#|#chr(917)#|#chr(926)#|#chr(931)#|#chr(949)#|#chr(958)#","E","all") />
<cfset retString = REReplace(retString,"#chr(294)#|#chr(919)#","H","all") />
<cfset retString = REReplace(retString,"#chr(204)#|#chr(205)#|#chr(206)#|#chr(207)#|#chr(8465)#|#chr(921)#","I","all") />
<cfset retString = REReplace(retString,"#chr(306)#","IJ","all") />
<cfset retString = REReplace(retString,"#chr(312)#|#chr(922)#|#chr(954)#","K","all") />
<cfset retString = REReplace(retString,"#chr(319)#|#chr(321)#|#chr(915)#","L","all") />
<cfset retString = REReplace(retString,"#chr(924)#","M","all") />
<cfset retString = REReplace(retString,"#chr(209)#|#chr(330)#|#chr(925)#","N","all") />
<cfset retString = REReplace(retString,"#chr(210)#|#chr(211)#|#chr(212)#|#chr(213)#|#chr(214)#|#chr(216)#|#chr(920)#|#chr(927)#|#chr(934)#","O","all") />
<cfset retString = REReplace(retString,"#chr(338)#","OE","all") />
<cfset retString = REReplace(retString,"#chr(174)#|#chr(8476)#","R","all") />
<cfset retString = REReplace(retString,"#chr(167)#|#chr(352)#","S","all") />
<cfset retString = REReplace(retString,"#chr(358)#|#chr(932)#","T","all") />
<cfset retString = REReplace(retString,"#chr(217)#|#chr(218)#|#chr(219)#|#chr(220)#","U","all") />
<cfset retString = REReplace(retString,"#chr(935)#|#chr(967)#","X","all") />
<cfset retString = REReplace(retString,"#chr(165)#|#chr(221)#|#chr(376)#|#chr(933)#|#chr(936)#|#chr(947)#|#chr(978)#","Y","all") />
<cfset retString = REReplace(retString,"#chr(918)#|#chr(950)#","Z","all") />
<cfset retString = REReplace(retString,"#chr(170)#|#chr(224)#|#chr(225)#|#chr(226)#|#chr(227)#|#chr(228)#|#chr(229)#|#chr(945)#","a","all") />
<cfset retString = REReplace(retString,"#chr(230)#","ae","all") />
<cfset retString = REReplace(retString,"#chr(273)#|#chr(8706)#|#chr(948)#","d","all") />
<cfset retString = REReplace(retString,"#chr(232)#|#chr(233)#|#chr(234)#|#chr(235)#","e","all") />
<cfset retString = REReplace(retString,"#chr(402)#|#chr(8747)#","f","all") />
<cfset retString = REReplace(retString,"#chr(295)#","h","all") />
<cfset retString = REReplace(retString,"#chr(236)#|#chr(237)#|#chr(238)#|#chr(239)#|#chr(305)#|#chr(953)#","i","all") />
<cfset retString = REReplace(retString,"#chr(307)#","ij","all") />
<cfset retString = REReplace(retString,"#chr(320)#|#chr(322)#","l","all") />
<cfset retString = REReplace(retString,"#chr(241)#|#chr(329)#|#chr(331)#|#chr(951)#","n","all") />
<cfset retString = REReplace(retString,"#chr(240)#|#chr(242)#|#chr(243)#|#chr(244)#|#chr(245)#|#chr(246)#|#chr(248)#|#chr(959)#","o","all") />
<cfset retString = REReplace(retString,"#chr(339)#","oe","all") />
<cfset retString = REReplace(retString,"#chr(222)#|#chr(254)#|#chr(8472)#|#chr(929)#|#chr(961)#","p","all") />
<cfset retString = REReplace(retString,"#chr(353)#|#chr(383)#","s","all") />
<cfset retString = REReplace(retString,"#chr(359)#|#chr(964)#","t","all") />
<cfset retString = REReplace(retString,"#chr(181)#|#chr(249)#|#chr(250)#|#chr(251)#|#chr(252)#|#chr(956)#|#chr(965)#","u","all") />
<cfset retString = REReplace(retString,"#chr(957)#","v","all") />
<cfset retString = REReplace(retString,"#chr(969)#","w","all") />
<cfset retString = REReplace(retString,"#chr(215)#|#chr(8855)#","x","all") />
<cfset retString = REReplace(retString,"#chr(253)#|#chr(255)#","y","all") />
<!--- ' --->
<cfset retString = REReplace(retString,"#chr(180)#|#chr(8242)#|#chr(8216)#|#chr(8217)#","#chr(39)#","all") />
<!--- " --->
<cfset retString = REReplace(retString,"#chr(168)#|#chr(8220)#|#chr(8221)#|#chr(8222)#|#chr(8243)#","#chr(34)#","all") />

<cfreturn retString />
</cffunction>
</samp>

You can call one after the other. I usually call the Trim() function as well. The replaceDiacriticMarks() function replaces characters with their "similar" standard ascii values, so è gets turned into e.

Posted by Jeff on Feb 14, 2008 at 9:29 AM
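As a side note to the diacritic table above: on Java 6 and later, java.text.Normalizer can do a good share of this work by decomposing each accented letter into a base letter plus combining marks, which can then be stripped with a single regex. The sketch below is only an illustration of that idea (the class name DiacriticStripSketch and the method stripDiacritics() are hypothetical), and it is not a full replacement for the table. Letters that have no decomposition, such as æ, ø or ß, pass through unchanged, and the Greek and math look-alikes in the table are not touched at all.

```java
import java.text.Normalizer;

// Hypothetical helper; illustrates Unicode decomposition, not Jeff's exact mapping.
class DiacriticStripSketch {

    static String stripDiacritics(String text) {
        // NFD splits "e with grave" into "e" followed by a combining grave accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("tr\u00E8s na\u00EFve")); // prints: tres naive
    }
}
```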


@Jeff,

Looking pretty cool. I like that you get the "like looking" letters for foreign characters. I can really see where that would be good to have on hand.

Posted by Ben Nadel on Feb 14, 2008 at 9:36 AM


@Ben,

I still think it's bad to be doing *that* many replaces, but for my needs it works (seems to perform well/quickly).

The replaceNonAscii() just gets rid of the chars completely, which is probably worse than what you're doing...

I usually just call this function (which uses the 2 other functions I pasted in):

<cffunction name="safeForXml" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<cfset var retString = arguments.argString />
<cfset retString = replaceDiacriticMarks(retString) />
<cfset retString = replaceNonAscii(retString) />
<cfset retString = Trim(retString) />
<cfreturn retString />
</cffunction>

Posted by Jeff on Feb 14, 2008 at 10:00 AM


@Jeff,

That sounds fair to me. That's pretty cool that you already knew how to use the HEX values in the regex. I think that's way awesome. I love regular expressions.

I think the only real difference that mine has is that it has a default case that replaces any high ascii values that were not accounted for. Other than that, I think we are headed in the same direction.

Posted by Ben Nadel on Feb 14, 2008 at 10:10 AM


Rather than doing all of those cfsets and replaces, could you perhaps return an array of matched high ascii characters (just the numeric values), then loop over that list and replace those values with matching values from another array of replacements?

e.g.
<cfloop array="#matchingHighAscii#" index="value">
<cfset myString = Replace(myString, "&##" & value & ";", lowAscii[value], "ALL")>
</cfloop>

Posted by Gareth on Feb 14, 2008 at 12:59 PM
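One way to picture the lookup-table idea from the comment above is a single pass over the string with a map of replacements, instead of one replace call per character. This is a loose Java sketch under my own assumptions: the class name ReplacementTableSketch, the REPLACEMENTS map and its handful of entries are all made up for the example, and anything missing from the map falls back to a numeric character reference as in the original post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven cleaner; the entries are examples, not a complete mapping.
class ReplacementTableSketch {

    static final Map<Integer, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put(0x201C, "\"");  // left double smart quote
        REPLACEMENTS.put(0x201D, "\"");  // right double smart quote
        REPLACEMENTS.put(0x2018, "'");   // left single smart quote
        REPLACEMENTS.put(0x2019, "'");   // right single smart quote
        REPLACEMENTS.put(0x2026, "..."); // ellipsis
        REPLACEMENTS.put(0x00E8, "e");   // e with grave
    }

    static String clean(String text) {
        StringBuilder out = new StringBuilder();

        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);

            if (codePoint <= 0x7F) {
                // Plain ASCII passes straight through.
                out.appendCodePoint(codePoint);
            } else {
                // Use the table if we can, otherwise escape the value.
                String replacement = REPLACEMENTS.get(codePoint);
                out.append(replacement != null ? replacement : "&#" + codePoint + ";");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("tr\u00E8s \u2026")); // prints: tres ...
        System.out.println(clean("caf\u00E9"));        // prints: caf&#233;
    }
}
```

Adding a new special character then means adding one map entry rather than another CFELSEIF branch or REReplace() call.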


@Gareth,
That would definitely be much cleaner, and easier to read/add to. Thanks.

Posted by Jeff on Feb 14, 2008 at 2:05 PM


Wow, very timely entry, Ben! Yesterday I discovered a need to convert extended ASCII characters into their HTML encoded entities... and then I saw this post in my feed reader. Thanks. This is perfect.

Posted by Richard Davies on Feb 15, 2008 at 11:57 AM


@Richard,

Glad to be of help. Be sure to read the comments as well as some other very good ideas were presented.

Posted by Ben Nadel on Feb 15, 2008 at 12:09 PM


Thanks guys for all your comments. I have been working at a solution of my own, inspired by some of the solutions here. I specifically wanted to convert all high characters with a value greater than 127 into XML safe hexadecimal characters. I came up with a small solution that doesn't rely on any external objects yet was based on a regular expression.

It can be found at my blog ..
http://ipggi.wordpress.com/2008/03/11/remove-or-clean-high-extended-ascii-characters-in-coldfusion-for-xml-safeness/

Posted by Ben Garrett on Mar 11, 2008 at 5:52 PM


@Jeff,

You saved me countless hours of heartache tonight. Thanks

DB

Posted by david buhler on Mar 14, 2008 at 2:49 AM


Handling unicode bytes is unfortunately a little more complicated than stripping "bad" characters.

The issue is that, in some multi-byte encodings (Shift_JIS is one example), a character may start with a byte outside of the 0x00-0x7F range but then continue with a byte which is within this range. So when you strip the "bad" bytes, you're snagging only the first byte out of a multi-byte character. It can lead to garbage characters in your output.

You're also left completely without the ability to handle many foreign languages, your code can only handle latin characters, and that's just too restrictive for many folks.

UTF-8 is the most common international multi-byte character encoding since it's incredibly comprehensive (UTF-16 covers the same Unicode character set, just with a different byte layout, so it buys you nothing extra in coverage).

You'll discover that putting <cfprocessingdirective pageencoding="utf-8"> in the first 4096 bytes of each template will cause CF and all of CF's string functions to correctly handle "high-ascii" characters. You will also need to alert the browser (or other consumer of your data) what the character set is. There are various methods to do this, depending on what type of content you're sending. For most types of content, a simple way is to use <cfcontent type="(content-type such as text/html); charset=utf-8">. There are also HTML head entries which can specify the character encoding, and an XML processing instruction for this purpose too (<?xml version="1.0" encoding="utf-8" ?>) which would cover your RSS feeds.

Man, I worked long and hard on getting regular expressions to sanitize my data, and the outcome of that is that it's virtually impossible with regular expressions alone. You really need a string system which treats multi-byte characters as a single character, and the way to do that in ColdFusion is with the aforementioned processing directive.

Posted by Eric on Mar 19, 2008 at 11:31 AM


Eric, it seems like you are saying that simply adding the cfcontent, and the charset in the <xml> tag, would solve _everything_. Is that true? If so - that doesn't seem to be what I'm seeing locally. I know BlogCFC does this (although it uses cfprocessingdirective instead of specifying it on cfcontent) and folks still have issues with the 'high ascii'.

Posted by Raymond Camden on Mar 19, 2008 at 11:51 AM


You have to both do a <cfprocessingdirective> (this tells ColdFusion what character set it's working with), and also inform the browser (or other client) what character set you're sending them (since once you set the character encoding with <cfprocessingdirective>, it will send characters to the client in that encoding for any output initiated from that template). Both sides have to know what the encoding is, and have to agree.

Also remember that <cfprocessingdirective> has to appear at the top of every template which may be doing any work with multi-byte strings (whether with string-related functions in ColdFusion, or outputting multi-byte strings to the browser). By multi-byte strings, I mean anything dealing with characters over 0x7F (character 128 and beyond, right well past character 256). Notably for UTF-8, character 128 uses two bytes, but does not have to be represented with the HTML entity &#128;.

For more information on how to inform various clients which character encoding you're sending them, see this W3C article: http://www.w3.org/International/O-charset

I usually stick to pure UTF-8, it has never been insufficient for my needs. Ideally you'd want this to be some sort of application configuration preference, but in reality that may be unrealistic to your code, and as I mentioned earlier, UTF-8 will cover 99.9% of what anyone in the world would want to do; those who want to do something which requires an even larger character set typically already know how to do so.

So a little background.
The way that UTF-8 works is that the high-bit of the first byte in a given character indicates whether it is a member of a multi-byte character.

In binary, single characters may be made up of the following byte sequences:
0zzzzzzz
110yyyyy 10zzzzzz
1110xxxx 10yyyyyy 10zzzzzz
11110www 10xxxxxx 10yyyyyy 10zzzzzz
where of course the Z's are the low-order bits, Y's the next highest order, X's the next, and W's the highest (well, the bit order can be changed, but I won't get into that for now, see http://unicode.org/faq/utf_bom.html#BOM for more information). Bytes whose two highest bits are 10 are members of a multi-byte string (this enables us to detect if some form of truncation has left us with only a partial character).

UTF-8 string parsers examine a byte to look for the number of bytes to consume to act as a single character. Even though these characters are made up of multiple bytes, they are treated as a single character.

If you are dealing with UTF-8 strings, and you fail to tell ColdFusion that you are doing so with this processing instruction, then it will treat a multi-byte character as a single byte character, which is where string replacements and regular expressions and the like can start creating invalid data. In particular, HTMLEditFormat will break a multi-byte character if you don't have the correct processing instruction set. It will potentially encode individual bytes of a multi-byte character into separate entities (which then get treated as individual characters).

HTML as usual is incredibly forgiving about such things (though the characters may still look like garbage, at least it doesn't explicitly error). XML parsers tend to be incredibly unforgiving about such things, which could explain why you're seeing this when dealing with RSS.

If you correctly implement character encoding in this way, you'll find you don't have to perform black magic with trying to convert characters, ColdFusion (or more specifically Java's string implementation under the hood) will automagically handle all of this for you.

Posted by Eric on Mar 19, 2008 at 12:44 PM
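The four UTF-8 byte layouts listed in the comment above can be checked by hand. Here is a small Java sketch that packs a code point into those layouts and prints the bits, so the result can be compared with what the JDK's own UTF-8 encoder produces. The class name Utf8LayoutSketch and its two helpers are invented for this illustration, and the sketch assumes a valid code point (it does not reject surrogate values the way a real encoder would).

```java
// Hypothetical hand-rolled encoder, for illustration only; assumes a valid code point.
class Utf8LayoutSketch {

    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0zzzzzzz
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 110yyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10yyyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Renders each byte as eight binary digits, separated by spaces.
    static String toBinary(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (byte b : bytes) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int i = bits.length(); i < 8; i++) {
                out.append('0');
            }
            out.append(bits);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary(encode(0x41)));   // prints: 01000001
        System.out.println(toBinary(encode(0xE8)));   // prints: 11000011 10101000
        System.out.println(toBinary(encode(0x20AC))); // prints: 11100010 10000010 10101100
    }
}
```

Every byte after the first starts with the bits 10, so in UTF-8 none of the bytes of a multi-byte character ever lands in the 0x00-0x7F range.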


Interesting.

So I do NOT want to hijack this into a BlogCFC thing, but if someone on this thread uses 'high chars' a lot and would like to help me test, let me know. It sounds like all I need is a <cfcontent> tag.

Also, Eric, lets say you are building a feed and have no idea what the data is. You could have Chinese chars. You could have funky MS word chars.

Would you recommend this technique to cover all the bases?

It _sounds_ like we may have a perfect solution here, and if so, I need to quickly blog it and take credit.

Posted by Raymond Camden on Mar 19, 2008 at 12:58 PM


When you are accepting character data from a remote source, you must know what the character encoding is (well, you have to know what it is for your own data too, but fortunately you get to control that). A valid remote source will specify what the encoding is either with a content-type, or via an embedded mechanism as defined by that type of data.

Note that I haven't had perfect success with <cfcontent> with specifying character encoding (in particular, this encoding is likely to get lost if the content is separated from the HTTP headers, such as occurs when you save a HTML file), I strongly suggest you also specify the character encoding with some meta data native to the format you're producing. EG, for XML, <?xml version="1.0" encoding="utf-8" ?>, for HTML, <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >.

Most of the time encoding will be detected automatically for you. By the time the string gets into your hands as a ColdFusion developer, this will have been resolved for you. For example, when the browser submits form data, it gets to specify what encoding it's submitting the data as. ColdFusion will decode this automatically, and effectively (though not really) it treats each character as if it was a 32 bit character. This is transparent to you, and characters are not typically represented in memory as byte arrays, but character arrays (that is to say, memory representation is agnostic to the encoding, and it is converted into an encoding upon output). In reality, that's not actually the case, but the standard specifies that it has to at least behave as if it is.

So the question is, I'm getting data from somewhere, and it might be in almost any encoding, how do I handle this? The answer is that if you're using standard libraries (like ColdFusion's XML parser), it's probably handled for you, and you can ignore it. If you're dealing with raw data (such as if you read a file off the disk which doesn't contain any character encoding hints), it may be more complicated than that.

In a circumstance where metadata is missing about what the character encoding is, it is not necessarily possible to reconstruct the original string correctly. There are really fabulous libraries in the C world which handle this. I'm thinking of iconv and the like. They can guess the character encoding by seeing if the bytes match a specific encoding, but the problem is that it is sometimes possible for a string to be valid in more than one encoding (encoding only talks about how do we represent in 8-bit units character data which cannot natively fit in those 8-bit units). iconv is really good, but has been known to make mistakes, which is why if you have some way to definitively determine the character encoding, you should use it instead (plus this is faster than the tests which are necessary to guess encoding). I don't know if there is a Java equivalent for iconv.

Notably, it is an error for someone to provide you multi-byte data without telling you what encoding they are sending it to you in. The same data can be represented in a variety of encodings, and the actual bytes will be different, but the decoded value will be the same.

In the Java world, you can switch through encodings on the fly as you instantiate various objects which might need to output in various encodings, in ColdFusion, you're locked per-page (and you can't set it dynamically, this is a compile-time directive).

By the way, here is a short and simple example script showing that you can have high-value characters:
<cfprocessingdirective pageencoding="utf-8">
<cfcontent type='text/html; charset=utf-8'>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
<style type="text/css">
.c {
display: block;
width: 60px;
float: left;
border: 1px solid #DDDDDD;
}
.c b {
margin-right: 5px; border-left: 1px dotted #DDDDDD;
}
</style>
</head>
<body>
<cfoutput>
<cfloop from="1" to="2048" index="x">
<div class="c"><b>#x#</b>#chr(x)#</div>
</cfloop>
</cfoutput>
</body>
</html>

Replace all 3 instances of utf-8 with ascii, and see what difference it makes (CF will represent characters which don't fit in the ascii character set as a question mark ? ). Note that I'm not using HTML entities here, I'm just using straight up characters with character values greater than 255.

Also try iso-8859-1 and utf-16, save its output and open it with a hex editor to compare the different encodings for the same data; the bytes will be different, but it will show the same in the browser (as long as it's a character set which supports all the displayed characters). Notice that in UTF-8, when you exceed character 127, the character output starts taking up 2 bytes. (Note, you might not be able to trust your browser to save the byte stream the same way it received it, it might translate it to its preferred encoding before saving, you may have to use a command line tool like netcat or wget).

Posted by Eric on Mar 19, 2008 at 1:26 PM
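The hex-editor experiment described in the comment above can also be tried without a browser. The following Java sketch encodes one short string in several character sets and reports how many bytes each produces. The class name EncodingSizeSketch and the helper byteCount() are hypothetical; the charsets are the standard ones from java.nio.charset.StandardCharsets. I have used UTF-16BE rather than plain UTF-16 so that the count is not padded by a byte order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: same characters, different bytes depending on the encoding.
class EncodingSizeSketch {

    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "tr\u00E8s"; // four characters, one of them above 127

        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1)); // prints 4
        System.out.println(byteCount(text, StandardCharsets.UTF_8));      // prints 5
        System.out.println(byteCount(text, StandardCharsets.UTF_16BE));   // prints 8

        // ASCII has no slot for the accented letter, so the encoder substitutes "?".
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints tr?s
    }
}
```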


It may help to pay attention to the "encoding" portion of "character encoding," that is to say, it's like <cfwddx>. You can represent an object in text which is not natively text. Likewise, you can represent a string in 8-bit which uses more distinct characters than 8-bit natively supports.

So you use <cfwddx> to encode your complex data type into text. You use encoding to encode thousands of distinct characters using only 256 distinct bytes. It's encoded. You don't modify it in its encoded form, you modify it in its decoded form, and when you're done, you re-encode it so you can send it across a limited medium.

Once decoded, strings in memory do not remember that they were once UTF-8 or ISO-8859-1. You tell them to become this again when you're ready to transmit them. You tell ColdFusion how to output these strings in a way the browser (or other client) will like with <cfprocessingdirective pageencoding="utf-8"> (which I believe also controls its default handling for data streams with no native encoding specification).

I hope I haven't strayed too far off-topic or posted too many really lengthy diatribes. My coworkers will see this, and know it's my writing even though it doesn't have my last name associated with it (they'd probably know it even if it had no name associated with it).

Posted by Eric on Mar 19, 2008 at 1:35 PM
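The encode / decode point in the comment above is easy to demonstrate: the same bytes turn into different characters depending on which encoding the reader assumes. Below is a minimal Java sketch of that mismatch. The class name DecodeMismatchSketch and the helper decodeAs() are made up for the example; it shows UTF-8 bytes being read back once correctly and once as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of decoding the same bytes with the right and the wrong charset.
class DecodeMismatchSketch {

    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "tr\u00E8s";

        // Encode once, as UTF-8. The accented letter becomes two bytes.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original text.
        String right = decodeAs(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(right.equals(original)); // prints true

        // Decoding as ISO-8859-1 treats each byte as its own character,
        // so the one accented letter comes back as two unrelated characters.
        String wrong = decodeAs(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // prints 5
    }
}
```

Running a high ascii cleaner over the wrongly decoded string would escape two characters instead of one, which is the kind of damage described earlier in this thread.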


Well, I know this is old-hat by this point, but I'm finding that a lot of developers don't know much or anything about the difference between a byte and a character, or the difference between Unicode, and encoding that Unicode (such as with UTF-8).

I thought I'd include a link to a recent blog I did about this: http://www.bandeblog.com/2008/05/unicode-absolute-minimum-every.html

This post is just about what Unicode is, and what character encodings mean. I have a follow-up scheduled to this post for tomorrow which talks about UTF-8 specifically.

If you're a developer, especially a web developer, this is essential knowledge, and I'll be happy to try to answer any questions you might have.

Posted by Eric on May 2, 2008 at 11:44 AM


@Eric,

I will check this out. I know that the concept of encoding and different character bytes definitely confuses me. I never learned it, and have never had to learn it too much. I will take a look at your post. Thanks.

Posted by Ben Nadel on May 7, 2008 at 3:32 PM


@Eric

I'm trying to make a demo of your technique for my preso tomorrow but I'm having trouble. Your test script works well. But this sample does not. I don't see ? marks, but other odd marks instead of the proper foreign characters. What am I missing?

<cfprocessingdirective pageencoding="utf-8">
<cfsavecontent variable="strText">
<?xml version="1.0" encoding="utf-8"?>
<text>
Bonjour. Vous êtes très mignon, et je voudrais vraiment
votre prise en mains (ou, à s'emparer de vos fesses,
même si je pense que peut-être trop en avant à ce moment).
</text>
</cfsavecontent>
<cfset strText = trim(strText)>
<cfcontent type="text/xml; charset=utf-8" reset="true"><cfoutput>#strText#</cfoutput>

Posted by Raymond Camden on Jun 17, 2008 at 1:42 PM


@Ray: I'll email you directly since you're up against a deadline and that's probably faster than going back and forth here. We can post a digest here once we get it settled. Just wanted to let you know to look in your email (gmail) in case you missed it.

Posted by Eric on Jun 17, 2008 at 2:14 PM

