Ben Nadel at CFUNITED 2010 (Landsdown, VA) with: Darrell Rapier

Cleaning High Ascii Values For Web Safeness In ColdFusion


On Tuesday, Ray Camden came to speak at the New York ColdFusion User Group about RSS feed parsing and creation. During the presentation, he talked about RSS breaking when ColdFusion hits a high ascii value that it doesn't recognize (it apparently just skips the character, which breaks the XML). This got me thinking about how to remove high ascii values from a string, or at least how to clean them to be more web safe. I think we have all done this before: we have to clean FORM submission data to make sure someone didn't copy and paste from Microsoft Word and bring along some crazy "smart" characters. In the past, I have done this with lots of Replace() calls on the FORM values.

But, Tuesday night, when Ray was speaking, I started, as I often do, to think about regular expressions. Regular expressions have brought so much joy and happiness into my life, I wondered if maybe they could help me with this problem as well. And so, I hopped over to the Java 2 Pattern class documentation to see what it could handle. Immediately, I was quite pleased to see that it had a way to match patterns based on the hexadecimal value of characters:

\xhh - The character with hexadecimal value 0xhh

Since we all know about the ASCII table and how to convert decimal values to hexadecimal values (or at least how to look it up), we can easily put together a regular expression pattern to match high ascii values. The only super, ultra safe characters are the first 128 characters (ascii values 0 - 127); these are HEX values 00 to 7F. Taking that information, we can now build a pattern that matches characters that are NOT in that ascii value range:

[^\x00-\x7F]

With that pattern, we are going to have access not only to the high ascii values we know exist (such as the Microsoft Smart Quotes), but also to all the unexpected high ascii values that people enter with their data. This means that we are not going to let anything slip through the cracks.
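As a quick sanity check of that pattern, here is a minimal sketch that swaps every matched character for a question mark. It assumes Java's String.replaceAll() method, which uses the same java.util.regex syntax as the Pattern class; the variable names and the sample text are made up for illustration:

```cfml
<cfscript>
	// Hypothetical sample text: a left smart quote, "caf", an accented "e", a right smart quote.
	sampleText = chr( 8220 ) & "caf" & chr( 233 ) & chr( 8221 );

	// Replace every character outside the 0 - 127 range with a question mark.
	flagged = javaCast( "string", sampleText ).replaceAll( "[^\x00-\x7F]", "?" );

	writeOutput( flagged ); // Outputs: ?caf??
</cfscript>
```

Everything in the 0 - 127 range passes through untouched, while both smart quotes and the accented "e" get flagged.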

To encapsulate this functionality, I have created a ColdFusion user defined function, CleanHighAscii():

<cffunction
	name="CleanHighAscii"
	access="public"
	returntype="string"
	output="false"
	hint="Cleans extended ascii values to make them as web safe as possible.">

	<!--- Define arguments. --->
	<cfargument
		name="Text"
		type="string"
		required="true"
		hint="The string that we are going to be cleaning."
		/>

	<!--- Set up local scope. --->
	<cfset var LOCAL = {} />

	<!---
		When cleaning the string, there are going to be ascii
		values that we want to target, but there are also going
		to be high ascii values that we don't expect. Therefore,
		we have to create a pattern that simply matches all non
		low-ASCII characters. This will find all characters that
		are NOT in the first 128 ascii values (0 - 127). To do this, we
		are using the 2-digit hex encoding of values.
	--->
	<cfset LOCAL.Pattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			JavaCast( "string", "[^\x00-\x7F]" )
			)
		/>

	<!---
		Create the pattern matcher for our target text. The
		matcher will be able to loop through all the high
		ascii values found in the target string.
	--->
	<cfset LOCAL.Matcher = LOCAL.Pattern.Matcher(
		JavaCast( "string", ARGUMENTS.Text )
		) />


	<!---
		As we clean the string, we are going to need to build
		a results string buffer into which the Matcher will
		be able to store the clean values.
	--->
	<cfset LOCAL.Buffer = CreateObject(
		"java",
		"java.lang.StringBuffer"
		).Init() />


	<!--- Keep looping over high ascii values. --->
	<cfloop condition="LOCAL.Matcher.Find()">

		<!--- Get the matched high ascii value. --->
		<cfset LOCAL.Value = LOCAL.Matcher.Group() />

		<!--- Get the ascii value of our character. --->
		<cfset LOCAL.AsciiValue = Asc( LOCAL.Value ) />

		<!---
			Now that we have the high ascii value, we need to
			figure out what to do with it. There are explicit
			tests we can perform for our replacements. However,
			if we don't have a match, we need a default
			strategy and that will be to just store it as an
			escaped value.
		--->

		<!--- Check for Microsoft double smart quotes. --->
		<cfif (
			(LOCAL.AsciiValue EQ 8220) OR
			(LOCAL.AsciiValue EQ 8221)
			)>

			<!--- Use standard quote. --->
			<cfset LOCAL.Value = """" />

		<!--- Check for Microsoft single smart quotes. --->
		<cfelseif (
			(LOCAL.AsciiValue EQ 8216) OR
			(LOCAL.AsciiValue EQ 8217)
			)>

			<!--- Use standard quote. --->
			<cfset LOCAL.Value = "'" />

		<!--- Check for Microsoft ellipsis. --->
		<cfelseif (LOCAL.AsciiValue EQ 8230)>

			<!--- Use several periods. --->
			<cfset LOCAL.Value = "..." />

		<cfelse>

			<!---
				We didn't get any explicit matches on our
				character, so just store the escaped value.
			--->
			<cfset LOCAL.Value = "&###LOCAL.AsciiValue#;" />

		</cfif>


		<!---
			Add the cleaned high ascii character into the
			results buffer. AppendReplacement() treats "$" and
			"\" as special characters in the replacement string,
			but none of our replacement values contain them, so
			we don't have to worry about escaping anything.
		--->
		<cfset LOCAL.Matcher.AppendReplacement(
			LOCAL.Buffer,
			JavaCast( "string", LOCAL.Value )
			) />

	</cfloop>

	<!---
		At this point there are no further high ascii values
		in the string. Add the rest of the target text to the
		results buffer.
	--->
	<cfset LOCAL.Matcher.AppendTail(
		LOCAL.Buffer
		) />


	<!--- Return the resultant string. --->
	<cfreturn LOCAL.Buffer.ToString() />
</cffunction>

Here, we are checking for some very specific ascii values (all Microsoft characters), but if we cannot find an explicit match, we do our best to provide a web-safe character by returning the escaped numeric value (&#ASCII;). Let's take a look at this in action:

<!---
	Set up text that has foreign characters. These foreign
	characters are in the "extended" ascii group.
--->
<cfsavecontent variable="strText">
	Bonjour. Vous êtes très mignon, et je voudrais vraiment
	votre prise en mains (ou, à s'emparer de vos fesses,
	même si je pense que peut-être trop en avant à ce moment).
</cfsavecontent>

<!--- Output the cleaned value. --->
#CleanHighAscii( strText )#

We are passing our French text to the function and then outputting the result. Here is what the resultant HTML looks like:

Bonjour. Vous &#234;tes tr&#232;s mignon, et je voudrais vraiment votre prise en mains (ou, &#224; s'emparer de vos fesses, m&#234;me si je pense que peut-&#234;tre trop en avant &#224; ce moment).

Notice that the high ascii values (just the extended ones in our case) were replaced with their safer escaped value counterparts.

As you find more special characters that you need to work with, you can, of course, update the CFIF / CFELSEIF statements in the function; but, until you do that, I think this provides a safer way to handle high ascii values on the web. At the very least, it's cool to see that regular expressions can make our lives better yet again.


Reader Comments

354 Comments

Very, very nice. Getting the range working was something I had issues with when I tried this last time - but seeing it now it looks so simple! :) I'm going to update toXML later today to include this code.

3 Comments

I wrote these 2 functions for this similar problem. (not sure if they will paste correctly or not):

<samp>
<cffunction name="replaceNonAscii" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<cfreturn REReplace(arguments.argString,"[^\0-\x7F]","","all") />
</cffunction>

<cffunction name="replaceDiacriticMarks" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<!--- Declare retString --->
<cfset var retString = arguments.argString />

<!--- Do Replaces --->
<cfset retString = REReplace(retString,"#chr(192)#|#chr(193)#|#chr(194)#|#chr(195)#|#chr(196)#|#chr(197)#|#chr(913)#|#chr(8704)#","A","all") />
<cfset retString = REReplace(retString,"#chr(198)#","AE","all") />
<cfset retString = REReplace(retString,"#chr(223)#|#chr(914)#|#chr(946)#","B","all") />
<cfset retString = REReplace(retString,"#chr(162)#|#chr(169)#|#chr(199)#|#chr(231)#|#chr(8834)#|#chr(8835)#|#chr(8836)#|#chr(8838)#|#chr(8839)#|#chr(962)#","C","all") />
<cfset retString = REReplace(retString,"#chr(208)#|#chr(272)#","D","all") />
<cfset retString = REReplace(retString,"#chr(200)#|#chr(201)#|#chr(202)#|#chr(203)#|#chr(8364)#|#chr(8707)#|#chr(8712)#|#chr(8713)#|#chr(8715)#|#chr(8721)#|#chr(917)#|#chr(926)#|#chr(931)#|#chr(949)#|#chr(958)#","E","all") />
<cfset retString = REReplace(retString,"#chr(294)#|#chr(919)#","H","all") />
<cfset retString = REReplace(retString,"#chr(204)#|#chr(205)#|#chr(206)#|#chr(207)#|#chr(8465)#|#chr(921)#","I","all") />
<cfset retString = REReplace(retString,"#chr(306)#","IJ","all") />
<cfset retString = REReplace(retString,"#chr(312)#|#chr(922)#|#chr(954)#","K","all") />
<cfset retString = REReplace(retString,"#chr(319)#|#chr(321)#|#chr(915)#","L","all") />
<cfset retString = REReplace(retString,"#chr(924)#","M","all") />
<cfset retString = REReplace(retString,"#chr(209)#|#chr(330)#|#chr(925)#","N","all") />
<cfset retString = REReplace(retString,"#chr(210)#|#chr(211)#|#chr(212)#|#chr(213)#|#chr(214)#|#chr(216)#|#chr(920)#|#chr(927)#|#chr(934)#","O","all") />
<cfset retString = REReplace(retString,"#chr(338)#","OE","all") />
<cfset retString = REReplace(retString,"#chr(174)#|#chr(8476)#","R","all") />
<cfset retString = REReplace(retString,"#chr(167)#|#chr(352)#","S","all") />
<cfset retString = REReplace(retString,"#chr(358)#|#chr(932)#","T","all") />
<cfset retString = REReplace(retString,"#chr(217)#|#chr(218)#|#chr(219)#|#chr(220)#","U","all") />
<cfset retString = REReplace(retString,"#chr(935)#|#chr(967)#","X","all") />
<cfset retString = REReplace(retString,"#chr(165)#|#chr(221)#|#chr(376)#|#chr(933)#|#chr(936)#|#chr(947)#|#chr(978)#","Y","all") />
<cfset retString = REReplace(retString,"#chr(918)#|#chr(950)#","Z","all") />
<cfset retString = REReplace(retString,"#chr(170)#|#chr(224)#|#chr(225)#|#chr(226)#|#chr(227)#|#chr(228)#|#chr(229)#|#chr(945)#","a","all") />
<cfset retString = REReplace(retString,"#chr(230)#","ae","all") />
<cfset retString = REReplace(retString,"#chr(273)#|#chr(8706)#|#chr(948)#","d","all") />
<cfset retString = REReplace(retString,"#chr(232)#|#chr(233)#|#chr(234)#|#chr(235)#","e","all") />
<cfset retString = REReplace(retString,"#chr(402)#|#chr(8747)#","f","all") />
<cfset retString = REReplace(retString,"#chr(295)#","h","all") />
<cfset retString = REReplace(retString,"#chr(236)#|#chr(237)#|#chr(238)#|#chr(239)#|#chr(305)#|#chr(953)#","i","all") />
<cfset retString = REReplace(retString,"#chr(307)#","j","all") />
<cfset retString = REReplace(retString,"#chr(320)#|#chr(322)#","l","all") />
<cfset retString = REReplace(retString,"#chr(241)#|#chr(329)#|#chr(331)#|#chr(951)#","n","all") />
<cfset retString = REReplace(retString,"#chr(240)#|#chr(242)#|#chr(243)#|#chr(244)#|#chr(245)#|#chr(246)#|#chr(248)#|#chr(959)#","o","all") />
<cfset retString = REReplace(retString,"#chr(339)#","oe","all") />
<cfset retString = REReplace(retString,"#chr(222)#|#chr(254)#|#chr(8472)#|#chr(929)#|#chr(961)#","p","all") />
<cfset retString = REReplace(retString,"#chr(353)#|#chr(383)#","s","all") />
<cfset retString = REReplace(retString,"#chr(359)#|#chr(964)#","t","all") />
<cfset retString = REReplace(retString,"#chr(181)#|#chr(249)#|#chr(250)#|#chr(251)#|#chr(252)#|#chr(956)#|#chr(965)#","u","all") />
<cfset retString = REReplace(retString,"#chr(957)#","v","all") />
<cfset retString = REReplace(retString,"#chr(969)#","w","all") />
<cfset retString = REReplace(retString,"#chr(215)#|#chr(8855)#","x","all") />
<cfset retString = REReplace(retString,"#chr(253)#|#chr(255)#","y","all") />
<!--- ' --->
<cfset retString = REReplace(retString,"#chr(180)#|#chr(8242)#|#chr(8216)#|#chr(8217)#","#chr(39)#","all") />
<!--- " --->
<cfset retString = REReplace(retString,"#chr(168)#|#chr(8220)#|#chr(8221)#|#chr(8222)#|#chr(8243)#","#chr(34)#","all") />

<cfreturn retString />
</cffunction>
</samp>

You can call one after the other. I usually call the Trim() function as well. The replaceDiacriticMarks() function replaces characters with their "similar" standard ascii values, so è gets turned into e.

15,674 Comments

@Jeff,

Looking pretty cool. I like that you get the "like looking" letters for foreign characters. I can really see where that would be good to have on hand.

3 Comments

@Ben,

I still think it's bad to be doing *that* many replaces, but for my needs it works (seems to perform well/quickly).

The replaceNonAscii() function just gets rid of the chars completely, which is probably worse than what you're doing...

I usually just call this function (which uses the 2 other functions I pasted in):

<cffunction name="safeForXml" returntype="string" output="false">
<cfargument name="argString" type="string" default="" />
<cfset var retString = arguments.argString />
<cfset retString = replaceDiacriticMarks(retString) />
<cfset retString = replaceNonAscii(retString) />
<cfset retString = Trim(retString) />
<cfreturn retString />
</cffunction>

15,674 Comments

@Jeff,

That sounds fair to me. That's pretty cool that you already knew how to use the HEX values in the regex. I think that's way awesome. I love regular expressions.

I think the only real difference that mine has is that it has a default case that replaces any high ascii values that were not accounted for. Other than that, I think we are headed in the same direction.

111 Comments

Rather than doing all of those cfsets and replaces, could you perhaps return an array of matched high ascii characters (just the numeric values), then loop over that list and replace those values with matching values from another array of replacements?

e.g.
<cfloop array="#matchingHighAscii#" index="value">
<cfset myString = Replace(myString, "&#" & value & ";", lowAscii[value], "ALL")>
</cfloop>
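One way to read that suggestion, sketched under assumptions: keep the replacements in a struct keyed by character code, and fall back to the numeric entity when there is no entry. The function name lookupReplacement() and the table contents are hypothetical; they are not part of the original CleanHighAscii() function:

```cfml
<cfscript>
	// Hypothetical lookup table: character code -> plain ASCII replacement.
	replacements = {};
	replacements[ 8216 ] = "'";
	replacements[ 8217 ] = "'";
	replacements[ 8220 ] = """";
	replacements[ 8221 ] = """";
	replacements[ 8230 ] = "...";

	// Return the mapped replacement, or the escaped numeric value as a default.
	function lookupReplacement( codePoint, table ) {
		if ( structKeyExists( table, codePoint ) ) {
			return table[ codePoint ];
		}
		return "&##" & codePoint & ";";
	}

	writeOutput( lookupReplacement( 8230, replacements ) ); // Outputs: ...
</cfscript>
```

Adding a new special character then becomes one more entry in the struct rather than another CFELSEIF branch.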

5 Comments

Wow, very timely entry, Ben! Yesterday I discovered a need to convert extended ASCII characters into their HTML encoded entities... and then I saw this post in my feed reader. Thanks. This is perfect.

2 Comments

Thanks guys for all your comments. I have been working at a solution of my own, inspired by some of the solutions here. I specifically wanted to convert all high characters with a value greater than 127 into XML safe hexadecimal characters. I came up with a small solution that doesn't rely on any external objects yet was based on a regular expression.

It can be found at my blog ..
http://ipggi.wordpress.com/2008/03/11/remove-or-clean-high-extended-ascii-characters-in-coldfusion-for-xml-safeness/

3 Comments

Handling unicode bytes is unfortunately a little more complicated than stripping "bad" characters.

The issue is that multi-byte characters may start with a byte outside of the 0x00-0x7F range, but may then be composed of 1-3 additional bytes which are within this range. So when you strip the "bad" bytes, you're snagging only the first byte out of a multi-byte character. It can lead to garbage characters in your output.

You're also left completely without the ability to handle many foreign languages, your code can only handle latin characters, and that's just too restrictive for many folks.

UTF-8 is the most common international multi-byte character encoding since it's incredibly comprehensive (it can represent every Unicode character, just as UTF-16 can; the two differ in how many bytes a given character takes, not in which characters they cover).

You'll discover that putting <cfprocessingdirective pageencoding="utf-8"> in the first 1024 bytes of each template will cause CF and all of CF's string functions to correctly handle "high-ascii" characters. You will also need to alert the browser (or other consumer of your data) what the character set is. There are various methods to do this, depending on what type of content you're sending. For most types of content, a simple way is to use <cfcontent type="(content-type such as text/html); charset=utf-8">. There are also HTML head entries which can specify the character encoding, and an XML processing instruction for this purpose too (<?xml version="1.0" encoding="utf-8" ?>) which would cover your RSS feeds.

Man, I worked long and hard on getting regular expressions to sanitize my data, and the outcome is that it's virtually impossible with regular expressions alone. You really need a string system which treats a multi-byte character as a single character, and the way to get that in ColdFusion is with the aforementioned processing directive.

354 Comments

Eric, it seems like you are saying that simply adding the cfcontent, and the charset in the <xml> tag, would solve _everything_. Is that true? If so - that doesn't seem to be what I'm seeing locally. I know BlogCFC does this (although it uses cfprocessingdirective instead of specifying it on cfcontent) and folks still have issues with the 'high ascii'.

39 Comments

You have to both do a <cfprocessingdirective> (this tells ColdFusion what character set it's working with), and also inform the browser (or other client) what character set you're sending them (since once you set the character encoding with <cfprocessingdirective>, it will send characters to the client in that encoding for any output initiated from that template). Both sides have to know what the encoding is, and have to agree.

Also remember that <cfprocessingdirective> has to appear at the top of every template which may be doing any work with multi-byte strings (whether with string-related functions in ColdFusion, or outputting multi-byte strings to the browser). By multi-byte strings, I mean anything dealing with characters over 0x7F (character 128 and beyond, right well past character 256). Notably for UTF-8, character 128 uses two bytes, but does not have to be represented with the HTML entity &#128;.

For more information on how to inform various clients which character encoding you're sending them, see this W3C article: http://www.w3.org/International/O-charset

I usually stick to pure UTF-8, it has never been insufficient for my needs. Ideally you'd want this to be some sort of application configuration preference, but in reality that may be unrealistic to your code, and as I mentioned earlier, UTF-8 will cover 99.9% of what anyone in the world would want to do; those who want to do something which requires an even larger character set typically already know how to do so.

So a little background.
The way that UTF-8 works is that the high-bit of the first byte in a given character indicates whether it is a member of a multi-byte character.

In binary, single characters may be made up of the following byte sequences:
0zzzzzzz
110yyyyy 10zzzzzz
1110xxxx 10yyyyyy 10zzzzzz
11110www 10xxxxxx 10yyyyyy 10zzzzzz
where of course the Z's are the low-order bits, Y's the next highest order, X's the next, and W's the highest (well, the bit order can be changed, but I won't get into that for now, see http://unicode.org/faq/utf_bom.html#BOM for more information). Bytes whose two highest bits are 10 are members of a multi-byte string (this enables us to detect if some form of truncation has left us with only a partial character).

UTF-8 string parsers examine a byte to look for the number of bytes to consume to act as a single character. Even though these characters are made up of multiple bytes, they are treated as a single character.
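To see those byte patterns for yourself, here is a small sketch that prints the UTF-8 bytes for three characters. It assumes ColdFusion's CharsetDecode() and BinaryEncode() functions; the three character codes were picked only because they need one, two, and three bytes respectively:

```cfml
<cfscript>
	// Character codes for "A", the accented "e", and the euro sign.
	codePoints = [ 65, 233, 8364 ];

	for ( i = 1 ; i <= arrayLen( codePoints ) ; i++ ) {
		// Despite its name, CharsetDecode() turns a string INTO bytes.
		bytes = charsetDecode( chr( codePoints[ i ] ), "utf-8" );
		writeOutput( codePoints[ i ] & ": " & binaryEncode( bytes, "hex" ) & "<br />" );
	}
</cfscript>
```

Character 233, for example, comes out as the two bytes C3 A9, which is 11000011 10101001 in binary and matches the 110yyyyy 10zzzzzz form.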

If you are dealing with UTF-8 strings, and you fail to tell ColdFusion that you are doing so with this processing instruction, then it will treat a multi-byte character as a single byte character, which is where string replacements and regular expressions and the like can start creating invalid data. In particular, HTMLEditFormat will break a multi-byte character if you don't have the correct processing instruction set. It will potentially encode individual bytes of a multi-byte character into separate entities (which then get treated as individual characters).
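A rough illustration of that failure mode, assuming ColdFusion's CharsetDecode() and CharsetEncode() functions (the variable names are made up): take the UTF-8 bytes of one character and turn them back into a string using the wrong character set.

```cfml
<cfscript>
	// chr( 233 ) is the accented "e"; get its two UTF-8 bytes.
	utf8Bytes = charsetDecode( chr( 233 ), "utf-8" );

	// CharsetEncode() turns bytes back into a string using the named character set.
	correct = charsetEncode( utf8Bytes, "utf-8" );
	misread = charsetEncode( utf8Bytes, "iso-8859-1" );

	writeOutput( len( correct ) & " vs " & len( misread ) ); // Outputs: 1 vs 2
</cfscript>
```

Read as UTF-8, the two bytes are one character. Read as ISO-8859-1, each byte becomes its own character, and any later replacement or escaping works on those two pieces separately.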

HTML as usual is incredibly forgiving about such things (though the characters may still look like garbage, at least it doesn't explicitly error). XML parsers tend to be incredibly unforgiving about such things, which could explain why you're seeing this when dealing with RSS.

If you correctly implement character encoding in this way, you'll find you don't have to perform black magic with trying to convert characters, ColdFusion (or more specifically Java's string implementation under the hood) will automagically handle all of this for you.

354 Comments

Interesting.

So I do NOT want to hijack this into a BlogCFC thing, but if someone on this thread uses 'high chars' a lot and would like to help me test, let me know. It sounds like all I need is a <cfcontent> tag.

Also, Eric, let's say you are building a feed and have no idea what the data is. You could have Chinese chars. You could have funky MS word chars.

Would you recommend this technique to cover all the bases?

It _sounds_ like we may have a perfect solution here, and if so, I need to quickly blog it and take credit.

39 Comments

When you are accepting character data from a remote source, you must know what the character encoding is (well, you have to know what it is for your own data too, but fortunately you get to control that). A valid remote source will specify what the encoding is either with a content-type, or via an embedded mechanism as defined by that type of data.

Note that I haven't had perfect success with <cfcontent> with specifying character encoding (in particular, this encoding is likely to get lost if the content is separated from the HTTP headers, such as occurs when you save a HTML file), I strongly suggest you also specify the character encoding with some meta data native to the format you're producing. EG, for XML, <?xml version="1.0" encoding="utf-8" ?>, for HTML, <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >.

Most of the time encoding will be detected automatically for you. By the time the string gets into your hands as a ColdFusion developer, this will have been resolved for you. For example, when the browser submits form data, it gets to specify what encoding it's submitting the data as. ColdFusion will decode this automatically, and effectively (though not really) it treats each character as if it was a 32 bit character. This is transparent to you, and characters are not typically represented in memory as byte arrays, but character arrays (that is to say, memory representation is agnostic to the encoding, and it is converted into an encoding upon output). In reality, that's not actually the case, but the standard specifies that it has to at least behave as if it is.

So the question is, I'm getting data from somewhere, and it might be in almost any encoding, how do I handle this? The answer is that if you're using standard libraries (like ColdFusion's XML parser), it's probably handled for you, and you can ignore it. If you're dealing with raw data (such as if you read a file off the disk which doesn't contain any character encoding hints), it may be more complicated than that.

In a circumstance where metadata is missing about what the character encoding is, it is not necessarily possible to reconstruct the original string correctly. There are really fabulous libraries in the C world which handle this. I'm thinking of iconv and the like. They can guess the character encoding by seeing if the bytes match a specific encoding, but the problem is that it is sometimes possible for a string to be valid in more than one encoding (encoding only talks about how do we represent in 8-bit units character data which cannot natively fit in those 8-bit units). iconv is really good, but has been known to make mistakes, which is why if you have some way to definitively determine the character encoding, you should use it instead (plus this is faster than the tests which are necessary to guess encoding). I don't know if there is a Java equivalent for iconv.

Notably, it is an error for someone to provide you multi-byte data without telling you what encoding they are sending it to you in. The same data can be represented in a variety of encodings, and the actual bytes will be different, but the decoded value will be the same.

In the Java world, you can switch through encodings on the fly as you instantiate various objects which might need to output in various encodings. In ColdFusion, you're locked per-page (and you can't set it dynamically; this is a compile-time directive).

By the way, here is a short and simple example script showing that you can have high-value characters:
<cfprocessingdirective pageencoding="utf-8">
<cfcontent type='text/html; charset=utf-8'>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
<style type="text/css">
.c {
display: block;
width: 60px;
float: left;
border: 1px solid #DDDDDD;
}
.c b {
margin-right: 5px; border-left: 1px dotted #DDDDDD;
}
</style>
</head>
<body>
<cfoutput>
<cfloop from="1" to="2048" index="x">
<div class="c">#x##chr(x)#</div>
</cfloop>
</cfoutput>
</body>
</html>

Replace all 3 instances of utf-8 with ascii, and see what difference it makes (CF will represent characters which don't fit in the ascii character set as a question mark ? ). Note that I'm not using HTML entities here, I'm just using straight up characters with character values greater than 255.

Also try iso-8859-1 and utf-16, save its output and open it with a hex editor to compare the different encodings for the same data; the bytes will be different, but it will show the same in the browser (as long as it's a character set which supports all the displayed characters). Notice that in UTF-8, when you exceed character 127, the character output starts taking up 2 bytes. (Note, you might not be able to trust your browser to save the byte stream the same way it received it, it might translate it to its preferred encoding before saving, you may have to use a command line tool like netcat or wget).
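If you don't have a hex editor handy, a similar comparison can be sketched in memory. This assumes ColdFusion's CharsetDecode(), BinaryEncode(), and Len() functions accept these arguments as shown, and it uses "utf-16be" rather than "utf-16" so that no byte order mark is added to the output; the list of encodings is just an example:

```cfml
<cfscript>
	// Hypothetical list of encodings to compare for the single character 233.
	encodings = [ "iso-8859-1", "utf-8", "utf-16be" ];

	for ( i = 1 ; i <= arrayLen( encodings ) ; i++ ) {
		// Same character each time; only the byte representation changes.
		bytes = charsetDecode( chr( 233 ), encodings[ i ] );
		writeOutput(
			encodings[ i ] & ": " & len( bytes ) & " byte(s), hex " &
			binaryEncode( bytes, "hex" ) & "<br />"
		);
	}
</cfscript>
```

The character is one byte (E9) in ISO-8859-1, two bytes (C3 A9) in UTF-8, and two bytes (00 E9) in UTF-16BE.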

39 Comments

It may help to pay attention to the "encoding" portion of "character encoding," that is to say, it's like <cfwddx>. You can represent an object in text which is not natively text. Likewise, you can represent a string in 8-bit which uses more distinct characters than 8-bit natively supports.

So you use <cfwddx> to encode your complex data type into text. You use encoding to encode thousands of distinct characters using only 256 distinct bytes. It's encoded. You don't modify it in its encoded form, you modify it in its decoded form, and when you're done, you re-encode it so you can send it across a limited medium.

Once decoded, strings in memory do not remember that they were once UTF-8 or ISO-8859-1. You tell them to become this again when you're ready to transmit them. You tell ColdFusion how to output these strings in a way the browser (or other client) will like with <cfprocessingdirective pageencoding="utf-8"> (which I believe also controls its default handling for data streams with no native encoding specification).

I hope I haven't strayed too far off-topic or posted too many really lengthy diatribes. My coworkers will see this, and know it's my writing even though it doesn't have my last name associated with it (they'd probably know it even if it had no name associated with it).

41 Comments

Well, I know this is old-hat by this point, but I'm finding that a lot of developers don't know much or anything about the difference between a byte and a character, or the difference between Unicode, and encoding that Unicode (such as with UTF-8).

I thought I'd include a link to a recent blog I did about this: http://www.bandeblog.com/2008/05/unicode-absolute-minimum-every.html

This post is just about what Unicode is, and what character encodings mean. I have a follow-up scheduled to this post for tomorrow which talks about UTF-8 specifically.

If you're a developer, especially a web developer, this is essential knowledge, and I'll be happy to try to answer any questions you might have.

15,674 Comments

@Eric,

I will check this out. I know that the concept of encoding and different character bytes definitely confuses me. I never learned it, and have never had to learn it too much. I will take a look at your post. Thanks.

354 Comments

@Eric

I'm trying to make a demo of your technique for my preso tomorrow but I'm having trouble. Your test script works well. But this sample does not. I don't see ? marks, but other odd marks instead of the proper foreign characters. What am I missing?

<cfprocessingdirective pageencoding="utf-8">
<cfsavecontent variable="strText">
<?xml version="1.0" encoding="utf-8"?>
<text>
Bonjour. Vous êtes très mignon, et je voudrais vraiment
votre prise en mains (ou, à s'emparer de vos fesses,
même si je pense que peut-être trop en avant à ce moment).
</text>
</cfsavecontent>
<cfset strText = trim(strText)>
<cfcontent type="text/xml; charset=utf-8" reset="true"><cfoutput>#strText#</cfoutput>

39 Comments

@Ray: I'll email you directly since you're up against a deadline and that's probably faster than going back and forth here. We can post a digest here once we get it settled. Just wanted to let you know to look in your email (gmail) in case you missed it.

2 Comments

If you're trying to use special/high characters it is best not to copy them from other sources as they will probably be encoded in their own character sets, which are often incompatible with each other. This is why you get back a blank space. For example by default ColdFusion encodes pages in Western (Latin 1) / ISO-8859-1 which is completely different to Unicode.

I suggest inserting the exact character you want. Run the Windows application 'Character Map'. Select the font you are planning to use in your page such as Arial, and then change the 'character set' to 'Windows: Western'.

From there find the • character you are after, 'select' and 'copy' it to your clipboard, and then paste it into your page.

2 Comments

Well you will still have to use the character map to find out the uni character code. For example

• is U+01BE

So in HTML4 you would escape it by using ƾ

® is U+00AE so that would be ® in HTML4

2 Comments

Oops that didn't display correctly

• is U+2022

So in HTML4 you would escape it by using AMP#x2022;

® is U+00AE so that would be AMP#xAE; in HTML4

(replace AMP with an &)

3 Comments

Ok, I guess my thing is, that I can escape that character, but there are other characters that come up from time to time so it's not just that. Is there a regex statement that says only allow letters, numbers, characters that show up on a PC keyboard?

15,674 Comments

@Michael,

Yeah, you can do that. In the above example, I am using a regular expression that finds characters NOT in that range:

[^\x00-\x7F]

To get the characters that DO fall in that range, just remove the "not":

[\x00-\x7F]
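Note that this range still includes the non-printing control characters (values 0 - 31 and 127). If the goal is "only what you can type on a keyboard", a narrower range may be closer to the mark. Here is a minimal sketch, assuming Java's String.replaceAll() method and made-up variable names, that keeps printable ASCII plus tabs and line breaks and drops everything else:

```cfml
<cfscript>
	// Hypothetical input: a bullet character, some text, and a tab.
	dirty = chr( 8226 ) & " Item" & chr( 9 ) & "one";

	// Keep space through tilde (x20 - x7E) plus tab, line feed, and carriage return.
	clean = javaCast( "string", dirty ).replaceAll( "[^\x20-\x7E\t\n\r]", "" );

	writeOutput( clean );
</cfscript>
```

The bullet is removed, and the rest of the text, including the tab, is left alone.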

2 Comments

@Raymond Camden,

Can you please tell me if you have a solution for this.

<cfprocessingdirective pageencoding="utf-8">
<cfset myString = "The Islamic Republic of Mauritania's (République Islamique de Mauritanie) 2007 estimated population is 3,270,000. Also check Côte d'Ivoire">

<cfset myNewString = xmlFormat(myString)>

<cfoutput>#myNewString#</cfoutput>

2 Comments

@Raymond Camden,

I am sorry for that. I am trying to ask if you have a solution for this.

<cfprocessingdirective pageencoding="utf-8">
<cfsavecontent variable="strText">
<?xml version="1.0" encoding="utf-8"?>
<text>
Bonjour. Vous êtes très mignon, et je voudrais vraiment
votre prise en mains (ou, à s'emparer de vos fesses,
même si je pense que peut-être trop en avant à ce moment).
</text>
</cfsavecontent>
<cfset strText = trim(strText)>
<cfcontent type="text/xml; charset=utf-8" reset="true"><cfoutput>#strText#</cfoutput>

39 Comments

@Ramakrishna
It looks like you'll have a new line before your processing instruction.

These would need to be on the same line:
<cfsavecontent><?xml

Also you need to be sure that you're saving the file in UTF-8 if you're telling <cfprocessingdirective> that the file is UTF-8 (check your editor preferences). Personally I use UTF-8 for absolutely everything. It covers the entire Unicode character set, and uses one byte per character for the majority of the characters typically found in output for most western languages.
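
To make the "one byte per character for most western text" point concrete, here is a small Java sketch that counts UTF-8 bytes. The class name is hypothetical; getBytes() and StandardCharsets are standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Width {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Plain ASCII letters come out at one byte each, accented Latin letters such as the ones in the French sample text take two bytes, and a character like the bullet (U+2022) takes three.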

1 Comments

Ben you rock! I have been banging my head on my desk for months and poof you fixed it for me!
THANKS!!!!!!!!!!!!!!

22 Comments

I am confused about a result I am getting after experimenting with Ben's code. I am trying to insert the following into a sql db and jam it in an xml doc:
S.à r.l.

If I use Ben's code and <cfoutput>#CleanHighAscii( strText )#</cfoutput>, then there is no issue. The output and viewing the page source are as expected and lovely.

However, after it is inserted into the database AND if you cfdump the above variable, rather than cfoutput, I get the following:
S.&#224; r.l.
If you view the source on the above it is:
S.&amp;#224;

For some reason, after inserting into the db and during a cfdump, the & gets replaced with &amp;

Boy, I would appreciate any help or comments. Very frustrating.

39 Comments

Robert, Ben's code replaces any character over U+007F (anything over the first 128 characters) with the &#nnn; equivalent. Your à character is one such character, and encodes as &#224;. CFDumping a string is essentially equivalent to outputting the HTMLEditFormat() for the same string. The character doesn't exist in the string any longer, only the HTML entity equivalent.
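
Here is a rough Java sketch of that sequence: first the high characters become numeric references, then an HTML-escaping pass (standing in for what HTMLEditFormat() or a dump does) escapes the ampersand a second time. Both method names are hypothetical stand-ins, not the actual ColdFusion functions.

```java
// Hypothetical helper, for illustration only.
final class DoubleEscapeDemo {
    // Replace every character above U+007F with its decimal numeric reference.
    static String toNumericEntities(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Minimal HTML escaping; the ampersand is handled first so that the
    // entities produced for the other characters are not escaped again.
    static String htmlEscape(String input) {
        return input.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }
}
```

Running toNumericEntities() and then htmlEscape() over the same value turns the leading ampersand of the numeric reference into its own entity, which matches the view-source result described above.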

39 Comments

So you want the data in SQL to not be in the HTML-entity-encoded format? There is much to learn about character encodings to adequately debug where character encoding may be going wrong. The first thing you might consider checking though is that you have "String Format: Enable High ASCII characters and Unicode for data sources configured for non-Latin characters" enabled in your data source. You'll also need to be sure your SQL Server is configured to accept Unicode (eg, you store values as UTF-8 or UTF-16). Plus you'll want to be certain that you're storing the values in a column which can contain Unicode (such as nvarchar2).

Weird things start to happen when any part of the stack doesn't support extended characters. When in doubt make everything UTF-8.

Also spend some time reading up on Unicode and UTF-8 (shameless plug):
http://www.bandeblog.com/2008/05/unicode-the-absolute-minimum-every-developer-should-know/
http://www.bandeblog.com/2008/05/how-utf-8-encoding-works/

If you spend any time working with International languages at all, this stuff is absolutely required knowledge. The alternative is banging on it till it works, not understanding why it works, and later discovering it only works part way.

22 Comments

Thanks Eric! Already, those first two tips were brand new to me.

I'll start to edu-ma-cate mines self. :)

Thanks again.

22 Comments

An observation for future generations.

I don't think my issue is related to anything I've talked about above. Rather, it seems XmlFormat() is escaping the string twice. So what should be an ampersand is becoming ampersandSemicolon

1 Comments

While I have used Ben's solution to resolve some issues with those mysterious Microsoft characters, I also tried some experimentation with UTF-8. As my clients use a WYSIWYG Editor to enter content, they will often copy straight from MS Word even with a special button to remove said characters.

As much as I try to use the UTF-8 suggestion from Eric, I can't get it to work. I was, however, able to get my websites to read correctly if I used iso-8859-1.

Since UTF-8 is the preferred encoding, what do I need to do to get it to work with my websites?

39 Comments

David, you might try out the "setEncoding" function in ColdFusion (http://livedocs.adobe.com/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000623.htm). The problem, as you've probably discovered, is that the browser is most likely submitting data on the form as ISO-8859-1, and when you regurgitate the same bytes but declare them to be UTF-8 encoded, you get some odd characters.

So something like setEncoding("FORM", "ISO-8859-1"), while otherwise using UTF-8 for everything else, should cause ColdFusion to interpret the values correctly in their native encoding.
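
A minimal Java sketch of the mismatch, assuming the submitted bytes really are ISO-8859-1: decoding them with the right charset gives the original text back, while decoding the same bytes as UTF-8 produces replacement characters. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class FormEncodingDemo {
    // The bytes a browser would send for this text if the form were ISO-8859-1.
    static byte[] latin1Bytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Interpret raw bytes using whichever charset the server believes applies.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }
}
```

The bytes never change between the two decode() calls; only the declared charset does, which is the same thing setEncoding() adjusts on the ColdFusion side.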

15,674 Comments

@David,

I will defer to @Eric on this. Encoding as a concept is not something which I have fully wrapped my head around yet.

7 Comments

Ben you've saved the day again! I've been struggling with those odd Word characters for about a week now!

Although I'm still running on MX2004 (we'll upgrade soon! Oh how I long for the ++ operator) so I had to make a few changes, it's still a wonderful function!

Thanks again!

1 Comments

After reading this thread I have no idea what is going on with charset encoding or what to do about outputting CMS data. I don't see how you guys have time for this stuff...

15,674 Comments

@Pete,

Good stuff, glad to help.

@Rob,

I totally relate; character encoding is something that I only have a superficial understanding of. I only just recently started using UTF-8 on my databases! Then there's the page encoding and the HTML Meta encoding. Ug, it's still a bit murky for me.

2 Comments

I'm now using utf-8 and unicode in my databases and coldfusion routines. The one thing that isn't working is to use unicode characters in cfm filenames.

Loading such a page returns the CF error report: "File not found: /??.cfm ". The ? mark is where each unicode character is found.

I'm using IIS7.5 and CF8. Unicode filenames are served up by IIS7.x .

Anyone else run into this? Is this possible with CF8 (on IIS)?

6 Comments

@Dan,
You say that you're using utf-8 and unicode in your database. Are your database columns nvarchar instead of varchar? Check your meta tags, make sure your charset=utf-8.

2 Comments

Hey Paul and Ben. Thanks for your thoughts on this.

Everything is working beautifully on the system; nice unicode characters everywhere. Except in URL referring to cfm files.

I have since tested unicode character in included templates and that doesn't work either (error showed three funny characters in place of the one Chinese one). This aspect is much less important than getting these characters into the url.

My conclusion to-date is CF8 can't do it. IIS7x can.

For others moving this way I'd add a couple of points to some of the steps detailed in the thread above:

- SQL statements need the N prefix on all string assignments.
- cf templates need the BOM to work; utf-8 only is insufficient. Dreamweaver lets you set this.
- Read files (eg cffile) with the charset="utf-8". Writing files with unicode filenames doesn't work with cffile - the BOM is not set this way. You will need an alternative, eg java file output, where you can insert the BOM.
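
The "java file output, where you can insert the BOM" idea could look something like the sketch below. This is only an assumption about one way to do it: the class, method names, and temp-file round trip are invented for illustration, and only the three BOM bytes (EF BB BF) are a fixed fact of UTF-8.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class BomWriter {
    // The UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Write the BOM first, then the text encoded as UTF-8.
    static void writeWithBom(Path target, String text) {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(UTF8_BOM);
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo only: write to a temp file and return the raw bytes that landed on disk.
    static byte[] roundTrip(String text) {
        try {
            Path temp = Files.createTempFile("bom-demo", ".cfm");
            try {
                writeWithBom(temp, text);
                return Files.readAllBytes(temp);
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```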

1 Comments

Ben,

Thank you for your elegant solution. I was just sitting down to address a similar issue, however, the Web Site only needs to deal with English.

I have a customer group who copied and pasted many Microsoft documents from several applications into a Client/Server database over many years. The database is going away and the customer group wants the information transitioned to a database that can be queried from the Web. In my case my plan is to create data files from the current database. Then parse the files to eliminate all non-ASCII (Microsoft) characters. The purified files would then be loaded into the database we use with our Web Site.

Thank you also for the UTF-8 discussion, I will ensure that I do not fall into the multi-byte trap discussed above.

15,674 Comments

@Jacques,

Sounds like you'll have some fun scripting ahead of you! I'm glad that the UTF stuff is helping you out. I feel like I am finally getting better at that; but, it's still a bit dodgy in my mind.

22 Comments

Hey Ben,

I used the information from your post to write a sanitization function for a C# web service yesterday, but I was still having problems.

It turns out the entire range [\x00-\x7F] isn't safe. For example, a vertical tab (\x0B) causes all 3 XML parsers I tried (ColdFusion, Firefox, Chrome) to choke.

I modified the regex to only include the printable characters (\x20-\x7E), plus carriage return (\x0D) and horizontal tab (\x09), for a new regex of:

[^\x20-\x7E\x0D\x09]

So far, this has worked flawlessly for me.
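
For reference, here is the same pattern as a minimal Java sketch (the class and method names are hypothetical; the original was written for C#).

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class XmlSafeAscii {
    // Anything that is NOT printable ASCII (hex 20 to 7E), carriage return, or tab.
    static final Pattern UNSAFE = Pattern.compile("[^\\x20-\\x7E\\x0D\\x09]");

    // Remove every character the pattern flags as unsafe.
    static String strip(String input) {
        return UNSAFE.matcher(input).replaceAll("");
    }
}
```

One thing to watch: as written, the character class does not list the line feed (\x0A), so strip() removes line feeds along with the vertical tab. If multi-line text needs to survive, adding \x0A to the class would keep them.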

15,674 Comments

@Adam,

Ah very nice! Silly vertical tab! Who uses those anyway :)

On an unrelated note, someone just pointed me to your Taffy framework post - looks like some very cool stuff. I'll be peeking at the implementation looking for some inspiration.

1 Comments

This is an awesome discussion. I only wish I had found it months ago as I was working my way through understanding character encoding.

I'm having a strange issue with saving my unicode data to a database. In general it's been working fine (pretty much doing what has been discussed here already using utf-8) and I thought I had this licked.

I'm using cf8 and oracle10g. Saving xml to nclob in an oracle table using cfqueryparam/sql_type_clob.

What's happening is that sometimes my unicode chars are getting converted to ascii and I'm getting those nasty invalid xml character errors. I figured out what's going on but not why and how best to fix it.

If the total length of my xml is less than 4k the bad ascii conversion happens. Anything over 4k and it saves to the database correctly.

I played around with some explicit settings (as described in the above discussion) but it didn't help. The encoding seems to be correct up until the sql update. I assume there is some internal conversion/typing going on with cfqueryparam that is causing this but I haven't been able to find anyone else describing this same issue.

Anyone got any suggestions? The quick and cheesy fix is me appending a fake element of 3k to the xml before saving. That works! But obviously not the ideal solution.

Thanks

15,674 Comments

@Ken,

That's strange that it would depend on the length of the inserted data. I don't really know much about CLOB data types, but it sounds like something is either going wrong with the data param'ing or with the insert. I'm stumped.

9 Comments

@Gareth, @ben, and @Ray Thanks for all your ideas.

I had to read (and manipulate) an HTML file that had some special higher characters in it, like and and ? and - and " and ". (I think some of the HTML was cut and pasted from Microsoft Word.) I converted the HTML to cleaner XHTML using JTidy, then to XML so I could manipulate the data more easily. Using ColdFusion 8, xmlParse() did not like some of these higher characters (I forgot the error). Using the information from this post, I came up with the following code. By posting it here, hopefully I can save someone else some time.

<!--- remove ALL special high bit characters, convert into "ampersand pound number number number semicolon" ---
--- see http://www.octadyne.com/html_entity_acsii_table.cfm for good list of ascii codes conversions
--->
	<cffile variable="htmlStr1" action="read" file="#HtmlFileNameStr1#" charset="utf-8" />
	<!--- find every character outside the 7-bit ASCII range (hex 00 to 7F) --->
	<cfset matchingHighAsciiArr = REMatch("[^\x00-\x7F]", htmlStr1) >
	<!--- start from the original text so each pass keeps the earlier replacements --->
	<cfset myString = htmlStr1 >
	<cfloop array="#matchingHighAsciiArr#" index="chStr">
		<cfif len(chStr) neq 1>
			<cfabort showerror="ERROR: chStr is '#chStr#' ">
		</cfif>
		<cfset iAscVal = asc(chStr) >
		<cfif (iAscVal eq 160) >
			<cfset replacementStr = "&nbsp;">
		<cfelse>
			<cfset replacementStr = "&###iAscVal#;">
		</cfif>
		<cfset myString = Replace(myString, chStr, replacementStr, "ALL")>
	</cfloop>
15,674 Comments

@Dangle,

How nice is it that there's a way to represent high-ascii characters with the &#ASC; notation. That has saved me a few times. Glad that this got you through the XML-parse stuff. I have, however, run into experiences (I think) where the ampersand also needs to be escaped in these special character notations. Of course, I could very easily have been messing something else up. I can never remember all the XML-parsing rules.

9 Comments

I ran into that ampersand problem in converting HTML to XML also. However, as I read somewhere, the solution is to temporarily change ampersand symbols to XML's allowed "ampersand a m p semicolon" before sending to xmlparse() using:

cfset beforeParseStr= Replace( htmlstr, "&", "&amp;", "all")

THEN after done with xmlparse() and all manipulations, put ampersands back.

cfset afterParseStr= Replace(
ToString(XmlObj), "&amp;", "&", "all")

PS: Ben, thanks for the very useful site -- I use it a lot. :-) Dan

39 Comments

The reason you're having difficulties with named entities like &rsquo; not being recognized when parsing as XML is that unlike HTML, XML only comes with five built-in named entities (&lt;, &gt;, &amp;, &apos;, and &quot;). What you're doing is actually double-escaping those entities when you do &amp;rsquo;. This probably works because you're effectively telling the XML parser that there's a literal character sequence: "&rsquo;" instead of letting the XML parser handle the entity for you.

Like some other comments in this thread, this is also an encoding issue. &rsquo; is a named entity which maps to Unicode character 2019 (aka U+2019). Semantically, &rsquo; &#x2019; and &#8217; are identical to each other. In fact if the XML document is set up to recognize the named entity, then in most XML parsers, once the document is parsed, it's impossible to know which entity form they used. It's also not possible to tell the difference between one of these, and a properly encoded representation (such as UTF-8 if that's what you're using).

It is totally possible to use traditional named entities in an XML document. There are two ways to do so, one is to inherit a DTD which defines those entities. For example: PUBLIC "-//W3C//ENTITIES Latin 1//EN//HTML"

Another way is to identify and define those entities you wish to use, and include them explicitly. I won't post the entire HTML named entity list here, it's somewhat long. If anyone would like it, send me an email and I'll send it to you when I can (I'm at Adobe MAX this week - if you're there and you see a bald guy with a beard, come say hi: www.bennadel.com/index.cfm?site-photo=80).

Here's a short example of what it looks like to include the definition of &rsquo; in your document (where RootNodeName is the name of your document element). You'll notice that rsquo is defined as *being* one of the alternate representations I mentioned above (&#8217;).

<!DOCTYPE RootNodeName [
<!ENTITY rsquo "&#8217;">
]>

The beauty of this approach is that you can declare your own named entities.
<!ENTITY eric "Eric, knower of character encoding">

In that XML document, if you typed: This guy sure thinks a lot of himself: "&eric;", it would parse as: This guy sure thinks a lot of himself: "Eric, knower of character encoding" There would in fact not be an easy way to return to the original shorter &eric; form.
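
If you want to see the expansion happen, here is a small Java sketch using the JDK's built-in DOM parser. The sample document, class name, and method name are invented for illustration; ColdFusion's xmlParse() may differ in its details.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class EntityDemo {
    // Sample document: the internal DTD subset declares two named entities.
    static final String SAMPLE =
        "<!DOCTYPE note [\n"
      + "<!ENTITY rsquo \"&#8217;\">\n"
      + "<!ENTITY eric \"Eric, knower of character encoding\">\n"
      + "]>\n"
      + "<note>It&rsquo;s &eric;</note>";

    // Parse the XML and return the text of the root element, entities expanded.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML did not parse", e);
        }
    }
}
```

After parsing, the text content holds the real right single quote character and the expanded phrase; nothing in the parsed text records that a named entity was used in the source.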

We use XML and XSL to do page layout (XML is our model, and XSL is our view essentially), and this allows us to create short hand declarations for some complex entities. For example, on one site we use a small icon which represents a unit of measure. It comes with alt text, a little javascript for a tooltip, and so forth. Instead of having to type out the full markup for that each time (it's used a LOT), we just include &uom;, and it expands automatically. We also have a business requirement that all registered trademark entities are superscripted. So we declared &reg; to be <sup>&reg;</sup>.

39 Comments

Oops, said "email me" and didn't give my address. mightye~gmail.com

Also, &reg; isn't self referential in our code like I said in my last paragraph, it's actually this:
<!ENTITY reg "<sup>&#174;</sup>">

Don't know what would happen if you created a referential loop in entities like that. Probably would depend on the XML parser, but you'd either get a stack overflow, or the parser would handle it more gracefully and warn you of the referential loop.

18 Comments

Thanks Ben for the article and thanks to Adam for enhancing the function.

I have just used it in my code to handle/ make safe xml text for my XML document.

Thanks again.

Philip

15,674 Comments

@Eric,

That is some awesome stuff. I had no idea that "&" was so special in XML. I just always thought of it as something that messed up my parsing :) Very very cool stuff and thanks for the explanation.

@Philip,

Awesome my man :)

18 Comments

Hi Ben,

Adobe introduced XmlFormat(string) to escape all special characters including High Ascii one.

Is it safe to use the above function to make XML safe?

Thanks

Philip

1 Comments

Always, really always, when I'm looking for some CF stuff, I'm redirected from Google to your site.

You're the man - keep em rockin!

2 Comments

Hey Ben,

Just thought you might want to know, it looks like this guy stole your article:

https://clipr.torchbox.com/154/

Thanks for the great info. I'm working on a ColdFusion Builder extension and testing with the high ASCII values 8220 and 8221 (Microsoft "smart" quotes).

CF Builder seems to be botching those characters (they appear as question marks) before my handler page can do anything about them.

Dumping the XML variable (ideventinfo) in my handler (before any processing by the handler) gives me the question marks in place of the smart quotes. I don't see any way to set the encoding to remedy this, though.

15,674 Comments

@Andy,

Hmm, not sure what that other site is. I'm not gonna worry about it for now. As for encoding, there's a few places you can set encoding - with HTML Meta tags and with the CFProcessingDirective tag; perhaps throwing UTF-8 into one or both of those places will help with the question marks.

@Wouter,

Glad to help.

1 Comments

thanks! this even solved my php problem by using your given ascii range in preg_replace();. i tried other ways of removing non-standard ascii from input, but this one worked for me.

39 Comments

@vector, for PHP, I recommend looking into either iconv() or mb_convert_encoding() http://php.net/mb_convert_encoding. For example:

$text = mb_convert_encoding($text, 'UTF-8', mb_detect_encoding($text));

Most browser-submitted content is going to originally be in ISO-8859-1, so the majority of the time using this will also get you correct results:

$text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');

I recommend using UTF-8 for pretty much everything, from the data at rest in your database or filesystem, to the character encoding of your page output.

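You can also use filter_var() if you just want to drop extended characters outright:

$text = filter_var($text, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
// or, applying the same filter and flags to every POST field
$strip = array('filter' => FILTER_UNSAFE_RAW, 'flags' => FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);
$_POST = filter_var_array($_POST, array_fill_keys(array_keys($_POST), $strip));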
2 Comments

It would be nice to "globally" or "automatically" sanitize all data that is persisted to the DB. Since I'm using ORM, I'm fooling around the preUpdate/preInsert events, but I'm wondering if anyone else has already tackled this? My JSON returns from AJAX calls show an "invalid JSON format" error if any of the data contains special characters.

39 Comments

It's probably considered bad practice by some, but we've globally sanitized data in Application.cfc's onRequestStart() method. We update the values of URL and FORM directly so that these values are sanitized for anything downstream which might want them.

We have the policy that anything that goes into the database must be UTF-8, if non-UTF-8 data gets in there, that's where the bug lies (rather than in the code which outputs non-UTF-8 data). It's just much harder to always sanitize outputs than to always sanitize inputs.

FWIW, the reason you get "invalid JSON format" is because the JSON spec specifies that data must be encoded as UTF-8. No other encoding is acceptable. For the most part, browsers in the US and Europe are submitting data in ISO-8859-1 (LATIN-1), so if you're using JSON, you need to be careful to be sure user input encoding is sanitized.

2 Comments

When I added the following code to onRequestStart in app.cfc, it stripped out the special characters:

for (key in URL) {
	if (not isJSON(URL[key])) {
		URL[key] = REReplace(URL[key],'[^\x20-\x7E\x0D\x09]','','all');
	}
}

for (key in FORM) {
	if (not isJSON(FORM[key])) {
		FORM[key] = REReplace(FORM[key],'[^\x20-\x7E\x0D\x09]','','all');
	}
}

This is using the aforementioned regex that Adam Tuttle came up with. Is this what you had in mind Eric? Or is there a better way to detect if something is not UTF-8 compliant?
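
On the "detect if something is not UTF-8 compliant" part, one possible approach is a strict decode at the Java level, sketched below. The class and method names are hypothetical, and this works on raw bytes: by the time a value is sitting in the URL or FORM scope as a string it has already been decoded, so a check like this would have to run on the original request bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Check {
    // True when the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Unlike the regex, this does not throw away accented or non-Latin characters; it only reports whether the bytes are well-formed UTF-8.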

18 Comments

Eric, if you're still out there, would you consider this page authoritative for CF8+ on how to support i18n for CF?

http://mysecretbase.com/ColdFusion_and_Unicode.cfm

It seems to incorporate what you've outlined above except that the comments here are from 2008 so I don't know if/how things have changed.

What's interesting is that we haven't done any of these things but with CF8 we seem to be handling international characters just fine?

2 Comments

Just on my own CF9 setup, to get ascii, unicode and html brackets sorted for XML, I took the following from the above:

<cfsavecontent variable="str">
<p>Glendaloch's ( in Irish: Gleann Dá Loch ) ,<br /> is noting like Japanese Nemawashi (???) or even the French River Vézère </p>
</cfsavecontent>
<cfoutput>#str#</cfoutput>
<!--- with Eric's cfprocessingdirective pageencoding="utf-8" top of page  --->
				 
<cfset uniCodeOk = createObject("component","cfcmap.compName").init()> <!--- want to use component Ben's function --->
<cfset unicoded =uniCodeOk.cleanHighAscii(str)>
<!---  Ben Nadel's function converts Japanese, French, and Irish chars to unicode but not the lone ' in the string --->
<cfoutput><p>#unicoded#</p></cfoutput>
<cfset htmlSorted=xmlformat(unicoded)>
<!--- sorts out HTML brackets, yes, but unfortunately also tackles & in unicode chars --->
<cfset  ampsOutStr= replace(htmlSorted,"&amp;","&","All")>
<!--- take out &amp; --->
<cfoutput>#ampsOutStr#</cfoutput>

I do not understand why unicode characters come back from Ben's function and the lone apostrophe gets sorted by xmlFormat function.

1 Comments

Hi Ben,

I am using the CleanHighAscii function in my application. It's working great with the special characters issue, but it's resulting in a cross-site scripting problem. I tried HTMLEditFormat() to overcome the XSS problem while displaying, but it displays the hex code in the front end. Please help me out on this issue.

Thanks & Regards,
Naresh.
