Yesterday, I started running into an interesting issue when using the canonicalize() function in Lucee CFML to normalize the encoding of a given URL. The result of the canonicalize() call was leaving some URL search-parameters intact while corrupting others. At first, it seemed completely random. But, after digging into it for a bit, I realized that the canonicalize() call was decoding substrings within the URL that looked like HTML entities. To demonstrate, I was able to isolate the issue in Lucee CFML 220.127.116.11.
According to the MDN (Mozilla Developer Network) Docs on HTML Entity, an Entity is:
An HTML entity is a piece of text ("string") that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code), and invisible characters (like non-breaking spaces). You can also use them in place of other characters that are difficult to type with a standard keyboard.
So, strictly speaking, an HTML Entity should end with a semi-colon. The problem is, the browser tries to be too helpful and will, in fact, interpret HTML Entity values even if they aren't valid. For example, if my HTML were to contain the phrase:
How will &raquo evaluate?
The &raquo substring, which isn't a valid HTML Entity because it lacks the trailing semi-colon, won't render as a literal value; instead, the browser will render it as a right-angle quote (»).
Based on this, my assumption is that the canonicalize() call in Lucee CFML is taking this loose browser behavior into account when it decodes and normalizes strings. This is causing it to decode parts of a string that weren't encoded to begin with.
To see what I mean, let's take a look at the following ColdFusion code where I am constructing a URL that needs to be run through the canonicalize() function:
<cfscript>

	// Setup the key-value pairs for our demo URL.
	searchParams = [
		"action=quantize",
		"infinity=false",
		"equivalence=fuzzy",
		"origin=v7"
	];

	value = ( "end-point.htm?" & searchParams.toList( "&" ) );

	echo( "Lucee " & server.lucee.version );
	echo( "<br />" );
	echo( "<br />" );

	// Output the canonicalized URL (escaped for safe display).
	echo( encodeForHtml( canonicalize( value, true, true ) ) );
	echo( "<br />" );
	echo( "<br />" );

	// Output the original URL for comparison.
	echo( encodeForHtml( value ) );

</cfscript>
ASIDE: Obviously, we wouldn't normally pass an internally-constructed URL to canonicalize(). Just try to imagine that the URL-in-question is being passed-in from an untrusted source.
There should be nothing in the search-parameters of this URL that cause concern. However, when we run this Lucee CFML code, we get the following output:
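As you can see, the canonicalize() function has "decoded" the following substrings:
&infin (the start of "&infinity") becomes ∞
&equiv (the start of "&equivalence") becomes ≡
&or (the start of "&origin") becomes ∨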
Even though these substrings aren't strictly valid HTML Entities, insofar as they don't end with a semi-colon, the canonicalize() function is treating them as something that might get interpreted by the browser as HTML Entities. This leaves the normalized / sanitized / canonicalized URL completely invalid.
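To narrow down which search-parameters trip this behavior, each one can be run through canonicalize() in isolation. What follows is a minimal sketch, not part of the original demo: the sample strings and their "a=1" prefix are made up for illustration, and the "page" control assumes that no HTML Entity name matches the start of that word.

```cfml
<cfscript>

	// Hypothetical sample inputs. In the first three, the ampersand is
	// followed by a word that starts with a named HTML Entity, but has
	// no trailing semi-colon.
	samples = [
		"a=1&infinity=false",
		"a=1&equivalence=fuzzy",
		"a=1&origin=v7",
		// Control (assumption): "page" does not start with an entity name.
		"a=1&page=2"
	];

	for ( sample in samples ) {

		result = canonicalize( sample, true, true );

		// Use compare() for a case-sensitive check; the == operator is
		// case-insensitive for strings in CFML.
		status = ( compare( result, sample ) == 0 )
			? "unchanged"
			: "changed"
		;

		echo( encodeForHtml( sample ) & " : " & encodeForHtml( result ) & " (" & status & ")" );
		echo( "<br />" );

	}

</cfscript>
```

Any line reported as "changed" points at a parameter name that canonicalize() has treated as the start of an HTML Entity.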
So, what to do about this?
At this moment, I have no idea. But, I will be consulting with David Epler, our senior application security engineer, to figure out how to deal with this case. And, I'll leave updates in the comments below.
It could just be that we're not using the canonicalize() function as it was intended to be used. To be honest, it's a function that confuses me a bit. I'm not quite sure when and where it's supposed to be applied.
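If canonicalize() is meant to be applied to individual values rather than to a whole URL (an assumption, not something confirmed here), then one direction to explore is splitting the query-string first and canonicalizing each key and value on its own. That way, the "&" delimiters never sit next to a parameter name inside the input. The canonicalizeQueryString() helper below is hypothetical and only a rough sketch: it ignores empty pairs and empty keys, and it does not re-encode a value that decodes to something containing "&" or "=".

```cfml
<cfscript>

	// Hypothetical helper (not from the original post): canonicalizes the
	// key and value of each pair separately, then reassembles the string.
	function canonicalizeQueryString( required string queryString ) {

		// NOTE: listToArray() skips empty list items by default.
		var pairs = listToArray( queryString, "&" );

		var cleaned = pairs.map(
			function( pair ) {

				// A bare key with no "=" is canonicalized as a whole.
				if ( ! find( "=", pair ) ) {

					return( canonicalize( pair, true, true ) );

				}

				var key = listFirst( pair, "=" );
				var value = listRest( pair, "=" );

				// Skip the call for empty values (e.g. "key=").
				var cleanValue = len( value )
					? canonicalize( value, true, true )
					: ""
				;

				return( canonicalize( key, true, true ) & "=" & cleanValue );

			}
		);

		return( cleaned.toList( "&" ) );

	}

	queryString = "action=quantize&infinity=false&equivalence=fuzzy&origin=v7";

	echo( encodeForHtml( canonicalizeQueryString( queryString ) ) );

</cfscript>
```

Whether this is an appropriate use of canonicalize() from a security standpoint is a separate question; the sketch only shows how to keep the delimiters out of the input.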