Canonicalizing A URL By Its Individual Components In Lucee CFML 5.3.6.61

By Ben Nadel on May 22, 2020

CAUTION: I am not a security expert. I do work with security experts; and I try my best to implement their vision and understanding of the world and malicious actors; but, I am just a flawed human. As such, please see this post as an exploration and not necessarily as any kind of best practice.

The other day, I wrote about how the canonicalize() function decodes strings that look like HTML entities in Lucee 5. This feature of the function will inadvertently corrupt a URL that is being sanitized. There are other reasons that calling canonicalize() on an entire URL is a bad idea (such as the fact that URLs can contain non-malicious encoded values). As such, I wanted to see if I could canonicalize a URL by breaking it up into its various URI components; calling canonicalize() on the individual parts; and then joining those components back into a single, sanitized URL in Lucee CFML 5.3.6.61.

To be clear, canonicalization of a URL is a trade-off. When you canonicalize a URL in an effort to protect the user, you are going to break some things. For example, attacks are often perpetrated using double-encoding. However, it's perfectly reasonable for a non-malicious URL to contain double-encoded values, especially in the search string. For example, if you Google for:

What is %2f in a URL?

The %2f is going to be encoded into the Google search URL as %252f, since the % has to be encoded as %25 in order to not confuse the server. This leads to a URL that contains a double-encoded value. As such, if you attempt to canonicalize said search URL, the double-encoded value may be stripped out (depending on your settings), which will corrupt the URL.
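To make that concrete, here is a small JavaScript sketch (JavaScript rather than CFML, purely for illustration) of how an innocent search phrase ends up double-encoded. It assumes the standard encodeURIComponent() and decodeURIComponent() functions; note that a real Google URL would typically use + for spaces, whereas encodeURIComponent() uses %20.

```javascript
// The raw search phrase, exactly as the user typed it.
const phrase = "What is %2f in a URL?";

// Encoding for a query string escapes the literal "%" as "%25", so the
// user's "%2f" becomes "%252f": a perfectly legitimate double-encoded value.
const encoded = encodeURIComponent(phrase);
console.log(encoded); // "What%20is%20%252f%20in%20a%20URL%3F"

// Decoding once recovers what the user typed.
const decodedOnce = decodeURIComponent(encoded);
console.log(decodedOnce); // "What is %2f in a URL?"

// Decoding a second time silently changes the meaning of the search.
const decodedTwice = decodeURIComponent(decodedOnce);
console.log(decodedTwice); // "What is / in a URL?"
```

The second decode is the step that a canonicalizer treats as suspicious, which is why this harmless URL gets caught in the same net as a real attack.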

Honestly, I am not even sure if stripping double-encoding from the search string is necessary. Double-encoding may be more relevant to the domain and path components of a URL? Again, I'm not a security expert.

That said, I've sat down and tried to create a ColdFusion component that will take a URL, break it up into its various components, and then run canonicalize() and encodeForUrl() on them individually. The hope is to sanitize the URL; but, not in so aggressive a manner that common elements get corrupted.

Here's what I came up with - this ColdFusion component, UrlSanitizer.cfc, only exposes a single method: sanitizeUrl().

component
	output = false
	hint = "I provide methods for sanitizing / canonicalizing URLs."
	{

	/**
	* I sanitize the given URL by breaking it apart, running canonicalize() on the
	* individual URI components, and then joining the components back together.
	* 
	* @urlInput I am the input being sanitized / canonicalized.
	*/
	public string function sanitizeUrl( required string urlInput ) {

		// Attempting to call canonicalize() on the entire URL would lead to corrupted
		// URLs. As such, we need to split the URLs up into smaller components and then
		// canonicalize the individual parts.
		var parts = splitUrl( urlInput );
		var sanitizedResource = canonicalizeResource( parts.resource );
		var sanitizedSearch = canonicalizeSearch( parts.search );

		return( joinUrl( sanitizedResource, sanitizedSearch ) );

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I canonicalize the given resource string, parsing it, canonicalizing the individual
	* path segments, and then collapsing it back down to a string.
	* 
	* @resourceInput I am the resource input being canonicalized.
	*/
	private string function canonicalizeResource( required string resourceInput ) {

		// First, let's normalize some slashes to make the parsing and pattern-matching
		// easier to perform.
		var normalizedInput = resourceInput
			.trim()
			// Make sure all variations on slashes are forward-slashes.
			.replace( "\", "/", "all" )
			// Reduce any strings of slashes down to a max of 2 (we need to limit it to
			// two at this point in order to allow for the protocol - we will further
			// reduce the slashes in a later step).
			.reReplace( "/{3,}", "//", "all" )
		;

		// Now that our slashes are normalized, let's split the resource into its parts.
		var parts = splitResource( normalizedInput );
		// The domain can be canonicalized on its own.
		var sanitizedDomain = canonicalizeUriComponent( parts.domain );
		// BUT, the PATH of the resource has to be further broken down into its components
		// so that we can canonicalize them individually.
		var sanitizedPath = parts.path
			// Reduce any strings of slashes down to 1 - only the protocol is allowed to
			// have two slashes, and that's been broken-out into its own part above.
			.reReplace( "//+", "/", "all" )
			// Split the path up into separators and segments.
			.reMatch( "/|[^/]+" )
			.map(
				( segment ) => {

					if ( segment == "/" ) {

						return( segment );

					}

					return( encodeUriComponent( canonicalizeUriComponent( segment ) ) );

				}
			)
			.toList( "" )
			// At this point, the canonicalization process may have stripped-out
			// malicious encodings that have generated a path that contains a string of
			// slashes. Let's collapse any string-of-slashes down to a single slash.
			.reReplace( "//+", "/", "all" )
		;

		return( sanitizedDomain & sanitizedPath );

	}


	/**
	* I canonicalize the given search string, parsing it, canonicalizing the individual
	* search components, and then collapsing it back down to a string.
	* 
	* @searchInput I am the search input being canonicalized.
	*/
	private string function canonicalizeSearch( required string searchInput ) {

		var sanitizedParams = parseSearch( searchInput )
			.map(
				( searchParam ) => {

					return(
						encodeUriComponent( canonicalizeUriComponent( searchParam.key ) ) &
						"=" &
						encodeUriComponent( canonicalizeUriComponent( searchParam.value ) )
					);

				}
			)
		;

		return( sanitizedParams.toList( "&" ) );

	}


	/**
	* I run the core canonicalize() function using safety constraints (and security
	* settings) which guarantee that a string will always be returned. This is intended
	* to perform the BASIC canonicalization, regardless of which part of the URI it is
	* being run against.
	* 
	* @value I am the value being canonicalized.
	*/
	private string function canonicalizeUriComponent( required string value ) {

		// CAUTION: Spaces seem to cause a lot of problems in URLs when you decode and
		// then re-encode them. The canonicalize() function allows the "+" to come
		// through as-is. But, subsequent calls to encodeForUrl() then seem to cause
		// issues, especially if the given value has an encoded-plus in it as well. As
		// such, let's explicitly normalize the "+" to indicate a space and to
		// differentiate it from any encoded-plus characters.
		var canonicalValue = value.replace( "+", " ", "all" );

		try {

			// 1st TRUE: Checking for multiple / double encoding (throws error).
			// 2nd TRUE: Checking for mixed encoding (throws error).
			// --
			// NOTE: Calling canonicalize() on an empty-string returns NULL in earlier
			// versions of Lucee. As such, using Elvis-operator with fall-back string.
			canonicalValue = ( canonicalize( canonicalValue, true, true ) ?: "" );

		} catch ( any error ) {

			canonicalValue = "";

		}

		canonicalValue = canonicalValue
			// Strip out control characters.
			.reReplace( "[[:cntrl:]]", "", "all" )

			// Strip out high-ASCII values (outside of \x00-\x7F). While a URL is allowed
			// to have high-ASCII values, they are typically included as a means to
			// trick the user into a feeling of false-safety with visually-similar
			// characters. So, TO BE CLEAR, this step WILL BREAK SOME URLs - your mileage
			// may vary.
			.reReplace( "[^[:ascii:]]", "", "all" )
		;

		return( canonicalValue );

	}


	/**
	* I encode the given value for use in a URL.
	* 
	* @value I am the value being encoded.
	*/
	private string function encodeUriComponent( required string value ) {

		// It's possible that encodeForUrl() is overly-aggressive in how it escapes parts
		// of the URL. According to the RFC 3986 spec for "Uniform Resource Identifiers"
		// (URIs), the following values are "unreserved" characters; as such, we're going
		// to put them back into the URL.
		// --
		// SPEC: http://tools.ietf.org/html/rfc3986#section-2.3
		var encodedValue = encodeForUrl( value )
			.replace( "%7E", "~", "all" )
			// This one isn't necessary; but, I think it makes for a more attractive URL.
			.replace( "%20", "+", "all" )
		;

		return( encodedValue );

	}


	/**
	* I join the given resource and search strings together to form a URL. If the
	* resource is empty, the search will be discarded (since a URL without a resource
	* wouldn't make any sense).
	* 
	* @resourceInput I am the resource portion of the desired URL.
	* @searchInput I am the search portion of the desired URL.
	*/
	private string function joinUrl(
		required string resourceInput,
		required string searchInput
		) {

		if ( ! resourceInput.len() || ! searchInput.len() ) {

			return( resourceInput );

		}

		return( resourceInput & "?" & searchInput );

	}


	/**
	* I parse the given search string into an array of key-value pairs. If a search
	* parameter doesn't have an assignment, its value will be returned as the empty-
	* string.
	* 
	* @searchInput I am the search input being parsed.
	*/
	private array function parseSearch( required string searchInput ) {

		var searchParams = [];

		// First, let's replace-out any escaped ampersands in order to make the parsing
		// of the search-string easier. We will inject these back into the result
		// when aggregating the individual components.
		var escapedAmp = "&amp;";
		var placeholderAmp = "____________AMP____________";
		// After this, every remaining "&" should be a legitimate search delimiter.
		var escapedInput = searchInput.replaceNoCase( escapedAmp, placeholderAmp, "all" );

		for ( var match in escapedInput.listToArray( "&" ) ) {

			if ( match.find( "=" ) ) {

				var key = match.listFirst( "=" );
				var value = match.listRest( "=" );

			} else {

				var key = match;
				var value = "";

			}

			// As we collect the key-value pairs, put the escaped-ampersands back in.
			searchParams.append({
				key: key.replace( placeholderAmp, escapedAmp, "all" ),
				value: value.replace( placeholderAmp, escapedAmp, "all" )
			});

		}

		return( searchParams );

	}


	/**
	* I split the given resource into Domain and Path parts. If there is no domain in
	* the resource, the domain is returned as an empty string.
	* 
	* @resourceInput I am the resource input being split.
	*/
	private struct function splitResource( required string resourceInput ) {

		// Try to match a prefix that looks like a protocol (with or without a scheme)
		// followed by a domain.
		// --
		// Example: https://www.bennadel.com
		// Example: //www.bennadel.com
		var domainMatches = resourceInput.reMatchNoCase( "^([a-z]+:)?//[^/?]*" );

		if ( domainMatches.len() ) {

			var domain = domainMatches[ 1 ];
			var path = resourceInput.mid( ( domain.len() + 1 ), resourceInput.len() );

			return({
				domain: domain,
				path: path
			});

		} else {

			return({
				domain: "",
				path: resourceInput
			});

		}

	}


	/**
	* I split the given URL into a "Resource" and a "Search" component. If there is no
	* search delimiter (?) in the given URL, the search will be returned as an empty-
	* string.
	* 
	* @urlInput I am the URL input being split.
	*/
	private struct function splitUrl( required string urlInput ) {

		if ( urlInput.find( "?" ) ) {

			return({
				resource: urlInput.listFirst( "?" ),
				search: urlInput.listRest( "?" )
			});

		} else {

			return({
				resource: urlInput,
				search: ""
			});

		}

	}

}

I've tried to leave a lot of comments in the code (typical Ben-style). But, basically, this code is breaking the URL into three parts:

Domain
Path
Search

It then iterates over the components within each one of those parts in order to canonicalize and then encode them separately.
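As a rough illustration of that three-way split, here is a JavaScript sketch (not the CFML component itself). The function name splitIntoParts() is hypothetical; the regular expression is borrowed from the splitResource() method above. Unlike the CFML version, it splits on the first ? by index rather than by list functions.

```javascript
// Hypothetical helper: split a URL into domain, path, and search parts.
function splitIntoParts(urlInput) {
  // Everything after the first "?" is the search string.
  const queryIndex = urlInput.indexOf("?");
  const resource = queryIndex === -1 ? urlInput : urlInput.slice(0, queryIndex);
  const search = queryIndex === -1 ? "" : urlInput.slice(queryIndex + 1);

  // An optional scheme, followed by "//", followed by the host.
  const domainMatch = resource.match(/^([a-z]+:)?\/\/[^/?]*/i);
  const domain = domainMatch ? domainMatch[0] : "";

  // Whatever follows the domain is the path.
  return { domain: domain, path: resource.slice(domain.length), search: search };
}

const parts = splitIntoParts("https://www.bennadel.com/blog/3832?a=b&c=d");
console.log(parts.domain); // "https://www.bennadel.com"
console.log(parts.path);   // "/blog/3832"
console.log(parts.search); // "a=b&c=d"
```

Each of the three values can then be canonicalized and re-encoded with rules that suit that part of the URL.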

One thing I'm not sure about is whether or not the search string ever acts as an attack vector. Whereas something like double-encoding will almost certainly never make sense in the domain and path portions of a URL, I can easily create a scenario (see the %2f discussion above) in which the search string contains a double-encoded value. As such, I'm wondering if there's ever a need to call canonicalize() on the search string. Perhaps splitting the URL on the ? and simply canonicalizing the resource would be sufficient.
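To show why a multiple-encoding check trips over that kind of search string, here is a deliberately simplified JavaScript sketch. It is not how canonicalize() is implemented; it only counts how many URL-decoding passes it takes for a value to stop changing. The function name countDecodePasses() and the cap of 10 passes are assumptions made for this example.

```javascript
// Hypothetical helper: count how many decode passes change the value.
// 0 = not encoded, 1 = encoded once, 2 or more = multiply encoded.
function countDecodePasses(value) {
  let current = value;
  let passes = 0;

  // Cap the loop so a pathological input cannot spin forever.
  while (passes < 10) {
    let decoded;
    try {
      decoded = decodeURIComponent(current);
    } catch (error) {
      // Malformed escape sequence: stop counting.
      break;
    }
    if (decoded === current) {
      break;
    }
    current = decoded;
    passes++;
  }

  return passes;
}

console.log(countDecodePasses("hello"));   // 0
console.log(countDecodePasses("a%2fb"));   // 1
console.log(countDecodePasses("a%252fb")); // 2
```

A strict checker would reject anything that scores 2 or higher, and that includes the harmless %252f from the Google search example.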

The same argument could be made for stripping-out high-ASCII values. High-ASCII values aren't in-and-of-themselves malicious. However, in a URL, they become an attack vector because you can lull a user into a false sense of security by creating a domain-name that looks like an official domain name, but is actually using foreign characters.

For example, imagine registering a domain-name in which you use the Greek letter omicron (decimal value 959) instead of the English letter o:

var malicious = ( "g" + String.fromCharCode( 959 ).repeat( 2 ) + "gle" );
console.log( malicious ); // "google"
console.log( malicious == "google" ); // false

By leaving the high-ASCII values in the domain (or not converting them to Punycode), a user may click on something that looks like a link to a trusted resource (i.e., Google), but is actually a link to a malicious site. Which raises the question: is something like this only meaningful in the domain name portion of the URL?
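Here is a small JavaScript sketch of what the stripping step does to that look-alike domain. It mirrors the two reReplace() calls in canonicalizeUriComponent() using JavaScript character ranges in place of the POSIX classes; the function name stripSuspectCharacters() is hypothetical.

```javascript
// Hypothetical helper: remove control characters and anything outside
// of the 7-bit ASCII range (\x00-\x7F).
function stripSuspectCharacters(value) {
  return value
    .replace(/[\x00-\x1F\x7F]/g, "")  // control characters
    .replace(/[^\x00-\x7F]/g, "");    // high-ASCII / non-ASCII characters
}

// The look-alike domain: two Greek omicrons in place of the Latin "o".
const lookAlike = "g" + String.fromCharCode(959).repeat(2) + "gle";

console.log(stripSuspectCharacters(lookAlike)); // "ggle"
console.log(stripSuspectCharacters("google"));  // "google"
```

The look-alike no longer resolves to anything that resembles the trusted domain, but the same step would also mangle a legitimate internationalized domain name, which is the trade-off being weighed here.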

I'm sorry that I only have questions, not answers to this stuff. This is why we have a dedicated team of application security experts whose passion it is to understand these topics in much better depth than I do.

Anyway, this was a fun code-kata on parsing URLs in ColdFusion if nothing else. I am sure there are Java libraries that I could have used to do this for me; so, if you have one to suggest, please let me know!

Short link: https://bennadel.com/go/3832

Reader Comments

Brad Wood May 26, 2020 at 1:07 PM

Hi Ben, custom parsing of a URL can be handled much simpler with the java.net.URL class in the JDK. Here's a post I did on it years ago:

http://www.codersrevolution.com/blog/Fun-with-javanetURL

Ben Nadel May 26, 2020 at 6:26 PM

@Brad,

Yeah, in retrospect, it probably would have been easier to use the java.net.URL class to get me started. With that, I could have at least had it get the protocol, path, and search string into addressable methods. I'd still have to further parse the search string into key-value pairs; but, it would definitely have reduced the overall work.