Ask Ben: Getting The Domain Name From The Referer URL

By Ben Nadel

Published 2009-08-26 in Ask Ben, ColdFusion — Comments (18)

Hi Ben Nadel, I want to use regex code to extract only domain name for http referrers, can you please give me clue? thanks.

Normally, when we think about the domain name of a URL (which is what the CGI.http_referer value is), we think of it as the part of the URL that comes after the protocol (http://, https://, ftp://, etc.). As such, we could grab the domain name by using a positive look behind on the protocol. However, the regular expression engine that ColdFusion uses (POSIX) does not allow for look behinds. As such, we can either reach down into the Java layer and use the Java Pattern library; or, we can hack the POSIX engine to do what we need.
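
For reference, reaching down into the Java layer might look something like the following sketch. It assumes that java.util.regex.Pattern is available through createObject(), and it anchors the look behind on the literal "://" (a fixed-width look behind) rather than on the full protocol. The variable names are only for illustration:

```cfml
<!--- Simulated referer; normally this would come from the CGI scope. --->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Compile a Java pattern that matches a run of characters that
	are neither "/" nor ":", but only when that run is immediately
	preceded by "://" (positive look behind).
--->
<cfset pattern = createObject( "java", "java.util.regex.Pattern" ).compile(
	javaCast( "string", "(?<=://)[^/:]+" )
	) />

<cfset matcher = pattern.matcher( javaCast( "string", referer ) ) />

<!--- find() locates the first match; fall back to an empty string. --->
<cfif matcher.find()>
	<cfset refererDomain = matcher.group() />
<cfelse>
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```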

Since reaching down into the Java layer is probably overkill for such a use case, I will instead explore two different methods to use the POSIX regex engine to get what we need. One method will use the REReplace() function and one method will use the REMatch() function (only available in ColdFusion 8 or later).

Using REReplace() To Get The Domain Name

While it might not seem intuitive to use a replace function to extract part of a string, if we replace the entire string with the substring that we are seeking, what do we end up with? Our target substring. Because the REReplace() function allows us to capture groups in our regular expression and then use them within our "replacement text," we have the ability to replace our original string with just our target string:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Because standard ColdFusion regex (POSIX) does not give us
	nice look behind functionality, we don't have a nice way to
	gather the match using reMatch(). As such, the easiest way we
	can get the referring domain is to replace the entire string
	with the captured domain.

	Notice that in the following reReplace() statement, we are
	matching the entire string; but, we are only capturing the
	domain in group one (\1). Then, we replace the entire string
	(referer value) with the contents of the first group (domain).
	This leaves us with just the domain name of the referer.
--->
<cfset refererDomain = reReplace(
	referer,
	"^\w+://([^\/:]+)[\w\W]*$",
	"\1",
	"one"
	) />

<!--- Output the referer. --->
<cfoutput>Referer Domain: #refererDomain#</cfoutput>

Notice that in the regular expression pattern, we are matching the entire URL, but we are only capturing the domain value. In doing so, it allows us to reference the captured domain in our replacement text. And, as you can see, we are replacing the entire URL with the value of the captured group - our domain. And, when we run this code, we get the following output:

Referer Domain: www.shemuscle.com
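
One thing to keep in mind with this approach: when the pattern does not match at all (an empty referer, for example), reReplace() returns the original string untouched rather than an empty string. The following is a minimal sketch of a guard around that case. The getDomainFromUrl() function is a hypothetical helper made up for this illustration:

```cfml
<cffunction name="getDomainFromUrl" returntype="string" output="false">
	<cfargument name="targetUrl" type="string" required="true" />

	<!--- Same pattern as above: capture the domain in group one. --->
	<cfset var pattern = "^\w+://([^/:]+)[\w\W]*$" />

	<!--- reFind() returns 0 when the pattern does not match. --->
	<cfif reFind( pattern, arguments.targetUrl )>
		<cfreturn reReplace( arguments.targetUrl, pattern, "\1", "one" ) />
	</cfif>

	<!--- No match; return an empty string instead of the input. --->
	<cfreturn "" />
</cffunction>

<cfoutput>Referer Domain: #getDomainFromUrl( cgi.http_referer )#</cfoutput>
```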

Using REMatch() To Get The Domain Name

If you are using ColdFusion 8 or later, you can use the REMatch() function to gather all matches in a given string. We can use this function to match parts of the target URL and then pluck the domain name out of the returned matches. Because regular expressions are evaluated from left to right in a greedy fashion, we can have our regular expression pattern match parts of the URL moving from left to right; first, we'll match the protocol, then the domain name, then the rest of the string:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Extract the various parts of the URL. --->
<cfset urlParts = reMatch(
	"^\w+://|[^\/:]+|[\w\W]*$",
	referer
	) />

<!--- Output the parts we captured: --->
<cfoutput>
	<cfloop
		index="urlPart"
		array="#urlParts#">

		Part: #urlPart#<br />

	</cfloop>
</cfoutput>

Notice in this code that our regular expression matches the three crucial parts of the URL. And, when we run this code, we get the following output:

Part: http://
Part: www.shemuscle.com
Part: /category/anonymous/

In this case, the domain name is the second item matched and can be extracted from the matches using urlParts[ 2 ].
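
Since urlParts[ 2 ] only holds the domain when the first match is actually the protocol, it is probably worth checking the matches before reaching into the array. If the value has no protocol, the domain would likely show up as the first match instead. Here is a rough sketch of that check, reusing the pattern from above (the refererDomain variable is just for illustration):

```cfml
<cfset urlParts = reMatch(
	"^\w+://|[^\/:]+|[\w\W]*$",
	referer
	) />

<!--- Default to an empty string if we can't find the domain. --->
<cfset refererDomain = "" />

<!--- Make sure that we have at least a protocol and a domain. --->
<cfif ( arrayLen( urlParts ) gte 2 )>

	<!--- Make sure that the first match really is the protocol. --->
	<cfif reFind( "^\w+://$", urlParts[ 1 ] )>
		<cfset refererDomain = urlParts[ 2 ] />
	</cfif>

</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```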

Using java.net.URL To Get The Domain Name

As a final method, let's quickly explore the Java URL object. In the previous examples, we had to do all of the heavy lifting ourselves in terms of figuring out how to parse the URL using regular expressions. Well, if we use the Java class, java.net.URL, we can offload that heavy lifting. If we create an instance of the Java URL class and initialize it with our target URL, it will parse the URL internally and give us access to the URL components:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Create a Java URL object based on our referer URL. --->
<cfset javaUrl = createObject( "java", "java.net.URL" ).init(
	javaCast( "string", referer )
	) />

<!---
	The Java url has parsed the url for us and we can now extract
	the components from our Java url instance.
--->
<cfoutput>Referer Domain: #javaUrl.getHost()#</cfoutput>

Notice that all we have to do is create the URL instance and pass in our referer URL. The Java URL takes care of the rest. Then, all we have to do is ask it for the domain name (host) of the given URL. And, when we run the above code, we get the following output:

Referer Domain: www.shemuscle.com

Works like a charm and we didn't have to get our hands dirty with any regular expressions.
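
One caveat: the java.net.URL constructor throws a MalformedURLException when it is handed a string that has no protocol, which includes the empty string that CGI.http_referer often contains. The following is a small sketch of how we might guard against that, assuming that an empty string is an acceptable fallback value:

```cfml
<!--- In production, the referer may be empty or malformed. --->
<cfset referer = cgi.http_referer />

<cftry>

	<!--- This throws an error if the referer cannot be parsed. --->
	<cfset javaUrl = createObject( "java", "java.net.URL" ).init(
		javaCast( "string", referer )
		) />

	<cfset refererDomain = javaUrl.getHost() />

	<cfcatch type="any">

		<!--- Fall back to an empty string on any parsing error. --->
		<cfset refererDomain = "" />

	</cfcatch>
</cftry>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```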

Regular expressions are a great tool in the programming toolbox; and, they are amazing for string parsing. But sometimes, we can offload the processing of strings to existing pieces of functionality like the Java URL class, and get what we need without any of the complexity associated with regular expressions. I hope this helps!


Short link: https://bennadel.com/1692

Reader Comments

Tom Jenkins Aug 26, 2009 at 9:53 AM

14 Comments

Some nice methods there Ben; I certainly didn't know about the JAVA method. For a similar thing, I've gone down the following route:

This is obviously assuming there is a "http://" at the front of the referer or URL but a simple check can be put in place to detect that and amend the output accordingly. I always tend to shy away from reg ex's due to never quite "getting" them.

Tom

Ben Nadel Aug 26, 2009 at 9:56 AM

16,125 Comments

@Tom,

Tom, nice one! List usage is definitely an easy and straightforward way to go! I suppose we could also use ListToArray() and access it that way as well!

Awesome tip!

todd sharp Aug 26, 2009 at 11:30 AM

48 Comments

I think this UDF does this, and more...

http://cflib.org/udf/parseUrl

For the record, I'm a fan of java.net.URL too!

Sumit Verma Aug 28, 2009 at 12:03 PM

11 Comments

It's good to see Ben always giving multiple solutions :) I didn't know about the JAVA method either.

I generally just do: ListGetAt(referer,2,'/')

Of course, a check is required to make sure the list has at least 2 items.

Ben Nadel Aug 28, 2009 at 12:10 PM

16,125 Comments

@Todd,

That looks like an intense UDF. That Dan Switzer is a really brilliant programmer.

@Sumit,

Thanks, my man. Yeah, I totally forgot about using lists :)

Jody Fitzpatrick Aug 31, 2009 at 3:09 PM

19 Comments

Speaking of regex, links and all that is there any way possible to get the value of a href?

I know how to get the actual anchor text but not the href value.

Ben Nadel Sep 2, 2009 at 8:49 AM

16,125 Comments

@Jody,

Are you talking about getting it out of Anchor tags in a chunk of content?

Jody Fitzpatrick Sep 2, 2009 at 10:26 PM

19 Comments

Ex.)

I'm pretty sure I could do some sort of array manipulation or something to get it.

Ben Nadel Sep 6, 2009 at 11:46 AM

16,125 Comments

@Jody,

Where are you getting these anchor tags? I'm just trying to get a sense of what your use-case fully is.

Jody Fitzpatrick Sep 7, 2009 at 11:55 PM

19 Comments

@Ben

The anchor tags are pulled from everywhere. There is no one set place that I pull them from.

We can use wikipedia as an example as this is what I'm currently working on.

Ben Nadel Sep 8, 2009 at 7:49 AM

16,125 Comments

@Jody,

The easiest thing would probably be to extract all of the anchor tags, then from each of those, extract the HREF value. I think trying to go directly to the HREF value might be overly complicated.

todd sharp Sep 8, 2009 at 8:55 AM

48 Comments

Jody: Does it need to be done on the server side? It would be really easy with jQuery on the client side.

$('a').each(function(){alert($(this).attr('href'))});

Jody Fitzpatrick Sep 8, 2009 at 11:19 PM

19 Comments

Actually it does need to be done server side. I wish I could do it client side, as that would save a lot of resources, but at any rate I have figured it out. I just converted the whole page into arrays that are delimited by this

href=

I then delimited that array by a " so I can call on each URL pretty easily. If someone wants the code you can just email me

creditprovided[at-sym]ymail.com

It's really simple when you actually think about it.

But thanks for helping me out I really appreciate it.

Ben Nadel Sep 12, 2009 at 10:59 PM

16,125 Comments

@Jody,

When you do that, you just have to be careful if the page has any instances of "href=" that are not part of actual HREF tags. For example, pages that have sample code on it would have href tags that are not true HREF tags. But, that said, sounds like it's working for you, so I'm not gonna rock the boat.

Jody Fitzpatrick Sep 22, 2009 at 12:42 AM

19 Comments

Correct you are Ben.

I did a few extra things to correct this issue.

Michael Nov 11, 2011 at 5:45 PM

1 Comments

@Ben I am not sure that ListToArray() will help here.

Gordon Jul 6, 2012 at 12:17 PM

2 Comments

For anyone trying to programmatically do stuff with the anchor on the response page besides the default behavior of the browser scrolling to the named anchor, the browser doesn't send it to the server so you won't get it in the cgi scope.

Your only recourse is to process it with javascript on the response page you build. The anchor value is

location.href.split("#")[1]

If your response page has jQuery, the anchor is "myAnchor" and the id of the div you want to highlight is "foo"...

<script type="text/javascript">
$(document).ready(function(){
var anch=location.href.split("#")[1];
if(anch=="myAnchor")
$("#foo").css("background-color","yellow");
});
</script>
<a name="myAnchor"></a>
<div id="foo">hello world</div>

David McGuigan Nov 1, 2012 at 3:50 PM

167 Comments

This post just saved me a few minutes years after the fact. Thanks!
