Ask Ben: Getting The Domain Name From The Referer URL

Posted August 26, 2009 at 9:24 AM

Tags: ColdFusion, Ask Ben

Hi Ben Nadel, I want to use regex code to extract only domain name for http referrers, can you please give me clue? thanks.

Normally, when we think about the domain name of a URL (which is what the CGI.http_referer value is), we think of it as the part of the URL that comes after the protocol (http://, https://, ftp://, etc.) and before the first slash or port. As such, we could grab the domain name by using a positive look behind on the protocol. However, the regular expression engine that ColdFusion uses (POSIX) does not allow for look behinds. As such, we can either reach down into the Java layer and use the Java Pattern library; or, we can hack the POSIX engine to do what we need.
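For reference, here is a minimal sketch of what the look behind approach could look like in plain Java with java.util.regex. The class name LookBehindDomain and the method domainOf() are made up for illustration, and returning an empty string when nothing matches is my assumption, not something the question asked for:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name invented for this sketch).
class LookBehindDomain {

    // Positive look behind: the match must be preceded by "://", but the
    // protocol itself is not part of the match. The match then runs up to
    // the first slash or colon (path or port).
    private static final Pattern HOST = Pattern.compile("(?<=://)[^/:]+");

    // Returns the domain name, or an empty string when the value has no
    // protocol (assumption: callers treat "" as "no referer domain").
    static String domainOf(String referer) {
        Matcher matcher = HOST.matcher(referer);
        return matcher.find() ? matcher.group() : "";
    }
}
```

Calling LookBehindDomain.domainOf( "http://www.shemuscle.com/category/anonymous/" ) returns "www.shemuscle.com". From ColdFusion, the same classes would be reached through createObject( "java", ... ), which is the extra ceremony the rest of this post tries to avoid.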

Since reaching down into the Java layer is probably overkill for such a use case, I will instead explore two different methods to use the POSIX regex engine to get what we need. One method will use the REReplace() function and one method will use the REMatch() function (only available in ColdFusion 8 or later).

Using REReplace() To Get The Domain Name

While it might not seem intuitive to use a replace function to extract part of a string, if we replace the entire string with the substring that we are seeking, what do we end up with? Our target substring. Because the REReplace() function allows us to capture groups in our regular expression and then use them within our "replacement text," we have the ability to replace our original string with just our target string:


    <!---
        Define a referer. Normally this would come out of the CGI
        scope, but for now, we are going to simulate it.
    --->
    <cfset referer = "http://www.shemuscle.com/category/anonymous/" />

    <!---
        Because standard ColdFusion regex (POSIX) does not give us
        nice look behind functionality, we don't have a nice way to
        gather the match using reMatch(). As such, the easiest way we
        can get the referring domain is to replace the entire string
        with the captured domain.

        Notice that in the following reReplace() statement, we are
        matching the entire string; but, we are only capturing the
        domain in group one (\1). Then, we replace the entire string
        (referer value) with the contents of the first group (domain).
        This leaves us with just the domain name of the referer.
    --->
    <cfset refererDomain = reReplace(
        referer,
        "^\w+://([^\/:]+)[\w\W]*$",
        "\1",
        "one"
        ) />

    <!--- Output the referer domain. --->
    <cfoutput>
        Referer Domain: #refererDomain#
    </cfoutput>

Notice that in the regular expression pattern, we are matching the entire URL, but we are only capturing the domain value. In doing so, it allows us to reference the captured domain in our replacement text. And, as you can see, we are replacing the entire URL with the value of the captured group - our domain. And, when we run this code, we get the following output:

Referer Domain: www.shemuscle.com
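One thing to keep in mind with the replace trick: a replace leaves the input untouched when the pattern does not match, so an empty or relative referer comes back as-is rather than as an empty domain. The following is a small Java sketch of the same pattern that shows both behaviors. The class ReplaceToExtract and its two method names are hypothetical, and the guarded variant that returns an empty string is my assumption about what a caller would want:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (names invented for this sketch).
class ReplaceToExtract {

    // Same idea as the pattern above: match the whole URL, capture only
    // the domain in group one.
    private static final Pattern WHOLE_URL =
        Pattern.compile("^\\w+://([^/:]+)[\\w\\W]*$");

    // Replace the entire string with group one. If the pattern does not
    // match, the original value is returned unchanged.
    static String replaceWithGroup(String referer) {
        return WHOLE_URL.matcher(referer).replaceFirst("$1");
    }

    // Guarded variant: only trust group one when the whole value matched;
    // otherwise return an empty string (assumption).
    static String domainOrEmpty(String referer) {
        Matcher matcher = WHOLE_URL.matcher(referer);
        return matcher.matches() ? matcher.group(1) : "";
    }
}
```

With the sample referer, both methods return "www.shemuscle.com". With a value like "/local/path", replaceWithGroup() hands back "/local/path" while domainOrEmpty() returns an empty string, so a check of this kind is worth adding before treating the result as a domain.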

Using REMatch() To Get The Domain Name

If you are using ColdFusion 8 or later, you can use the REMatch() function to gather all matches in a given string. We can use this function to match parts of the target URL and then pluck the domain name out of the returned matches. Because the regular expression engine scans the string from left to right, and tries the alternatives in our pattern in order at each position, we can have our pattern match the parts of the URL moving from left to right; first, we'll match the protocol, then the domain name, then the rest of the string:


    <!---
        Define a referer. Normally this would come out of the CGI
        scope, but for now, we are going to simulate it.
    --->
    <cfset referer = "http://www.shemuscle.com/category/anonymous/" />

    <!--- Extract the various parts of the URL. --->
    <cfset urlParts = reMatch(
        "^\w+://|[^\/:]+|[\w\W]*$",
        referer
        ) />

    <!--- Output the parts we captured: --->
    <cfoutput>
        <cfloop
            index="urlPart"
            array="#urlParts#">

            Part: #urlPart#<br />

        </cfloop>
    </cfoutput>

Notice in this code that our regular expression matches the three crucial parts of the URL. And, when we run this code, we get the following output:

Part: http://
Part: www.shemuscle.com
Part: /category/anonymous/

In this case, the domain name is the second item matched and can be extracted from the matches using urlParts[ 2 ].
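The position-based lookup only holds when the protocol is present, which is easy to see if we run the same alternation through java.util.regex. This is a sketch, not a port of REMatch(): the class UrlParts and its split() method are hypothetical, and the empty-match filter is there because Java's find() also reports a zero-length match at the very end of the string for the last alternative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (names invented for this sketch).
class UrlParts {

    // Protocol, OR a run of non-slash / non-colon characters, OR the
    // remainder of the string. Alternatives are tried in this order.
    private static final Pattern PARTS =
        Pattern.compile("^\\w+://|[^/:]+|[\\w\\W]*$");

    // Collect every non-empty match, moving left to right.
    static List<String> split(String referer) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = PARTS.matcher(referer);
        while (matcher.find()) {
            // Skip the zero-length match Java reports at end of input.
            if (!matcher.group().isEmpty()) {
                parts.add(matcher.group());
            }
        }
        return parts;
    }
}
```

For the sample referer, split() returns three parts with the domain in the second slot (index 1 in Java, index 2 in ColdFusion). For a value without a protocol, such as "www.example.com/x", the domain becomes the first part instead, so it is worth checking that the first match ends in "://" before relying on the position.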

Using java.net.URL To Get The Domain Name

As a final method, let's quickly explore the Java URL object. In the previous examples, we had to do all of the heavy lifting ourselves in terms of figuring out how to parse the URL using regular expressions. Well, if we use the Java class, java.net.URL, we can offload that heavy lifting. If we create an instance of the Java URL class and initialize it with our target URL, it will parse the URL internally and give us access to the URL components:


    <!---
        Define a referer. Normally this would come out of the CGI
        scope, but for now, we are going to simulate it.
    --->
    <cfset referer = "http://www.shemuscle.com/category/anonymous/" />

    <!--- Create a Java URL object based on our referer URL. --->
    <cfset javaUrl = createObject( "java", "java.net.URL" ).init(
        javaCast( "string", referer )
        ) />

    <!---
        The Java URL has parsed the url for us and we can now extract
        the components from our Java URL instance.
    --->
    <cfoutput>
        Referer Domain: #javaUrl.getHost()#
    </cfoutput>

Notice that all we have to do is create the URL instance and pass in our referer URL. The Java URL takes care of the rest. Then, all we have to do is ask it for the domain name (host) of the given URL. And, when we run the above code, we get the following output:

Referer Domain: www.shemuscle.com

Works like a charm and we didn't have to get our hands dirty with any regular expressions.
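The one catch is that the java.net.URL constructor throws a MalformedURLException when the value has no recognizable protocol, and an empty referer (a direct visit, for example) falls into that bucket. Here is a minimal Java sketch of a guarded version. The class RefererHost and the method hostOf() are hypothetical names, and falling back to an empty string is my assumption; in ColdFusion the equivalent guard would be a cftry / cfcatch around the init() call:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper (names invented for this sketch).
class RefererHost {

    // Let java.net.URL do the parsing and return just the host. An empty
    // or protocol-less value yields "" instead of an exception (assumption).
    // Note: newer JDKs flag this constructor as deprecated, but it still works.
    static String hostOf(String referer) {
        try {
            return new URL(referer).getHost();
        } catch (MalformedURLException e) {
            return "";
        }
    }
}
```

RefererHost.hostOf( "http://www.shemuscle.com/category/anonymous/" ) returns "www.shemuscle.com", and a port or query string in the URL does not leak into the result since getHost() only reports the host component.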

Regular expressions are a great tool in the programming toolbox; and, they are amazing for string parsing. But sometimes, we can offload the processing of strings to existing pieces of functionality like the Java URL class, and get what we need without any of the complexity associated with regular expressions. I hope this helps!


Reader Comments

Aug 26, 2009 at 9:53 AM
6 Comments

Some nice methods there Ben; I certainly didn't know about the JAVA method. For a similar thing, I've gone down the following route:

<cfset referer = "http://www.shemuscle.com/category/anonymous/" />
<Cfset a=ListToArray(referer,"/")>
<cfdump var="#a[2]#">

This is obviously assuming there is a "http://" at the front of the referer or URL but a simple check can be put in place to detect that and amend the output accordingly. I always tend to shy away from reg ex's due to never quite "getting" them.

Tom


Aug 26, 2009 at 9:56 AM
6,516 Comments

@Tom,

Tom, nice one! List usage is definitely an easy and straightforward way to go! I suppose we could also use ListToArray() and access it that way as well!

Awesome tip!


Aug 26, 2009 at 11:30 AM
45 Comments

I think this UDF does this, and more...

http://cflib.org/udf/parseUrl

For the record, I'm a fan of java.net.URL too!


Aug 28, 2009 at 12:03 PM
3 Comments

It's good to see Ben always giving multiple solutions :) I didn't know about the Java method either.

I generally just do: ListGetAt(referer,2,'/')

Of course, a check is required to make sure the list has at least 2 items.


Aug 28, 2009 at 12:10 PM
6,516 Comments

@Todd,

That looks like an intense UDF. That Dan Switzer is a really brilliant programmer.

@Sumit,

Thanks my man. Yeah, I totally forgot about using lists :)


Aug 31, 2009 at 3:09 PM
14 Comments

Speaking of regex, links and all that is there any way possible to get the value of a href?

I know how to get the actual anchor text but not the href value.


Sep 2, 2009 at 8:49 AM
6,516 Comments

@Jody,

Are you talking about getting it out of Anchor tags in a chunk of content?


Sep 2, 2009 at 10:26 PM
14 Comments

Ex.)

<a href="#I_WANT_THIS#">Link Text</a>

I'm pretty sure I could do some sort of array manipulation or something to get it.


Sep 6, 2009 at 11:46 AM
6,516 Comments

@Jody,

Where are you getting these anchor tags? I'm just trying to get a sense of what your use-case fully is.


Sep 7, 2009 at 11:55 PM
14 Comments

@Ben

The anchor tags are pulled from everywhere. There is no one set place that I pull them from.

We can use wikipedia as an example as this is what I'm currently working on.


Sep 8, 2009 at 7:49 AM
6,516 Comments

@Jody,

The easiest thing would probably be to extract all of the anchor tags, then from each of those, extract the HREF value. I think trying to go directly to the HREF value might be overly complicated.


Sep 8, 2009 at 8:55 AM
45 Comments

Jody: Does it need to be done on the server side? It would be really easy with jQuery on the client side.

$('a').each(function(){alert($(this).attr('href'))});


Sep 8, 2009 at 11:19 PM
14 Comments

Actually it does need to be done server side. I wish I could do it client side, as that would save a lot of resources, but at any rate I have figured it out. I just converted the whole page into an array delimited by this:

href=

I then split each item of that array on a " so I can call on each URL pretty easily. If someone wants the code you can just email me

creditprovided[at-sym]ymail.com

It's really simple when you actually think about it.

But thanks for helping me out I really appreciate it.


Sep 12, 2009 at 10:59 PM
6,516 Comments

@Jody,

When you do that, you just have to be careful if the page has any instances of "href=" that are not part of actual anchor tags. For example, pages that have sample code on them would contain href attributes that are not true links. But, that said, sounds like it's working for you, so I'm not gonna rock the boat.


Sep 22, 2009 at 12:42 AM
14 Comments

Correct you are Ben.

I did a few extra things to correct this issue.

