Ask Ben: Getting The Domain Name From The Referer URL

Posted August 26, 2009 at 9:24 AM by Ben Nadel

Tags: ColdFusion, Ask Ben

Hi Ben Nadel, I want to use regex code to extract only domain name for http referrers, can you please give me clue? thanks.

Normally, when we think about the domain name of a URL (which is what the CGI.http_referer value is), we think of the domain name as the part of the URL that comes after the protocol (http://, https://, ftp://, etc.). As such, we could grab the domain name by using a positive look behind on the protocol. However, the regular expression engine that ColdFusion uses (POSIX) does not allow for look behinds. As such, we can either reach down into the Java layer and use the Java Pattern library; or, we can hack the POSIX engine to do what we need.

Since reaching down into the Java layer is probably overkill for such a use case, I will instead explore two different methods to use the POSIX regex engine to get what we need. One method will use the REReplace() function and one method will use the REMatch() function (only available in ColdFusion 8 or later).
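For completeness, here is a rough sketch of what the look behind approach mentioned above might look like if we did reach into the Java layer. This is not one of the two methods explored below; it assumes the referer always contains "://", and the variable names (pattern, matcher) are just ones I picked for illustration. Note that Java requires look behinds to have a bounded length, so the sketch looks behind for the fixed "://" rather than for a variable-length protocol:

```cfml
<!--- Simulated referer; normally this would come from CGI.http_referer. --->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Compile a Java regex with a positive look behind on "://".
	We anchor on the fixed-length "://" because Java rejects
	unbounded look behinds such as (?<=^\w+://).
--->
<cfset pattern = createObject( "java", "java.util.regex.Pattern" ).compile(
	javaCast( "string", "(?<=://)[^/:]+" )
) />
<cfset matcher = pattern.matcher( javaCast( "string", referer ) ) />

<!--- find() returns true only if some part of the input matched. --->
<cfif matcher.find()>
	<cfset refererDomain = matcher.group() />
<cfelse>
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```

The matcher only ever returns the domain itself, which is the appeal of a look behind: the protocol is required for the match but is never part of it.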

Using REReplace() To Get The Domain Name

While it might not seem intuitive to use a replace function to extract part of a string, if we replace the entire string with the substring that we are seeking, what do we end up with? Our target substring. Because the REReplace() function allows us to capture groups in our regular expression and then use them within our "replacement text," we have the ability to replace our original string with just our target string:

  • <!---
  •     Define a referer. Normally this would come out of the CGI
  •     scope, but for now, we are going to simulate it.
  • --->
  • <cfset referer = "http://www.shemuscle.com/category/anonymous/" />
  • 
  • <!---
  •     Because standard ColdFusion regex (POSIX) does not give us
  •     nice look behind functionality, we don't have a nice way to
  •     gather the match using reMatch(). As such, the easiest way we
  •     can get the referring domain is to replace the entire string
  •     with the captured domain.
  • 
  •     Notice that in the following reReplace() statement, we are
  •     matching the entire string; but, we are only capturing the
  •     domain in group one (\1). Then, we replace the entire string
  •     (referer value) with the contents of the first group (domain).
  •     This leaves us with just the domain name of the referer.
  • --->
  • <cfset refererDomain = reReplace(
  •     referer,
  •     "^\w+://([^\/:]+)[\w\W]*$",
  •     "\1",
  •     "one"
  • ) />
  • 
  • <!--- Output the referer domain. --->
  • <cfoutput>Referer Domain: #refererDomain#</cfoutput>

Notice that in the regular expression pattern, we are matching the entire URL, but we are only capturing the domain value. In doing so, it allows us to reference the captured domain in our replacement text. And, as you can see, we are replacing the entire URL with the value of the captured group - our domain. And, when we run this code, we get the following output:

Referer Domain: www.shemuscle.com
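One caveat with the replace approach: when the pattern does not match at all (an empty CGI.http_referer, for instance), reReplace() hands back the original string untouched rather than an empty string. Here is a minimal sketch of guarding against that, assuming an empty string is an acceptable fallback. The domainPattern variable is just a name I made up for illustration:

```cfml
<!--- CGI.http_referer is an empty string when no referer was sent. --->
<cfset referer = cgi.http_referer />

<!--- Same pattern as above, stored once so we can test and replace with it. --->
<cfset domainPattern = "^\w+://([^\/:]+)[\w\W]*$" />

<!--- reFind() returns 0 (false) when the pattern does not match. --->
<cfif reFind( domainPattern, referer )>
	<cfset refererDomain = reReplace( referer, domainPattern, "\1", "one" ) />
<cfelse>
	<!--- No protocol or no referer at all: fall back to an empty string. --->
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```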

Using REMatch() To Get The Domain Name

If you are using ColdFusion 8 or later, you can use the REMatch() function to gather all matches in a given string. We can use this function to match parts of the target URL and then pluck the domain name out of the returned matches. Because regular expressions are evaluated from left to right in a greedy fashion, we can have our regular expression pattern match parts of the URL moving from left to right; first, we'll match the protocol, then the domain name, then the rest of the string:

  • <!---
  •     Define a referer. Normally this would come out of the CGI
  •     scope, but for now, we are going to simulate it.
  • --->
  • <cfset referer = "http://www.shemuscle.com/category/anonymous/" />
  • 
  • <!--- Extract the various parts of the URL. --->
  • <cfset urlParts = reMatch(
  •     "^\w+://|[^\/:]+|[\w\W]*$",
  •     referer
  • ) />
  • 
  • <!--- Output the parts we captured: --->
  • <cfoutput>
  •     <cfloop
  •         index="urlPart"
  •         array="#urlParts#">
  • 
  •         Part: #urlPart#<br />
  • 
  •     </cfloop>
  • </cfoutput>

Notice in this code that our regular expression matches the three crucial parts of the URL. And, when we run this code, we get the following output:

Part: http://
Part: www.shemuscle.com
Part: /category/anonymous/

In this case, the domain name is the second item matched and can be extracted from the matches using urlParts[ 2 ].
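Pulling the second item out blindly can bite us if the referer is empty or has no protocol, since the domain would then no longer be in position two. Here is a minimal sketch of a more defensive extraction, assuming we want an empty string whenever the first match is not a protocol:

```cfml
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Same pattern as above: protocol, then domain, then the rest. --->
<cfset urlParts = reMatch( "^\w+://|[^\/:]+|[\w\W]*$", referer ) />

<!---
	Only trust the second item when we have at least two matches
	and the first match is actually a protocol (ex. "http://").
--->
<cfif ( arrayLen( urlParts ) gte 2 ) and reFind( "^\w+://$", urlParts[ 1 ] )>
	<cfset refererDomain = urlParts[ 2 ] />
<cfelse>
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```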

Using java.net.URL To Get The Domain Name

As a final method, let's quickly explore the Java URL object. In the previous examples, we had to do all of the heavy lifting ourselves in terms of figuring out how to parse the URL using regular expressions. Well, if we use the Java class, java.net.URL, we can offload that heavy lifting. If we create an instance of the Java URL class and initialize it with our target URL, it will parse the URL internally and give us access to the URL components:

  • <!---
  •     Define a referer. Normally this would come out of the CGI
  •     scope, but for now, we are going to simulate it.
  • --->
  • <cfset referer = "http://www.shemuscle.com/category/anonymous/" />
  • 
  • <!--- Create a Java URL object based on our referer URL. --->
  • <cfset javaUrl = createObject( "java", "java.net.URL" ).init(
  •     javaCast( "string", referer )
  • ) />
  • 
  • <!---
  •     The Java url has parsed the url for us and we can now extract
  •     the components from our Java url instance.
  • --->
  • <cfoutput>Referer Domain: #javaUrl.getHost()#</cfoutput>

Notice that all we have to do is create the URL instance and pass in our referer URL. The Java URL takes care of the rest. Then, all we have to do is ask it for the domain name (host) of the given URL. And, when we run the above code, we get the following output:

Referer Domain: www.shemuscle.com

Works like a charm and we didn't have to get our hands dirty with any regular expressions.
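One thing to keep in mind with the Java approach is that the java.net.URL constructor throws a MalformedURLException when the string has no recognizable protocol, which includes the empty string we get when no referer is sent. Here is a minimal sketch of wrapping the call, assuming an empty string is a reasonable fallback for your use case:

```cfml
<!--- CGI.http_referer is an empty string when no referer was sent. --->
<cfset referer = cgi.http_referer />

<cftry>

	<!--- Let the Java URL class parse the referer and hand back the host. --->
	<cfset refererDomain = createObject( "java", "java.net.URL" ).init(
		javaCast( "string", referer )
	).getHost() />

	<!--- The URL could not be parsed (ex. empty or missing protocol). --->
	<cfcatch type="any">
		<cfset refererDomain = "" />
	</cfcatch>

</cftry>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```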

Regular expressions are a great tool in the programming toolbox; and, they are amazing for string parsing. But sometimes, we can offload the processing of strings to existing pieces of functionality like the Java URL class, and get what we need without any of the complexity associated with regular expressions. I hope this helps!



Reader Comments

Aug 26, 2009 at 9:53 AM
14 Comments

Some nice methods there Ben; I certainly didn't know about the JAVA method. For a similar thing, I've gone down the following route:

<cfset referer = "http://www.shemuscle.com/category/anonymous/" />
<Cfset a=ListToArray(referer,"/")>
<cfdump var="#a[2]#">

This is obviously assuming there is a "http://" at the front of the referer or URL but a simple check can be put in place to detect that and amend the output accordingly. I always tend to shy away from reg ex's due to never quite "getting" them.

Tom


Aug 26, 2009 at 9:56 AM
11,314 Comments

@Tom,

Tom, nice one! List usage is definitely an easy and straightforward way to go! I suppose we could also use ListToArray() and access it that way as well!

Awesome tip!


Aug 26, 2009 at 11:30 AM
48 Comments

I think this UDF does this, and more...

http://cflib.org/udf/parseUrl

For the record, I'm a fan of java.net.URL too!


Aug 28, 2009 at 12:03 PM
10 Comments

It's good to see Ben always giving multiple solutions :) I didn't know about the JAVA method either.

I generally just do: ListGetAt(referer,2,'/')

Of course, a check is required to make sure the list has at least 2 items.


Aug 28, 2009 at 12:10 PM
11,314 Comments

@Todd,

That looks like an intense UDF. That Dan Switzer is a really brilliant programmer.

@Sumit,

Thanks my man. Yeah, I totally forgot about using lists :)


Aug 31, 2009 at 3:09 PM
19 Comments

Speaking of regex, links and all that is there any way possible to get the value of a href?

I know how to get the actual anchor text but not the href value.


Sep 2, 2009 at 8:49 AM
11,314 Comments

@Jody,

Are you talking about getting it out of Anchor tags in a chunk of content?


Sep 2, 2009 at 10:26 PM
19 Comments

Ex.)

<a href="#I_WANT_THIS#">Link Text</a>

I'm pretty sure I could do some sort of array manipulation or something to get it.


Sep 6, 2009 at 11:46 AM
11,314 Comments

@Jody,

Where are you getting these anchor tags? I'm just trying to get a sense of what your use-case fully is.


Sep 7, 2009 at 11:55 PM
19 Comments

@Ben

The anchor tags are pulled from everywhere. There is no one set place that I pull them from.

We can use wikipedia as an example as this is what I'm currently working on.


Sep 8, 2009 at 7:49 AM
11,314 Comments

@Jody,

The easiest thing would probably be to extract all of the anchor tags, then from each of those, extract the HREF value. I think trying to go directly to the HREF value might be overly complicated.


Sep 8, 2009 at 8:55 AM
48 Comments

Jody: Does it need to be done on the server side? It would be really easy with jQuery on the client side.

$('a').each(function(){alert($(this).attr('href'))});


Sep 8, 2009 at 11:19 PM
19 Comments

Actually it does need to be done server side. I wish I could do it client side, as that would save a lot of resources, but at any rate I have figured it out. I just converted the whole page into arrays that are delimited by this

href=

I then delimited that array by a " so I can call on each URL pretty easily. If someone wants the code you can just email me

creditprovided[at-sym]ymail.com

It's really simple when you actually think about it.

But thanks for helping me out I really appreciate it.


Sep 12, 2009 at 10:59 PM
11,314 Comments

@Jody,

When you do that, you just have to be careful if the page has any instances of "href=" that are not part of actual HREF tags. For example, pages that have sample code on it would have href tags that are not true HREF tags. But, that said, sounds like it's working for you, so I'm not gonna rock the boat.


Sep 22, 2009 at 12:42 AM
19 Comments

Correct you are Ben.

I did a few extra things to correct this issue.


Nov 11, 2011 at 5:45 PM
1 Comments

@Ben I am not sure that ListToArray() will help here.


Jul 6, 2012 at 12:17 PM
2 Comments

For anyone trying to programmatically do stuff with the anchor on the response page besides the default behavior of the browser scrolling to the named anchor, the browser doesn't send it to the server so you won't get it in the cgi scope.

Your only recourse is to process it with javascript on the response page you build. The anchor value is

location.href.split("#")[1]

If your response page has jQuery, the anchor is "myAnchor" and the id of the div you want to highlight is "foo"...

  • <script type="text/javascript">
  • $(document).ready(function(){
  • var anch=location.href.split("#")[1];
  • if(anch=="myAnchor")
  • $("#foo").css("background-color","yellow");
  • });
  • </script>
  • <a name="myAnchor"></a>
  • <div id="foo">hello world</div>


Nov 1, 2012 at 3:50 PM
155 Comments

This post just saved me a few minutes years after the fact. Thanks!


