Hi Ben Nadel, I want to use regex code to extract only domain name for http referrers, can you please give me clue? thanks.
Normally, when we think about the domain name of a URL (which is what the CGI.http_referer value is), we think of the domain name as the part of the URL that comes after the protocol (http://, https://, ftp://, etc.). As such, we could grab the domain name by using a positive look behind on the protocol. However, the regular expression engine that ColdFusion uses (POSIX) does not allow for look behinds. As such, we can either reach down into the Java layer and use the Java Pattern library; or, we can hack the POSIX engine to do what we need.
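For comparison, here is a rough sketch of what the Java Pattern approach with a positive look behind might look like. The pattern and the variable names are my own assumptions for illustration, not code from the original post:

```cfml
<!--- Simulated referer (normally this would come from CGI.http_referer). --->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Compile a Java pattern that matches a run of characters that are
	neither "/" nor ":", but only when that run is immediately
	preceded by "://" (the positive look behind).
--->
<cfset pattern = createObject( "java", "java.util.regex.Pattern" ).compile(
	javaCast( "string", "(?<=://)[^/:]+" )
	) />

<!--- Create a matcher for our referer value. --->
<cfset matcher = pattern.matcher( javaCast( "string", referer ) ) />

<!--- find() returns true if the pattern was located in the string. --->
<cfif matcher.find()>
	<cfoutput>Referer Domain: #matcher.group()#</cfoutput>
</cfif>
```

With the sample referer above, the first match should be www.shemuscle.com, since that is the first run of characters that directly follows "://".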
Since reaching down into the Java layer is probably overkill for such a use case, I will instead explore two different methods to use the POSIX regex engine to get what we need. One method will use the REReplace() function and one method will use the REMatch() function (only available in ColdFusion 8 or later).
Using REReplace() To Get The Domain Name
While it might not seem intuitive to use a replace function to extract part of a string, if we replace the entire string with the substring that we are seeking, what do we end up with? Our target substring. Because the REReplace() function allows us to capture groups in our regular expression and then use them within our "replacement text," we have the ability to replace our original string with just our target string:
<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Because standard ColdFusion regex (POSIX) does not give us
	nice look behind functionality, we don't have a nice way to
	gather the match using reMatch(). As such, the easiest way we
	can get the referring domain is to replace the entire string
	with the captured domain.

	Notice that in the following reReplace() statement, we are
	matching the entire string; but, we are only capturing the
	domain in group one (\1). Then, we replace the entire string
	(referer value) with the contents of the first group (domain).
	This leaves us with just the domain name of the referer.
--->
<cfset refererDomain = reReplace(
	referer,
	"^\w+://([^\/:]+)[\w\W]*$",
	"\1",
	"one"
	) />

<!--- Output the referer. --->
<cfoutput>
	Referer Domain: #refererDomain#
</cfoutput>
Notice that in the regular expression pattern, we are matching the entire URL, but we are only capturing the domain value. In doing so, it allows us to reference the captured domain in our replacement text. And, as you can see, we are replacing the entire URL with the value of the captured group - our domain. And, when we run this code, we get the following output:
Referer Domain: www.shemuscle.com
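One thing to keep in mind with the replace approach: if the pattern does not match at all (for example, when the referer is empty), REReplace() hands back the original string unchanged. A minimal sketch of a guard, assuming we want an empty string in that case (the fallback value is my assumption):

```cfml
<!--- In production, this value comes from the CGI scope and may be empty. --->
<cfset referer = cgi.http_referer />

<!--- Same pattern as above: protocol, captured domain, then the rest. --->
<cfset urlPattern = "^\w+://([^\/:]+)[\w\W]*$" />

<!--- reFind() returns 0 when the pattern cannot be found. --->
<cfif reFind( urlPattern, referer )>
	<cfset refererDomain = reReplace( referer, urlPattern, "\1", "one" ) />
<cfelse>
	<!--- No usable referer; fall back to an empty string. --->
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```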
Using REMatch() To Get The Domain Name
If you are using ColdFusion 8 or later, you can use the REMatch() function to gather all matches in a given string. We can use this function to match parts of the target URL and then pluck the domain name out of the returned matches. Because regular expressions are evaluated from left to right in a greedy fashion, we can have our regular expression pattern match parts of the URL moving from left to right; first, we'll match the protocol, then the domain name, then the rest of the string:
<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Extract the various parts of the URL. --->
<cfset urlParts = reMatch(
	"^\w+://|[^\/:]+|[\w\W]*$",
	referer
	) />

<!--- Output the parts we captured: --->
<cfoutput>
	<cfloop index="urlPart" array="#urlParts#">
		Part: #urlPart#<br />
	</cfloop>
</cfoutput>
Notice in this code that our regular expression matches the three crucial parts of the URL. And, when we run this code, we get the following output:
In this case, the domain name is the second item matched and can be extracted from the matches using urlParts[ 2 ].
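Since urlParts[ 2 ] only holds the domain when the first match was actually a protocol, it may be worth checking that before reading the value. Here is a small sketch of that idea; the protocol check and the empty-string fallback are my own assumptions:

```cfml
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Match the protocol, then the domain, then the rest of the URL. --->
<cfset urlParts = reMatch( "^\w+://|[^\/:]+|[\w\W]*$", referer ) />

<!---
	Only trust the second match if we have at least two matches AND
	the first match looks like a protocol. Without a protocol, the
	domain would end up as the first match instead of the second.
--->
<cfif (
	arrayLen( urlParts ) gte 2 and
	reFind( "^\w+://$", urlParts[ 1 ] )
	)>
	<cfset refererDomain = urlParts[ 2 ] />
<cfelse>
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```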
Using java.net.URL To Get The Domain Name
As a final method, let's quickly explore the Java URL object. In the previous examples, we had to do all of the heavy lifting ourselves in terms of figuring out how to parse the URL using regular expressions. Well, if we use the Java class, java.net.URL, we can offload that heavy lifting. If we create an instance of the Java URL class and initialize it with our target URL, it will parse the URL internally and give us access to the URL components:
<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Create a Java URL object based on our referer URL. --->
<cfset javaUrl = createObject( "java", "java.net.URL" ).init(
	javaCast( "string", referer )
	) />

<!---
	The Java url has parsed the url for us and we can now extract
	the components from our Java url instance.
--->
<cfoutput>
	Referer Domain: #javaUrl.getHost()#
</cfoutput>
Notice that all we have to do is create the URL instance and pass in our referer URL. The Java URL takes care of the rest. Then, all we have to do is ask it for the domain name (host) of the given URL. And, when we run the above code, we get the following output:
Referer Domain: www.shemuscle.com
Works like a charm and we didn't have to get our hands dirty with any regular expressions.
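One caveat: the java.net.URL constructor throws a MalformedURLException when the string has no recognizable protocol, which includes an empty referer. A minimal sketch of guarding against that, assuming an empty string is an acceptable fallback:

```cfml
<cftry>

	<!--- Let Java parse the referer and hand back the host portion. --->
	<cfset refererDomain = createObject( "java", "java.net.URL" )
		.init( javaCast( "string", cgi.http_referer ) )
		.getHost()
		/>

	<cfcatch type="any">

		<!--- The referer was empty or could not be parsed as a URL. --->
		<cfset refererDomain = "" />

	</cfcatch>
</cftry>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```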
Regular expressions are a great tool in the programming toolbox; and, they are amazing for string parsing. But sometimes, we can offload the processing of strings to existing pieces of functionality like the Java URL class, and get what we need without any of the complexity associated with regular expressions. I hope this helps!
Some nice methods there Ben; I certainly didn't know about the JAVA method. For a similar thing, I've gone down the following route:
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />
This is obviously assuming there is a "http://" at the front of the referer or URL but a simple check can be put in place to detect that and amend the output accordingly. I always tend to shy away from reg ex's due to never quite "getting" them.
Tom, nice one! List usage is definitely an easy and straightforward way to go! I suppose we could also use ListToArray() and access it that way as well!
I think this UDF does this, and more...
For the record, I'm a fan of java.net.URL too!
It's good to see Ben always giving multiple solutions :) I didn't know about the Java method either.
I generally just do: ListGetAt(referer,2,'/')
Of course, a check is required to make sure the list has at least 2 items.
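A quick sketch of that list approach with the length check in place. ColdFusion lists skip empty elements by default, so with "/" as the delimiter the protocol ("http:") is item one and the domain is item two. The empty-string fallback is an assumption on my part:

```cfml
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Make sure the list has at least two "/"-delimited items. --->
<cfif listLen( referer, "/" ) gte 2>
	<cfset refererDomain = listGetAt( referer, 2, "/" ) />
<cfelse>
	<cfset refererDomain = "" />
</cfif>

<cfoutput>Referer Domain: #refererDomain#</cfoutput>
```

Note that, unlike the regular expression versions above, this will leave a port number attached to the domain if the URL has one (ex. www.example.com:8080).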
That looks like an intense UDF. That Dan Switzer is a really brilliant programmer.
Thanks my man. Yeah, I totally forgot about using lists :)
Speaking of regex, links, and all that: is there any way to get the value of an href?
I know how to get the actual anchor text but not the href value.
Are you talking about getting it out of Anchor tags in a chunk of content?
<a href="#I_WANT_THIS#">Link Text</a>
I'm pretty sure I could do some sort of array manipulation or something to get it.
Where are you getting these anchor tags? I'm just trying to get a sense of what your use-case fully is.
The anchor tags are pulled from everywhere. There is no one set place that I pull them from.
We can use wikipedia as an example as this is what I'm currently working on.
The easiest thing would probably be to extract all of the anchor tags, then from each of those, extract the HREF value. I think trying to go directly to the HREF value might be overly complicated.
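As a rough sketch of that two-step idea: first gather the opening anchor tags, then pull the HREF value out of each one. The sample content and variable names are made up for illustration, and this only handles double-quoted HREF attributes:

```cfml
<!--- Hypothetical chunk of content to search. --->
<cfset content = '<p><a href="http://www.example.com/one">One</a> and <a class="x" href="/two">Two</a></p>' />

<!--- Step 1: Gather all of the opening anchor tags. --->
<cfset anchorTags = reMatchNoCase( '<a\b[^>]*>', content ) />

<cfset hrefs = [] />

<cfloop index="anchorTag" array="#anchorTags#">

	<!--- Step 2: Look for a double-quoted href attribute in this tag. --->
	<cfset hrefAttributes = reMatchNoCase( '\shref\s*=\s*"[^"]*"', anchorTag ) />

	<cfif arrayLen( hrefAttributes )>

		<!---
			Replace the whole attribute with just the captured value
			between the quotes (same trick as the reReplace() demo).
		--->
		<cfset arrayAppend(
			hrefs,
			reReplace( hrefAttributes[ 1 ], '^[^"]*"([^"]*)"$', "\1", "one" )
			) />

	</cfif>

</cfloop>

<cfoutput>
	<cfloop index="href" array="#hrefs#">
		HREF: #href#<br />
	</cfloop>
</cfoutput>
```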
Jody: Does it need to be done on the server side? It would be really easy with jQuery on the client side.
Actually it does need to be done server side. I wish I could do it client side, as that would save a lot of resources, but at any rate I have figured it out. I just converted the whole page into arrays that are delimited by this
I then delimited that array by a " so I can call on each URL pretty easily. If someone wants the code you can just email me
It's really simple when you actually think about it.
But thanks for helping me out I really appreciate it.
When you do that, you just have to be careful if the page has any instances of "href=" that are not part of actual HREF tags. For example, pages that have sample code on it would have href tags that are not true HREF tags. But, that said, sounds like it's working for you, so I'm not gonna rock the boat.
Correct you are Ben.
I did a few extra things to correct this issue.
@Ben I am not sure that ListToArray() will help here.
For anyone trying to programmatically do stuff with the anchor on the response page besides the default behavior of the browser scrolling to the named anchor, the browser doesn't send it to the server so you won't get it in the cgi scope.
If your response page has jQuery, the anchor is "myAnchor" and the id of the div you want to highlight is "foo"...
This post just saved me a few minutes years after the fact. Thanks!