Using Verbose Regular Expressions To Explain Url Auto-Linking In ColdFusion

By Ben Nadel
Tags: ColdFusion

Recently on CF-Talk, Dave Phillips was talking about grabbing a list of URLs from a chunk of text. Some people want to grab the URLs - a lot of us would love to auto-link URLs in a chunk of text (such as in a comment posted to your blog). This can be quite difficult due to the complexities and varieties of the URL form. What it comes down to is being able to write a regular expression that properly matches on all the various URLs.

Writing regular expressions is tough. Understanding them is even tougher! What I have done here is use a verbose regular expression so that I can explain its construction step by step. A verbose regular expression allows you to use comments. It also ignores any white space that is not explicitly defined via "\n" and "\t" type characters. Below is the regular expression for matching URLs. I am storing it in a CFSaveContent variable buffer for use later in the demo. Please note that all comments begin with # (which must be written as ## since I am inside a CFOutput tag that is not shown).

  • <!---
  • Set up the regular expression for the URL. We are going
  • with a very verbose RegEx so we can really see what is
  • going on here. Remember, when doing verbose regular
  • expressions, ALL WHITE SPACE must be defined explicitly.
  •  
  • (?ix) = case-insensitive and verbose.
  • --->
  • <cfsavecontent variable="strIsUrlRegEx"
  • >(?ix)
  •  
  • ## We are going to wrap the entire match in parentheses so
  • ## that we can refer to the entire match as match group
  • ## one ($1).
  •  
  • (
  •  
  • ## First, we want to come up with the web protocol.
  • ## This will probably be HTTP but, let's define some
  • ## others just to make sure we cover the standards.
  • ## Remember, however, that the protocol is not really
  • ## required (or is it????).
  •  
  • (
  • (https?|ftp|gopher)://
  •  
  • ## If we do use a protocol, then we have the option
  • ## to define a username and password for this url.
  • ## Most urls will NOT have this, but sites that use
  • ## Windows authentication (???) use this. Better
  • ## safe than sorry. However, if login credentials
  • ## are used, they must end with "@".
  •  
  • ([^:]+\:[^@]*@)?
  •  
  • )?
  •  
  • ## Now, we have to define the sub-domain. This is generally
  • ## the "www" before the domain, but this might include any
  • ## number of values. This entire value is optional,
  • ## however, if this value is used, it MUST end with a "."
  •  
  • ([\d\w\-]+\.)?
  •  
  • ## Now, let's define the domain. This will be any
  • ## combination of values that has a domain extension
  • ## (ex. .com, .edu) and does not yet include a directory
  • ## structure of any kind. This is NOT an optional value.
  •  
  • [\w\d\-\.]+\.[\w\d]+
  •  
  • ## Once the domain and extension are defined, everything
  • ## else is optional. That means that everything after this
  • ## point MIGHT be there, but is not required. Therefore,
  • ## we have to group the rest of this and make it optional.
  •  
  • (
  •  
  • ## After the domain name and extension comes the
  • ## optional directory structure. This structure is
  • ## optional, but there may also be more than one
  • ## directory nested (hence the use of "*").
  •  
  • (
  • /[\w\d\-@%]+
  • )*
  •  
  •  
  • ## Now that we have defined the directory structure,
  • ## we have to define the optional file name and query
  • ## string. Because the file must be separate from both
  • ## the domain as well as any other directory structure,
  • ## we will require that it begin with a slash.
  •  
  • (
  • /
  •  
  • ## This is the file name which is an optional part
  • ## of the filename / query string combo.
  •  
  • (
  • ## File name
  •  
  • [\w\d\.\_\-@]+
  •  
  • \.
  •  
  • ## File extension.
  •  
  • [\w\d]+
  • )?
  •  
  • ## Now that we have the file name defined, we can
  • ## define the optional query string. The query
  • ## string, if it is used, MUST begin with the
  • ## question mark literal "?".
  •  
  • (
  • \?
  •  
  • ## After the query string delimiter (?), pretty
  • ## much anything is fair game. I don't actually
  • ## know what is valid in a URL so I just go
  • ## with this set of characters which is what I
  • ## see a lot.
  •  
  • [\w\d\?%,\.\/\##!@:=\+~_\-&]*
  •  
  • ## Even though the query string can contain
  • ## just about anything, I want to make sure that
  • ## it does NOT end with certain characters. Of
  • ## the characters above, I find that URLs using
  • ## query strings often should not include the
  • ## final "." as that is usually the period for
  • ## the containing sentence. Use a negative
  • ## look-behind to make sure that the previous
  • ## character is not of this set.
  • ## NOTE: This is available in Java version only.
  •  
  • (?<![\.])
  •  
  • )?
  •  
  • ## ^ Ended the optional query string.
  •  
  • )?
  •  
  • ## ^ Ended the optional file name / query string part.
  •  
  • )?
  •  
  • ## ^ We just ended the optional post-domain-extension area
  • ## of the url.
  •  
  • )
  •  
  • ## ^ Just ended the convenience group one ($1) that allows
  • ## us to refer to the entire URL as the first group.
  •  
  • </cfsavecontent>
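
For readers who want to try verbose mode outside of ColdFusion, here is a minimal Java sketch (Java being the engine behind the string methods used later in this post). The class name, the tiny protocol-only pattern, and the sample inputs are made up for illustration; this is not the URL pattern above, just a small demonstration of comments and ignored white space.

```java
import java.util.regex.Pattern;

final class VerboseRegexDemo {

    // (?x) turns on comments mode: white space in the pattern is
    // ignored and everything from # to the end of a line is a comment.
    // (?i) makes the match case-insensitive.
    static final Pattern PROTOCOL = Pattern.compile(
        "(?ix)            \n" +
        "# The protocol.  \n" +
        "( https? | ftp ) \n" +
        "# The separator. \n" +
        "://              \n"
    );

    // True only if the whole input is a protocol prefix.
    static boolean isProtocol(String input) {
        return PROTOCOL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProtocol("HTTPS://"));  // true
        System.out.println(isProtocol("ftp://"));    // true
        // White space is ignored in the pattern, not in the input.
        System.out.println(isProtocol("http ://"));  // false
    }
}
```

Note that the ## doubling is only needed inside a CFOutput tag; in plain Java a single # starts a comment.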

Now, admittedly, I do NOT know what makes a proper URL. I do not know what the set of valid characters is for a domain or a file name or a directory structure or even for a query string. The above is an approximation based on the URLs that I have come across. As you find URLs that are NOT matched by the above regular expression, you will need to tweak it as you see fit.

On top of that, please realize that there is a difference between the URL you find coming from the server and the URL coming from a user's entry. Once a URL goes through the browser and the server, it might have values like spaces automatically replaced with things like "%20" but REALIZE that when a user enters such a URL in a text box, the text blob does NOT have these values automatically escaped. Therefore, your regular expression will probably be tweaked differently depending on the situation in which you are going to use it.
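
To make the escaped-versus-raw point concrete, here is a small Java sketch. It reuses the directory character class from the verbose pattern above; the class name and the sample paths are hypothetical.

```java
final class EscapedVersusRaw {

    // One directory segment, using the same character class as the
    // directory portion of the verbose pattern above.
    static final String SEGMENT = "/[\\w\\d\\-@%]+";

    // True only if the whole input is a single directory segment.
    static boolean isSegment(String input) {
        return input.matches(SEGMENT);
    }

    public static void main(String[] args) {
        // As it might arrive from a browser: the space is escaped as %20.
        System.out.println(isSegment("/how%20to"));  // true
        // As a user might type it into a text box: a raw space.
        System.out.println(isSegment("/how to"));    // false
    }
}
```

If raw user input should be accepted, the character classes would have to be loosened accordingly, at the cost of more false matches.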

Now that we have our regular expression set up, let's create a buffer of URLs that we can test this against:

  • <!---
  • Save a list of urls that we are going to test against.
  • We are going to assume that a url is one-per-line. Let's
  • put the valid URLs at the top and the invalid URLs
  • at the bottom of the list.
  • --->
  • <cfsavecontent variable="strUrls">
  •  
  • <!--- These are URLs that we know should work. --->
  •  
  • http://xxxx.xxxx.com/xxxxxx/xxxxxxx.cfm?cdcrs=RD4823
  • http://xxxx.xxxx.com/xxxxxx/xxxxxxx.cfm?cdcrs=RD4823&empcode=GUEST43352&empcode=GUEST43352
  • http://xxxx.xxxx.com/on_job/supplychain/index.html
  • http://xxxx.xxxx.com
  • http://xxx.xxxxx.com/xxxxxx
  • https://xxxxx.xxxxxl.com/xxxxx/xxx/
  • http://www.bennadel.com/?cool=true
  • http://ben:cuties@www.hot-and-curvey.com
  • www.bennadel.com
  • bennadel.com
  •  
  • <!---
  • Below are URLs that should fail when matched against
  • our regular expression. We are putting them in to
  • make sure that our regex is not too relaxed.
  • --->
  •  
  • <!--- Not a url. --->
  • ben
  •  
  • <!--- Does not have HTTP. --->
  • htp://www.bennadel.com
  •  
  • <!--- Has space in domain. --->
  • http://www.ben nadel.com
  •  
  • <!--- Ends with period without query string. --->
  • http://xxxx.xxxx.com.
  •  
  • <!--- Does not have a domain extension. --->
  • http://hotchicks/sarah.htm
  •  
  • <!--- Does not have a domain name / extension. --->
  • http:///test.cfm
  •  
  • <!--- No directory name (used with "/"). --->
  • http://www.chick-o-matic.com//
  •  
  • <!--- No file name. --->
  • http://www.coldfusion-rocks.com/.cfm
  •  
  • <!--- Ends with period (and no query string) --->
  • http://www.i-heart-coldfusion.com/it-is-sexy.cfm.
  •  
  • <!--- Space in file name. --->
  • http://www.coldfusion-queries.com/how to.cfm
  •  
  • <!--- Has a username but no password. --->
  • http://joesmith@coldfusion-is-sweet.com
  •  
  • </cfsavecontent>

Now that we have the urls list, let's convert it to an array, loop over it, and run it against our regular expression for URL patterns:

  • <!---
  • Convert the list of urls to an array so that
  • we may easily loop over it.
  • --->
  • <cfset arrUrls = strUrls.Trim().Split( "\r?\n" ) />
  •  
  •  
  • <!---
  • Loop over the urls and test against our verbose
  • regular expression.
  • --->
  • <cfloop
  • index="intUrl"
  • from="1"
  • to="#ArrayLen( arrUrls )#"
  • step="1">
  •  
  • <!--- Check to make sure we have content to check. --->
  • <cfif Len( Trim( arrUrls[ intUrl ] ) )>
  •  
  • <p>
  • URL: #arrUrls[ intUrl ]#<br />
  •  
  • <!---
  • Use the Java String::Matches() method to check
  • to see if the given string fully matches the
  • given regular expression.
  • --->
  • IsUrl: #arrUrls[ intUrl ].Trim().Matches(
  • strIsUrlRegEx
  • )#
  • </p>
  •  
  • </cfif>
  •  
  • </cfloop>
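
The key detail in the loop above is that Java's matches() only succeeds when the entire string fits the pattern. The following Java sketch contrasts that with find(), which succeeds if any part of the string fits. The pattern here is a deliberately small stand-in (optional protocol plus a domain with an extension), not the full pattern from this post, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class FullMatchDemo {

    // A small stand-in for the URL pattern: optional protocol,
    // then a domain that must end with an extension.
    static final Pattern SIMPLE_URL =
        Pattern.compile("(?i)(https?://)?[\\w\\-.]+\\.\\w+");

    // Same semantics as String.matches(): the whole input must match.
    static boolean isUrl(String input) {
        return SIMPLE_URL.matcher(input.trim()).matches();
    }

    // find() only needs some part of the input to match.
    static boolean containsUrl(String input) {
        return SIMPLE_URL.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isUrl("http://www.bennadel.com"));         // true
        System.out.println(isUrl("http://www.ben nadel.com"));        // false
        System.out.println(containsUrl("http://www.ben nadel.com"));  // true
        System.out.println(isUrl("ben"));                             // false
    }
}
```

This is why the same pattern can be strict when validating a single value and still be usable for searching inside a larger blob of text.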

This gives us the output:

URL: http://xxxx.xxxx.com/xxxxxx/xxxxxxx.cfm?cdcrs=RD4823
IsUrl: YES

URL: http://xxxx.xxxx.com/xxxxxx/xxxxxxx.cfm?cdcrs=RD4823&empcode=GUEST43352&empcode=GUEST43352
IsUrl: YES

URL: http://xxxx.xxxx.com/on_job/supplychain/index.html
IsUrl: YES

URL: http://xxxx.xxxx.com
IsUrl: YES

URL: http://xxx.xxxxx.com/xxxxxx
IsUrl: YES

URL: https://xxxxx.xxxxxl.com/xxxxx/xxx/
IsUrl: YES

URL: http://www.bennadel.com/?cool=true
IsUrl: YES

URL: http://ben:cuties@www.hot-and-curvey.com
IsUrl: YES

URL: www.bennadel.com
IsUrl: YES

URL: bennadel.com
IsUrl: YES

URL: ben
IsUrl: NO

URL: htp://www.bennadel.com
IsUrl: NO

URL: http://www.ben nadel.com
IsUrl: NO

URL: http://xxxx.xxxx.com.
IsUrl: NO

URL: http://hotchicks/sarah.htm
IsUrl: NO

URL: http:///test.cfm
IsUrl: NO

URL: http://www.chick-o-matic.com//
IsUrl: NO

URL: http://www.coldfusion-rocks.com/.cfm
IsUrl: NO

URL: http://www.i-heart-coldfusion.com/it-is-sexy.cfm.
IsUrl: NO

URL: http://www.coldfusion-queries.com/how to.cfm
IsUrl: NO

URL: http://joesmith@coldfusion-is-sweet.com
IsUrl: NO

You will notice that all of our good URLs passed the test and that all of the URLs that we wanted to fail, did indeed fail.

OK, so our regular expression seems to work. But how do we go about auto-linking URLs that are contained within a blob of text? No problem. First, let's create a blob of text to test this on:

  • <!--- Store our test text. --->
  • <cfsavecontent variable="strText">
  • Hey dude, check out this picture from flickr.com:
  • http://flickr.com/photos/powerart/250221374/in/pool-36521969913@N01/.
  • I got it off of the "Geek Girls Are Sexy" photo group which you can
  • access directly at http://flickr.com/groups/36521969913@N01/. I
  • originally grabbed this sweet-ass photo off of page 38
  • (http://flickr.com/groups/36521969913@N01/pool/page38/), but it has
  • probably moved to a different page by now. You might notice that
  • page is also accessible via a query string (as opposed to directory
  • structure) at http://flickr.com/groups/36521969913@N01/pool/?page=38.
  • Anyway, just through I would pass it along. If you don't want to access
  • it through the Flickr interface, you can use Google.com to search for
  • photos on the Flickr site. Check out this link:
  • http://www.google.com/search?q=site%3Aflickr.com+geek+girls&btnG=Search
  • which searches for "geek girls" on the Flick.com site.
  • </cfsavecontent>

This ColdFusion variable contains various forms of URLs. Now, we can run a replace on the text using our regular expression. Remember from the regular expression that we grouped the entire match so that it can be referenced as group ONE. Since we are using the underlying Java string methods, we can reference this group as $1:

  • <p>
  • #strText.Trim().ReplaceAll(
  • strIsUrlRegEx,
  • "<a href=""$1"" target=""_blank"">$1</a>"
  • )#
  • </p>
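
Here is roughly what that call looks like in plain Java, as a minimal sketch. The pattern is a simplified stand-in that requires a protocol (not the full pattern above), and the class and method names are hypothetical. The $1 in the replacement string refers to the first capturing group, which wraps the whole URL.

```java
final class AutoLinkDemo {

    // Simplified stand-in: protocol, domain with extension, optional
    // directories, optional trailing slash. Group 1 wraps the whole URL.
    static final String SIMPLE_URL =
        "(?i)(https?://[\\w\\-.]+\\.\\w+(/[\\w\\-@%]+)*/?)";

    // Wrap every URL found in the text in an anchor tag.
    static String autoLink(String text) {
        return text.replaceAll(
            SIMPLE_URL,
            "<a href=\"$1\" target=\"_blank\">$1</a>"
        );
    }

    public static void main(String[] args) {
        System.out.println(autoLink("See http://flickr.com/groups/ now."));
    }
}
```

Because the domain must end in word characters, a sentence-ending period right after the URL is left outside the link.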

This gives us the following output:

Hey dude, check out this picture from flickr.com: http://flickr.com/photos/powerart/250221374/in/pool-36521969913@N01/. I got it off of the "Geek Girls Are Sexy" photo group which you can access directly at http://flickr.com/groups/36521969913@N01/. I originally grabbed this sweet-ass photo off of page 38 (http://flickr.com/groups/36521969913@N01/pool/page38/), but it has probably moved to a different page by now. You might notice that page is also accessible via a query string (as opposed to directory structure) at http://flickr.com/groups/36521969913@N01/pool/?page=38. Anyway, just through I would pass it along. If you don't want to access it through the Flickr interface, you can use Google.com to search for photos on the Flickr site. Check out this link: http://www.google.com/search?q=site%3Aflickr.com+geek+girls&btnG=Search which searches for "geek girls" on the Flick.com site.

You will notice that all URLs were linked except for the Google search URL. This URL failed because of Google's invalid file name "search". This seems like it is a directory but does not have the required "/" at the end. If you want to be able to link this, you will have to alter the regular expression above to allow for file names that do not have file extensions.

Additionally, you will notice that it linked both URLs with protocols (such as http://) as well as those without (ex. flickr.com). This is good, as it links the URL, but the problem is that all of these links are external links. The web browser, without having a protocol to go on, assumes that a link is local to your site. So what it is doing is linking to "flickr.com" as if it were a file in the current directory. We, of course, do not want that. In order to hack a fix for that, we can create an onClick event handler for the link that checks to see if we need to add a protocol before the link gets fired:

  • <p>
  • #strText.Trim().ReplaceAll(
  • strIsUrlRegEx,
  • "<a href=""$1"" target=""_blank"" onclick=""if (this.getAttribute( 'href' ).indexOf( '://' ) == -1){ this.href = ('http://' + this.getAttribute( 'href' )); }"">$1</a>"
  • )#
  • </p>

Running that we get:

Hey dude, check out this picture from flickr.com: http://flickr.com/photos/powerart/250221374/in/pool-36521969913@N01/. I got it off of the "Geek Girls Are Sexy" photo group which you can access directly at http://flickr.com/groups/36521969913@N01/. I originally grabbed this sweet-ass photo off of page 38 (http://flickr.com/groups/36521969913@N01/pool/page38/), but it has probably moved to a different page by now. You might notice that page is also accessible via a query string (as opposed to directory structure) at http://flickr.com/groups/36521969913@N01/pool/?page=38. Anyway, just through I would pass it along. If you don't want to access it through the Flickr interface, you can use Google.com to search for photos on the Flickr site. Check out this link: http://www.google.com/search?q=site%3Aflickr.com+geek+girls&btnG=Search which searches for "geek girls" on the Flick.com site.

The output looks the same, but all of the links are now external, even those that did not begin with a protocol. Of course, this will fail if someone has JavaScript turned off. If you are worried about that, you can create your own ColdFusion user defined function (UDF) that will check this during the replace (such as one made in conjunction with Java's Pattern Matcher).
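
As a rough idea of what such a server-side check could look like, here is a minimal Java sketch built on Pattern and Matcher. It is not the UDF mentioned above. The pattern is a simplified stand-in in which group 1 is the whole URL and group 2 is the optional protocol, and the class and method names are hypothetical. It also does no HTML escaping, which real code would need.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProtocolAwareLinker {

    // Simplified stand-in pattern. Group 1: the whole URL.
    // Group 2: the protocol, which is null when it is absent.
    static final Pattern SIMPLE_URL = Pattern.compile(
        "(?i)((https?://)?[\\w\\-]+(\\.[\\w\\-]+)*\\.[a-z]{2,}(/[\\w\\-@%]+)*/?)"
    );

    // Link every URL, adding "http://" to the href when no protocol was given.
    static String autoLink(String text) {
        Matcher matcher = SIMPLE_URL.matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            String url = matcher.group(1);
            String href = (matcher.group(2) == null) ? "http://" + url : url;
            String link = "<a href=\"" + href + "\" target=\"_blank\">" + url + "</a>";
            // quoteReplacement() keeps any $ or \ in the URL from being
            // treated as a group reference.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(link));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(autoLink("Try flickr.com today"));
        System.out.println(autoLink("See http://flickr.com/groups/ ok"));
    }
}
```

Doing the check on the server means the href is already correct when the page arrives, so nothing depends on the onClick handler running.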

As you can see, there is a lot that goes into defining a URL and matching it against a pattern. The explanation above is not foolproof, but it is meant to demonstrate, in a very clear manner, one method for auto-linking URLs. If you want to see what the regular expression would look like when it is NOT in a verbose format, we can strip out all of the comments and white space. What we are left with is this (I have broken it up into three lines to avoid wrapping, but you would have it all on one line):

(?i)(((https?|ftp|gopher)://([^:]+\:[^@]*@)?)?([\d\w\-]+\.)?
[\w\d\-\.]+\.[\w\d]+((/[\w\d\-@%]+)*(/([\w\d\.\_\-@]+\.[\w\d]+)?
(\?[\w\d\?%,\.\/\##!@:=\+~_\-&]*(?<![\.]))?)?)?)

Looking at it that way, you can see the beauty AND the power of the verbose regular expression not only as a teaching tool, but as an essential aspect of writing maintainable code.
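
Stripping a verbose pattern down by hand is error-prone, so here is a minimal Java sketch of doing it in code. It assumes, as is the case for the pattern in this post, that every comment sits on its own line and that the pattern contains no literal white space; a pattern with trailing comments or escaped spaces would need something smarter. The class and method names are hypothetical.

```java
final class VerboseStripper {

    // Collapse a verbose pattern into a single line by dropping
    // blank lines, whole-line comments, and all white space.
    static String strip(String verbosePattern) {
        StringBuilder compact = new StringBuilder();
        for (String line : verbosePattern.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            compact.append(trimmed.replaceAll("\\s+", ""));
        }
        // The x flag is no longer needed once the comments are gone.
        return compact.toString().replaceFirst("^\\(\\?ix\\)", "(?i)");
    }

    public static void main(String[] args) {
        String verbose =
            "(?ix)\n" +
            "# protocol\n" +
            "( https? | ftp )\n" +
            "\n" +
            "# separator\n" +
            "://\n";
        System.out.println(strip(verbose));  // (?i)(https?|ftp)://
    }
}
```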

NOTE: The above regular expression (my HUGE one at the top) was not working as well as I would like, so I simplified it a bit. The version below requires a protocol and only matches a fixed set of domain extensions:

  • <cfsavecontent variable="strIsUrlRegEx"
  • >(?ix)
  •  
  • ## We are going to wrap the entire match in parentheses so
  • ## that we can refer to the entire match as match group
  • ## one ($1).
  •  
  • (
  •  
  • ## First, we want to come up with the web protocol.
  • ## In this version, only HTTP and HTTPS are accepted
  • ## and the protocol is required.
  •  
  • (
  • (https?)://
  •  
  • ## If we do use a protocol, then we have the option
  • ## to define a username and password for this url.
  • ## Most urls will NOT have this, but sites that use
  • ## Windows authentication (???) use this. Better
  • ## safe than sorry. However, if login credentials
  • ## are used, they must end with "@".
  •  
  • ([^:]+\:[^@]*@)?
  •  
  • )
  •  
  • ## Now, we have to define the sub-domain. This is generally
  • ## the "www" before the domain, but this might include any
  • ## number of values. This entire value is optional,
  • ## however, if this value is used, it MUST end with a "."
  •  
  • ([\d\w\-]+\.)?
  •  
  • ## Now, let's define the domain. This will be any
  • ## combination of values that has a domain extension
  • ## (ex. .com, .edu) and does not yet include a directory
  • ## structure of any kind. This is NOT an optional value.
  •  
  • [\w\d\-\.]+\.(com|net|org|info|biz|tv|co\.uk|de|ro|it)
  •  
  • ## Once the domain and extension are defined, everything
  • ## else is optional. That means that everything after this
  • ## point MIGHT be there, but is not required. Therefore,
  • ## we have to group the rest of this and make it optional.
  •  
  • ## After the domain name and extension comes the
  • ## optional directory structure and file name.
  •  
  • (
  •  
  • ( / [\w\d\.\-@%\\\/:]* )+
  •  
  • )?
  •  
  • ## Now that we have the file name defined, we can
  • ## define the optional query string. The query
  • ## string, if it is used, MUST begin with the
  • ## question mark literal "?".
  •  
  • (
  • \?
  •  
  • ## After the query string delimiter (?), pretty
  • ## much anything is fair game. I don't actually
  • ## know what is valid in a URL so I just go
  • ## with this set of characters which is what I
  • ## see a lot.
  •  
  • [\w\d\?%,\.\/\##!@:=\+~_\-&]*
  •  
  • ## Even though the query string can contain
  • ## just about anything, I want to make sure that
  • ## it does NOT end with certain characters. Of
  • ## the characters above, I find that URLs using
  • ## query strings often should not include the
  • ## final "." as that is usually the period for
  • ## the containing sentence. Use a negative
  • ## look-behind to make sure that the previous
  • ## character is not of this set.
  • ## NOTE: This is available in Java version only.
  •  
  • (?<![\.])
  •  
  • )?
  •  
  • ## ^ Ended the optional query string.
  •  
  •  
  • )
  •  
  • ## ^ Just ended the convenience group one ($1) that allows
  • ## us to refer to the entire URL as the first group.
  •  
  • </cfsavecontent>
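
For reference, here is the simplified pattern transcribed into a compact Java form, as a sketch rather than a tested port. The class and method names are hypothetical, and the sample URLs are only there to show the two new restrictions (required protocol, fixed list of extensions) in action.

```java
import java.util.regex.Pattern;

final class SimplifiedUrlMatcher {

    static final Pattern SIMPLIFIED_URL = Pattern.compile(
        "(?i)(" +
        // Required protocol, optional "user:password@" credentials.
        "((https?)://([^:]+\\:[^@]*@)?)" +
        // Optional sub-domain.
        "([\\d\\w\\-]+\\.)?" +
        // Domain plus one of a fixed set of extensions.
        "[\\w\\d\\-\\.]+\\.(com|net|org|info|biz|tv|co\\.uk|de|ro|it)" +
        // Optional directory structure and file name.
        "((/[\\w\\d\\.\\-@%\\\\\\/:]*)+)?" +
        // Optional query string that must not end with a period.
        "(\\?[\\w\\d\\?%,\\.\\/\\#!@:=\\+~_\\-&]*(?<![\\.]))?" +
        ")"
    );

    // True only if the whole (trimmed) input is a URL.
    static boolean isUrl(String input) {
        return SIMPLIFIED_URL.matcher(input.trim()).matches();
    }

    public static void main(String[] args) {
        // File names without extensions are now accepted.
        System.out.println(isUrl(
            "http://www.google.com/search?q=site%3Aflickr.com+geek+girls&btnG=Search"));
        // No protocol, so this no longer matches.
        System.out.println(isUrl("www.bennadel.com"));
        // Extension is not in the list.
        System.out.println(isUrl("http://www.bennadel.museum"));
    }
}
```

The trade-off is the one you would expect: fewer false matches, but any extension missing from the list has to be added by hand.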


Reader Comments

Ha ha, I am actually not using the same regular expression on my blog yet (I wrote this one fresh for the demo and it is MUCH better than the one I use currently).

Reply to this Comment

Just had a thought regarding domain extensions. Right now, there is no constraint on it other than it needs to be at least one character. You can update the regular expression to only allow certain domains ex: (com|edu|gov|net|info|biz|tv). That will not only be more clear into what is being matched, but it will also stop things like file names from being matched as domain names.

Reply to this Comment

Tangentially related, but I'm curious if anyone has written a parser that converts regular expressions to plain English. I admit to getting perverse pleasure from documenting my code, but I'm sure not everyone is in that boat.

You could build a vocabulary along the lines of...
\^ = "starts with"
\[\^.*\] = "does not contain"
\{(\d){0,},(\d){0,}} = "$1 to $2"
...etc.

And a regex like...
^[A-Za-z]{2,4}\d{2,7}$

Would result in an English explanation that could be copied and pasted into the source code.
Match a string that starts with 2 to 4 characters including A-Z and a-z and ends with 2 to 7 digits.

So on and so forth

Match string

Reply to this Comment

That would be cool, but it sounds quite difficult. I think for small regular expression formulas it is very doable, but I think once you get more complicated it would become quite difficult.

But then again, if anyone has done that, it would be wicked cool.

Reply to this Comment

Sweet. I'll check it out. Although I was thinking of it more as a fun project to try. :-)

Reply to this Comment

Ben,

nice work as usual! I love the code snippets you always seem to post. I had an idea about a code snippets product that I initially did in ColdFusion - maybe we can collaborate on something like that.

On a more direct note - who the hell uses the "gopher" protocol, he he.

Reply to this Comment

Boyan,

Glad you like. And yeah, definitely let's touch base about your idea. As far as Gopher... I don't even know what it is :) I just put it in there because everyone else seems to put it in there.

Reply to this Comment

Thinking about the whole natural language conversion of regex's more, I realized that in addition to code commenting, it might also be useful for automatic generation of error messages. Then if the validation criteria changes later on, the error messages automatically change instead of having to remember to change umpteen functions or variables in the code. Using the same regex as my previous example, you might get...

"The value you entered for [field name] must start with 2 to 4 characters, including A-Z and a-z, and end with 2 to 7 digits, jackass!"

The jackass is tacked on for good measure. ;-)

Reply to this Comment

It would be a cool feature. Perhaps I can put some time into thinking about it. I have seen basic implementations of this, but they are horribly done.

... you can never go wrong with 'Jackass' :)

Reply to this Comment

Ben, your claim that Google's search URL contains an invalid filename is incorrect. That is a valid URL. The string "search" could be a filename (filenames are not required to have an extension), but much more likely it would be part of the directory path.

Reply to this Comment

Steve,

I use a Windows computer and I know that if my files don't have extensions, Windows is unhappy and doesn't know what to do with them. I don't know if it is different for other OS's. As far as it being a directory structure, I don't see how that is possible. Granted, I know nothing about standards, but I assume that it is a standard that when an item inside a directory is referenced, it must have a trailing slash.

Assuming you have a directory named "blam" and a file named "blam" in the same folder. If there was no trailing slash in a web app, and this was the code:

blam?q=test

How would the server know if you meant the file blam? or if you were implying the nested default page:

blam/default.htm?q=test

Frankly, something just doesn't smell right to me :) And of course, the regular expression here is for demonstration. I think it becomes harder for validation when you have less validation requirements.

Reply to this Comment

Ben, I use a Windows XP computer, and I am able to create and use files without extensions with no problems (since programs cannot be associated with files without extensions, double-clicking such files simply prompts you for what program to open it with).

Further, on every HTTP server I'm familiar with, the URI "/blam?q=test" will work as I described. The server will first check if a file called "blam" exists, and if not it will treat "blam" as a directory name, and load any existing default file (e.g., index.cfm) within the directory. However, while it is very common (for some people and sites, I guess) to access directories without a trailing slash, working with files without extensions is very rare (in my experience, at least). Hence, in my CF generic URI parser (see http://badassery.blogspot.com/2007/01/parsing-uris-in-coldfusion.html ), I would always treat "/blam" as part of the directory path.

Here's the regex I typically use to validate absolute URIs (when accuracy isn't of the utmost importance):

\b(?:https?|ftp)://(?:[a-z\d-]+\.)+[a-z]{2,6}(?:/\S*)?

It's not foolproof by any means, but on the plus side it's short, easily readable, and generally solid with typical input.

BTW, here is the official RFC regarding generic URI syntax: http://tools.ietf.org/html/rfc3986
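
For anyone who wants to try the shorter regex above from Java, here is a minimal sketch. The case-insensitive flag is an assumption (the character classes in the regex are lower-case only), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AbsoluteUriFinder {

    // The regex from the comment above, compiled case-insensitively
    // (the flag is an assumption, not part of the original regex).
    static final Pattern ABSOLUTE_URI = Pattern.compile(
        "\\b(?:https?|ftp)://(?:[a-z\\d-]+\\.)+[a-z]{2,6}(?:/\\S*)?",
        Pattern.CASE_INSENSITIVE
    );

    // Collect every absolute URI found in the text, in order.
    static List<String> findAll(String text) {
        List<String> uris = new ArrayList<String>();
        Matcher matcher = ABSOLUTE_URI.matcher(text);
        while (matcher.find()) {
            uris.add(matcher.group());
        }
        return uris;
    }

    public static void main(String[] args) {
        System.out.println(findAll(
            "Read http://tools.ietf.org/html/rfc3986 and ftp://example.com today"));
    }
}
```

Since the path portion is simply "/" followed by any run of non-white-space characters, a period that ends the sentence directly after a path would be included in the match.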

Reply to this Comment

Note that your regex failed on the very URI which defines generic URI syntax ( http://tools.ietf.org/html/rfc3986 ). ;-) If you add a slash to the end of it, in this case it actually will not work (I imagine that's because of the way they've set up URI rewriting on their servers).

Reply to this Comment

Steve,

[double-clicking such files simply prompts you for what program to open it with]

.. that's what I mean about throwing a fit. I didn't mean Windows would stop you in your naming.

Also, love the url "badassery". Sweeet. Nice UDF also. As far as the ability to call files like that - functional or not - it just doesn't feel right in my gut. I don't like the idea of the Server having to try completely different files until it finds one that works.

Although, I guess the same could be said about letting the server search for default documents.

I think, moral of the story, Regular expressions rock, as we both know. Most likely, there is no 100% bullet-proof URL matcher for URLs that are embedded within a chunk of text as the surrounding data is just too variable.

Reply to this Comment

Ha ha, I think the auto-linking in the comments uses a very old algorithm I wrote. I have yet to implement the one I described in this post :D

Reply to this Comment

[I think, moral of the story, Regular expressions rock, as we both know.]

Indeed. :-) As a new reader here, I'm looking forward to digging through your blog archives. Seems like we're most passionate about all the same technologies.....ColdFusion, JavaScript, regular expressions, etc.

BTW, if you haven't seen it already you should take a look at my reMatch() UDF (I just finished adding an intro explaining its functionality, etc.): http://badassery.blogspot.com/2007/01/coldfusion-regex-support-udfs-rematch.html

Reply to this Comment

Hello Ben (first,sorry for my bad english :-),

Thank you for this very interesting "paper".
It works perfectly. Any idea how to add "mailto" auto-linking to your useful RegEx? Regards,

Emmanuel (Belgium)

Reply to this Comment

Emmanuel,

First off, the English is totally fine. Secondly, a MailTo link has very different rules than a regular URL. I would think this would require a completely different regular expression. Maybe something like:

(?i)mailto:([^@]+?@[\w\d.-]+\.[\w]{2,5})

I am sure Steve could show you something better perhaps?

[^@]+?

Reluctant match for the email name.

[\w\d.-]+\.[\w]{2,5}

Domain name and extension. Not the most limiting but also perhaps not the most proper. Hope it helps point you in the right direction.
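
A minimal Java sketch of using the mailto pattern above for auto-linking follows. The class name, method name, and sample address are hypothetical, and no HTML escaping is done.

```java
final class MailtoLinker {

    // The mailto pattern from the comment above. Group 1 is the address.
    static final String MAILTO = "(?i)mailto:([^@]+?@[\\w\\d.-]+\\.[\\w]{2,5})";

    // Replace each "mailto:address" with a link that shows only the address.
    static String autoLink(String text) {
        return text.replaceAll(MAILTO, "<a href=\"mailto:$1\">$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(autoLink("Write to mailto:ben@example.com today"));
    }
}
```

Because the name portion accepts anything other than "@", it can swallow white space and other text between "mailto:" and the address, so a stricter character class may be worth considering.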

Reply to this Comment

Thank you very much for your help !

Your mailto RE works fine and I am now trying to integrate it into your useful auto-linking RE.

Best regards,

Emmanuel

Reply to this Comment

Thanks very much Ben.. exactly what I was looking for.

The only thing missing was the ability to deal with addresses formatted like this:

http://www.blahblahblah.com/index.cfm/152/569/

which is used on the site I work on, so I added:

(\/[\w\d]+(\/)?)*

in the part of your code that deals with optional query strings. (I have a fairly limited knowledge of reg expressions although I am starting to realize how brilliant they are!).

By the way, I love your web log. Absolutely essential reading for any ColdFusion developer!

Reply to this Comment

@Stefan,

Thanks. The above (my code, not yours) actually doesn't work as well. I will post my updated code.

Reply to this Comment

The updated code requires an internet protocol to be used (http, etc) and only matches some domain extensions (com, net, org, etc). I found this resulted in fewer false matches.

Reply to this Comment

"Assuming you have a directory named "blam" and a file named "blam" in the same folder."

Well, you can't. Most operating systems don't really treat files and directories that differently, from a particular point of view. If you try to create a file with the same name as an existing directory, it will fail (or, depending on the command, create the link/file/etc inside the directory). And vice versa: you cannot create a directory in the same location as a file with the same name.

Reply to this Comment

Well I'll be damned! I just tried it and you are absolutely correct :) It (Windows XP) won't let me do that.

Reply to this Comment

Almost there:

((?<!(<a href="))((https?|ftp|svn|nntp|file|aim|webcal)://)(www\.)?[][a-z0-9_$!&%?,#@'/.*+;:=~-]+)

Still matches if the URL is the text of a link, though.

Reply to this Comment

@Matt,

Yeah, contextual patterns are very tough for me, especially with something like an HTML tag where its structure can be so different from author to author.

You should check out / contact Steve Levithan:

http://blog.stevenlevithan.com

His RegEx skills are sick :)

Reply to this Comment

Thanks Ben (again)

I had a similar function but it was hanging up on https links, and on anything besides a space at the end of the url. i.e., broken for the most part unless you were careful to put a space between the .com and a period, or the next </p> or <br /> tag... which i can do but my CMS users of course will not!

This took care of it nicely!

Reply to this Comment

i just want to know... if windows recognizes file names, what's the need for all this code? if you write a doc that includes filename.html and you have a file in the folder named filename.html, why would you need any special code to link to filename.html????

what i was looking for, for months, is code that will auto CREATE the link for me in all my documents. For example, where the text in the html document includes a name such as 401 Code 32 and I have a file named 401_32.html or 401.32.html - how would I autoCREATE a link so that the text converts into a link and links to the file? sounds easy. but in months of searching, i found nothing.

Reply to this Comment

@Guy,

I am not sure what you mean by Windows recognizing file names? This post, from what I can remember, is about linking content within a text area (specifically for the web). Really, though, this post was more so an exploration of Verbose regular expression patterns.

When you want to link content, you have to create patterns. So, in your example, you'd probably have to create patterns based on your file names and then replace those in your content.

Reply to this Comment
