Ask Ben: Parsing CSV Strings With Javascript Exec() Regular Expression Command

Posted February 19, 2009 at 10:03 AM

Tags: Javascript / DHTML, Ask Ben

Just curious if you can help me with this.... I have a user enter the following string: website "content approval" new - into a text box and need to break it up so that it is 3 separate words eg:

1) website
2) content approval
3) new

I have tried using the following RegEx - "([^\\"]|\\.)*" but this only gets whats enclosed in the double quotes, how can I also select the other 2 words but excluding what I have already picked out within the double quotes?

When I see your sample string, I immediately think of Comma Separated Values in which the separating character is the space rather than a comma. That's why the middle two words are qualified (quoted) - because we need to escape the space contained within the value. The good news is that I have done a lot of CSV parsing in ColdFusion. So, to help you solve this problem, I have decided to take my CSVToArray() ColdFusion user defined function and port it to Javascript.

Unfortunately, it looks like Javascript does not support atomic grouping; as such, I had to take out some of Steve Levithan's awesome CSV parsing optimizations. But, optimizations aside, the simplified regular expression pattern seems to work fine. And, since we're not going to be parsing huge strings into data arrays, I don't foresee any real problems.

That said, here is the Javascript port of my CSVToArray() UDF:

 Launch code in new window » Download code as text file »

  • <script type="text/javascript">
  •  
  • // This will parse a delimited string into an array of
  • // arrays. The default delimiter is the comma, but this
  • // can be overriden in the second argument.
  • function CSVToArray( strData, strDelimiter ){
  • // Check to see if the delimiter is defined. If not,
  • // then default to comma.
  • strDelimiter = (strDelimiter || ",");
  •  
  • // Create a regular expression to parse the CSV values.
  • var objPattern = new RegExp(
  • (
  • // Delimiters.
  • "(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +
  •  
  • // Quoted fields.
  • "(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|" +
  •  
  • // Standard fields.
  • "([^\"\\" + strDelimiter + "\\r\\n]*))"
  • ),
  • "gi"
  • );
  •  
  •  
  • // Create an array to hold our data. Give the array
  • // a default empty first row.
  • var arrData = [[]];
  •  
  • // Create an array to hold our individual pattern
  • // matching groups.
  • var arrMatches = null;
  •  
  •  
  • // Keep looping over the regular expression matches
  • // until we can no longer find a match.
  • while (arrMatches = objPattern.exec( strData )){
  •  
  • // Get the delimiter that was found.
  • var strMatchedDelimiter = arrMatches[ 1 ];
  •  
  • // Check to see if the given delimiter has a length
  • // (is not the start of string) and if it matches
  • // field delimiter. If id does not, then we know
  • // that this delimiter is a row delimiter.
  • if (
  • strMatchedDelimiter.length &&
  • (strMatchedDelimiter != strDelimiter)
  • ){
  •  
  • // Since we have reached a new row of data,
  • // add an empty row to our data array.
  • arrData.push( [] );
  •  
  • }
  •  
  •  
  • // Now that we have our delimiter out of the way,
  • // let's check to see which kind of value we
  • // captured (quoted or unquoted).
  • if (arrMatches[ 2 ]){
  •  
  • // We found a quoted value. When we capture
  • // this value, unescape any double quotes.
  • var strMatchedValue = arrMatches[ 2 ].replace(
  • new RegExp( "\"\"", "g" ),
  • "\""
  • );
  •  
  • } else {
  •  
  • // We found a non-quoted value.
  • var strMatchedValue = arrMatches[ 3 ];
  •  
  • }
  •  
  •  
  • // Now that we have our value string, let's add
  • // it to the data array.
  • arrData[ arrData.length - 1 ].push( strMatchedValue );
  • }
  •  
  • // Return the parsed data.
  • return( arrData );
  • }
  •  
  • </script>

I decided to go with the Regular Expression Exec() command rather than the Javascript String Replace method simply because I have done so much with the String Replace method lately. Don't worry too much about the regular expression itself - that stuff is meant to be written, not read.

With this new Javascript user defined function in place, let's set up a little test for text box input:

 Launch code in new window » Download code as text file »

  • <form>
  •  
  • <p>
  • <input type="text" value="" size="40"
  • onblur="console.log( CSVToArray( this.value, ' ' ) );"
  • />
  • </p>
  •  
  • </form>

Here, we are allowing the user to enter a value into the text box and onBlur of that input box, we are going to use FireBug to log the CSV parsed value of the entered text. When we run this on your sample data, we get the following:

 
 
 
 
 
 
Parsing CSV Values Using Javascript User Defined Function - CSVToArray(). 
 
 
 

As you can see from the Console, we get an array of arrays. Since there are no line breaks, the CSV array only contains one child array (one child array per row of data). From this, you can see that the first child array holds the three values you need to get at.

I hope this helps.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page




Learning ColdFusion 9 - ColdFusion 9 tutorials, samples, examples, demos

Reader Comments

Feb 19, 2009 at 11:07 AM // reply »
11 Comments

Couldn't you use the javascript .split() function here to break up the string into an array? It seems much simpler than the regex you suggested.

just an idea.

larry


Feb 19, 2009 at 11:14 AM // reply »
6,516 Comments

@Larry,

The String.split() takes a regex for the delimiter. Seeing as the delimiter in our case (space) can be escaped by surrounding qutoes, I am not sure how doable it is. I am certainly open to am more simple solution though if you have thoughts.


h3
Feb 19, 2009 at 3:36 PM // reply »
1 Comments

I think that's pretty much the RegExp you need: 'website "content approval" new'.match(/^(\w+)\s?"(.*)"\s?(\w+)$/i); Outputs: ["website "content approval" new", "website", "content approval", "new"] You could use ".slice(1)" to pop the first element, which is obviously not needed. cheers


Feb 19, 2009 at 3:43 PM // reply »
6,516 Comments

@H3, Most excellent post! I am not sure if I have ever used the String.match( regex ) function before. I will definitely have to take some more time to look into it. Much appreciated!


Feb 19, 2009 at 4:09 PM // reply »
11 Comments

From my reading the split() function can take either a regex or a simple string. given your string website "content approval" new You'd want to to use " as the delimiter. Then your resulting array would be website content approval new The only difficult part would be parsing the delimiter, that's where the regex would come in. btw what the foo are kinky points? regards, larry


Feb 19, 2009 at 4:12 PM // reply »
11 Comments

it looks like the blog app doesn't like returns, lets try that again<br/><br/> From my reading the split() function can take either a regex or a simple string. given your string <br/><br/>website "content approval" new<br/><br/> You'd want to to use " as the delimiter. Then your resulting array would be <br/><br/> website <br/><br/> content approval <br/><br/> new <br/><br/> The only difficult part would be parsing the delimiter, that's where the regex would come in. <br/><br/> btw what the foo are kinky points? <br/><br/> regards, larry


Feb 19, 2009 at 4:18 PM // reply »
6,516 Comments

@Larry,

The Kinky Points are for the shop:

http://www.bennadel.com/shop/kinky-solutions-shop.htm

Hmm, I just made some commenting updates. They might be stripping out the line breaks. I will get to the bottom of this ASAP!


Feb 19, 2009 at 4:28 PM // reply »
6,516 Comments

@Larry,

Thanks for helping me debug! I guess when I do an AJAX post vs. a standard FORM post the line breaks come through differently (even though they are both "POST" actions).


Feb 19, 2009 at 5:25 PM // reply »
4 Comments

@Ben, Thank you heaps for this.

It has now put me on the right track :)


Feb 20, 2009 at 12:29 PM // reply »
11 Comments

> @Larry,

> Thanks for helping me debug! I guess when I do an AJAX post
> vs. a standard FORM post the line breaks come through
> differently (even though they are both "POST" actions).

Any time Ben. Good to know you got things working again (of course until I post this I really have no clue that it will work properly).

larry


Mar 4, 2009 at 10:28 PM // reply »
2 Comments

Ben, Very useful code! I had a problem when using it with IE where in the special case that the first character is the delimiter, the code fails to store an empty string representing the empty value preceeding the delimiter. So I addeed the following code right before your while loop and it took care of the problem:
if (strData.charAt(0) == strDelimiter)
arrData [arrData.length - 1].push ("")


Mar 5, 2009 at 8:30 AM // reply »
6,516 Comments

@Steve,

Hmm, interesting. The algorithm should capture an empty field in the beginning. Were you getting this in IE only? Or on all browsers?


Mar 5, 2009 at 11:56 AM // reply »
2 Comments

Haven't tried another browser. But in studying the regular expression, I could see how the regexp engine could match the first delimiter rather than the "^" and then match whatever followed--simply skipping over the empty field. I tried to move the "^" to the beginning of the expression, but then the engine skipped over what immediately followed the first delimiter in an attempt not to get stuck in a loop.


Mar 5, 2009 at 12:58 PM // reply »
6,516 Comments

@Steve,

I think you are correct about the ^ being first. I think it was put at the end because it is not the "normal" delimiter. I'll have a play with it, see what I can do.


Rob
Jun 26, 2009 at 1:03 PM // reply »
4 Comments

What if you wanted to parse a string that contained a delimiter, but except for circumstances where it was contained in quotes:

Eg. 'ChartHeight = 400;ChartName = "Test;";ChartWidth = 600';

Here, my delimiter is the colon, but I have quoted values that don't span the distance between actualy delimiters so the function above does not work. I want to get:

array[0] = 'ChartHeight = 400';
array[1] = 'ChartName = "Test;"';
array[2] = 'ChartWidth = 600';

Thanks,

Rob


Jun 26, 2009 at 5:39 PM // reply »
6,516 Comments

@Rob,

When you call the function, are you passing in the delimiter you want to use:

CSVToArray( your_data, ";" )


Rob
Jun 29, 2009 at 11:08 AM // reply »
4 Comments

Ben,

Yes, I'm passing in the colon as the delimiter parameter. The problem is that in your regular expression search, quoted values assume the quoted value is surrounded by delimiters. In this case, we have a quoted delimiter "Test;", but this is preceeded by 'ChartName = '. This results in:

[0] = "ChartHeight = 400"
[1] = "ChartName = "
[2] = ""
[3] = "ChartWidth = 600"

Regards,

Rob


Jul 2, 2009 at 5:42 PM // reply »
2 Comments

How would I use this on an actual .CSV file?


Jul 2, 2009 at 5:56 PM // reply »
2 Comments

This works:
var classList = loadFile("/KC/data/classes.csv");
var classes = CSVToArray( classList , "," );

function loadFile(url)
{
var req = false;
// branch for native XMLHttpRequest object
if (window.XMLHttpRequest && !(window.ActiveXObject))
{
try { req = new XMLHttpRequest() }
catch(e) { req = false }
}
else // branch for IE/Windows ActiveX version
{
if (window.ActiveXObject)
{
try { req = new ActiveXObject("Msxml2.XMLHTTP") }
catch(e) { try { req = new ActiveXObject("Microsoft.XMLHTTP") }
catch(e) { req = false }
}
}
}

if(req)
{
req.open("GET", url, false);
req.send("");
return req.responseText
}
return ''
}


Jul 3, 2009 at 8:42 AM // reply »
6,516 Comments

@Rob,

Ahhh, I see. You are using single quotes as the field qualifier, not double quotes. I am not sure, but I believe that CSV uses double quotes as its standard (but I could be off on that).

I think all you have to do is change the double quotes to single quotes in the regular expression:

// Delimiters.
"(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +

// Quoted fields.
"(?:'([^']*(?:''[^']*)*)'|" +

// Standard fields.
"([^'\\" + strDelimiter + "\\r\\n]*))"

That might work.


Jul 21, 2009 at 3:12 PM // reply »
1 Comments

Since the delimiter is space, you can replace it with ',' using a litle bit of regex: somestring.replace(/(^\s*|\s*$)/, ",");

And then use String.split() to accomplish the task. KISS.


Rob
Jul 21, 2009 at 3:18 PM // reply »
4 Comments

I'm not actually working with typical CSV data. I'm working with a string where there are a number of name-value pairs separated by semicolons. I want to split the string by these semicolons, except where the semicolon is inside of double quotes. Does this make it more clear?


Jul 21, 2009 at 10:31 PM // reply »
2 Comments

ok DB, ever heard of character escaping....first off, escape them from you're data, then it wont be such a brain smasher for you. now if you still cant get it, send it to me, i'll fix, and send it back to you, along with my bill ofcourse.


Jul 22, 2009 at 8:28 AM // reply »
6,516 Comments

@Rob,

Try this:

var data = "ChartHeight = 400;ChartName = \"Test;\";ChartWidth = 600";
var values = data.match( new RegExp( "[^\";\r\n]+(\"[^\"]*\")*", "g" ) );

for( i in values ){
document.write( values[ i ] + "<br />" );
}

This gives you:

ChartHeight = 400
ChartName = "Test;"
ChartWidth = 600


Post Comment  |  Ask Ben

Recent Blog Comments
Nov 20, 2009 at 11:32 PM
Five Months Without Hungarian Notation And I'm Loving It
I've used headless camel case for years for not only ColdFusion variables, but also SQL tables and fields... pretty much everything involving code. I also subscribe to the "don't abbreviate and clea ... read »
Nov 20, 2009 at 11:00 PM
Five Months Without Hungarian Notation And I'm Loving It
@Marcel, Yeah, I always err on the side of longer but more readable variable names. As for the camel casing of CF methods and the headless camel casing of custom items, I get around this by always ... read »
Nov 20, 2009 at 10:56 PM
Five Months Without Hungarian Notation And I'm Loving It
I use the following and love it: my.namespace.MyComponents.functionMethodsOrUDF() CONSTANT_VALUES_OR_PROPERTIES One thing I always try is to CamelCaseBuiltInColdFusionFunctions() so others can tell ... read »
Nov 20, 2009 at 5:38 PM
Learning ColdFusion 8: CFImage Part I - Reading And Writing Images
Hi Ben, Great article. I've been looking around to see if ColdFusion image engine can programatically create the following "wrap around" effect: http://www.creativepro.com/article/photoshop-s-she ... read »
Nov 20, 2009 at 5:35 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Dave: I talked to Gert he suggested: <cfhttp method="get" url="http://{some cf website}" result="stuff" addtoken="yes" /> Note the addition of cfhttp attribute addtoken. That should persist y ... read »
Nov 20, 2009 at 5:23 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Todd, Ahh, gotcha, yeah that makes sense. ... read »
Nov 20, 2009 at 5:17 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
Ben, sorry if I didn't make this clear. You can make it work like that if you want, just put <cfset session.foo = 1> (and <cfset application.foo = 1>) in your OnRequestStart() and it reve ... read »
Nov 20, 2009 at 5:07 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Todd, I have seen tidbits about the way Railo handles session. I can understand that it lazy-loads sessions, but I also think that I might make some things more complicated. For example, often tim ... read »