Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at cf.Objective() 2009 (Minneapolis, MN) with:

Regular Expressions Make CSV Parsing In ColdFusion So Much Easier (And Faster)

By Ben Nadel on
Tags: ColdFusion

Today, I was working on some data parsing when it hit me like a bolt of lightening! Why not just use Regular Expressions to parse CSV (comma separated values) data files? For those of you who have seen my previous attempts at parsing CSV data files using ColdFusion, you will see that having to handle embedded field qualifiers and data delimiters made it a character by character parsing problem since you can no longer treat it like a list of lists (both comma a new line delimited). It was a huge amount of code to do a huge amount of work.

But then, as I was working today, suddenly I realized that working with embedded field qualifiers was exactly like working with embedded quotes in HTML tag attribute value parsing; it's just a pattern. So, what is the pattern of the CSV file. At it's highest level, it's a field value followed by an optional delimiter. I say "optional" delimiter because the last data field in the file will not be succeeded by a delimiter. The field value is then an optional qualified value containing zero or more characters. The allowable characters in the field are determined be the qualification of the data.

The regular expression I came up with was so short it made me giggle:

("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?

This ColdFusion compatible regular expression captures two groups; the first group is the field value (zero or more characters) and the second group is the optional succeeding delimiter. If this is totally not readable (as most regular expressions are not), I have converted this to a verbose regular expression (?x) with my typical amount of commenting so that you can really see how the pattern of the data field can be matched:

  • <!--- Create verbose regular expression. --->
  • <cfsavecontent variable="strRegex">(?x)
  •  
  • ## Every CSV "Group" consists of a field value
  • ## followed by an optional delimiter. This delimiter,
  • ## if it exists, will be either a field delimiler or
  • ## a line delimiter. If no delimiter exists, then we
  • ## are at the end of the file.
  •  
  • ## Let's get the field value. We need to consider two cases
  • ## in field values. Either the value is qualified or it is
  • ## not. According to standards, values that have an
  • ## embedded field or line delimiter or qualifier, MUST BE
  • ## contained in a qualified field.
  •  
  • (
  • ## Because the qualified fields are the exception case,
  • ## we want to check for those first. A qualified field
  • ## can contain non quotes AND escaped quotes.
  •  
  • "(?:[^"]|"")*"
  •  
  • | ## OR
  •  
  • ## If we did not match the qualified field token, then
  • ## we need to check for the non-qualified field token
  • ## which can be zero or more characters NOT consisting
  • ## of any qualifier or delimiter.
  •  
  • [^",\r\n]*
  •  
  • )
  •  
  • ## Now that we have captured the zero+ length field value,
  • ## we can get the token. Remember, if we are at the end
  • ## of the file, the token will not exist and therefore,
  • ## this group must be optional (?).
  •  
  • (
  • ## Comma
  •  
  • ,
  •  
  • ## OR return with optional newline.
  •  
  • |\r\n?
  •  
  • ## OR just new line (I include this option for safer
  • ## text values across operating systems (but this might
  • ## not be necessary).
  •  
  • |\n
  • )?
  •  
  • </cfsavecontent>

So now that you see how simple the pattern is, let's use it to start parsing our CSV data in ColdFusion. Luckily, from within ColdFusion, we have access to the Java Pattern and the Java Pattern Matcher which make our lives almost too easy when it comes to iterating over a string. As we iterate over the string, we are going to store our CSV values in an array of arrays in which each sub array will represent a row from the CSV data file.

  • <!---
  • Save CSV data values. Here, we are creating a CSV data
  • value that has both qualified and non-qualified field
  • values, populated and empty field values, and embedded
  • field qualifiers and field/line delimiters.
  • --->
  • <cfsavecontent variable="strCSV">
  • "Name","Nick Name","Age","Hair Color"
  • Kim,"Kim ""Hot Legs"" Smith",24,"Brunette"
  • "Sarah Vivenz, II","Stubs",27,"Brunette"
  • "Kit Williams",Kitty,34,Blonde,,,
  • "Even
  • Values With
  • Embedded Line Breaks"
  • </cfsavecontent>
  •  
  •  
  • <!--- Trim data values. --->
  • <cfset strCSV = Trim( strCSV ) />
  •  
  •  
  • <!---
  • Get the regular expression to match the tokens. I have
  • put the field value on the first line and delimiters on
  • the second line for easier reading.
  • --->
  • <cfset strRegEx = (
  • "(""(?:[^""]|"""")*""|[^"",\r\n]*)" &
  • "(,|\r\n?|\n)?"
  • )/>
  •  
  •  
  • <!---
  • Create a compiled Java regular expression pattern object
  • based on the pattern above.
  • --->
  • <cfset objPattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • JavaCast( "string", strRegEx )
  • )
  • />
  •  
  • <!---
  • Get the pattern matcher for our target text (the CSV data).
  • This will allows us to iterate over all the data fields.
  • --->
  • <cfset objMatcher = objPattern.Matcher(
  • JavaCast( "string", strCSV )
  • ) />
  •  
  •  
  • <!---
  • Create an array to hold the CSV data. We are going
  • to create an array of arrays in which each nested
  • array represents a row in the CSV data file.
  • --->
  • <cfset arrData = ArrayNew( 1 ) />
  •  
  • <!--- Start off with a new array for the new data. --->
  • <cfset ArrayAppend( arrData, ArrayNew( 1 ) ) />
  •  
  •  
  • <!---
  • Here's where the magic is taking place; we are going
  • to use the Java pattern matcher to iterate over each
  • of the CSV data fields using the regular expression
  • we defined above.
  •  
  • Each match will have at least the field value and
  • possibly an optional trailing delimiter.
  • --->
  • <cfloop condition="objMatcher.Find()">
  •  
  • <!--- Get the field token value. --->
  • <cfset REQUEST.Value = objMatcher.Group(
  • JavaCast( "int", 1 )
  • ) />
  •  
  • <!--- Remove the field qualifiers (if any). --->
  • <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
  • JavaCast( "string", "^""|""$" ),
  • JavaCast( "string", "" )
  • ) />
  •  
  • <!--- Unesacepe embedded qualifiers (if any). --->
  • <cfset REQUEST.Value = REQUEST.Value.ReplaceAll(
  • JavaCast( "string", "(""){2}" ),
  • JavaCast( "string", "$1" )
  • ) />
  •  
  • <!--- Add the field value to the row array. --->
  • <cfset ArrayAppend(
  • arrData[ ArrayLen( arrData ) ],
  • REQUEST.Value
  • ) />
  •  
  •  
  • <!---
  • Get the delimiter. If no delimiter group was matched,
  • this will destroy the variable in the REQUEST scope.
  • --->
  • <cfset REQUEST.Delimiter = objMatcher.Group(
  • JavaCast( "int", 2 )
  • ) />
  •  
  •  
  • <!--- Check for delimiter. --->
  • <cfif StructKeyExists( REQUEST, "Delimiter" )>
  •  
  • <!---
  • Check to see if we need to start a new array to
  • hold the next row of data. We need to do this if the
  • delimiter we just found is NOT a field delimiter.
  • --->
  • <cfif (REQUEST.Delimiter NEQ ",")>
  •  
  • <!--- Start new row data array. --->
  • <cfset ArrayAppend(
  • arrData,
  • ArrayNew( 1 )
  • ) />
  •  
  • </cfif>
  •  
  • <cfelse>
  •  
  • <!---
  • If there is no delimiter, then we are done parsing
  • the CSV file data. Break out rather than just ending
  • the loop to make sure we don't get any extra data.
  • --->
  • <cfbreak />
  •  
  • </cfif>
  •  
  • </cfloop>
  •  
  •  
  • <!--- Dump out CSV data array. --->
  • <cfdump
  • var="#arrData#"
  • label="CSV File Data"
  • />

There's a lot of commenting going on there (typical Me!), but if you filter all that out, this is only a few lines of code. That's how easy regular expressions make our lives! Compare that with any of my previous ColdFusion CSV parsing solutions to see that this one is a small fraction of the size. And, best part is, it works! Running the above code, we get the following CFDump output:


 
 
 

 
Parsing CSV Values In ColdFusion Using Regular Expressions  
 
 
 

It works on a small file, sure piece of cake, but what about performance? Sometimes we (ok, maybe just me) get sucked so hard into how sexy Regular Expressions are, that we lose sight of the big picture and we end up going down a path that is less performant in the long run. To test this, I am going to create a beefy CSV file with this code:

  • <!--- Create data file path. --->
  • <cfset strCSVPath = ExpandPath( "data.csv" ) />
  •  
  • <!---
  • Create sample data for a web site. Make this data beefy
  • to simulate what kind of stuff might actually be coming
  • out of a system-generated CSV file.
  • --->
  • <cfsavecontent variable="strCSV">"Kinky Solutions: A Student's Perspective","http://www.bennadel.com","Ben J. Nadel","ColdFusion,Web Development,CSS,Javascript,jQuery,SQL,SEO Optimization",,,09/21/2007 07:31:15,true,true,false,"5,340","34",,true</cfsavecontent>
  •  
  • <!--- Repeat string 10,000 times. --->
  • <cfset strCSV = RepeatString(
  • strCSV & Chr( 13 ) & Chr( 10 ),
  • (10 * 1000)
  • ) />
  •  
  • <!--- Write data to file. --->
  • <cffile
  • action="write"
  • file="#strCSVPath#"
  • output="#strCSV#"
  • />

This creates a 10,000 line CSV file with over 2,150,000 characters. This file comes out to be just over 2 megabytes in size. Now, to make sure I am not messing with the time trials, I am running this script separately than the parsing code.

Now for the test! I ran the current version that employs regular expressions to do the parsing against my previous version which checks characters.

Previous CSV Parsing Algorithm

Average parse time: 100,000 ms (100 seconds)

Regular Expression CSV Parsing Algorithm

Average parse time: 11,000 ms (11 seconds)

Holy Cow! The regular expression solution is about 10 times faster than the previous solution! And, this was no rinky-dinky example; this was a seriously large CSV file (at least in my world).

Just for fun, I regenerated the above test CSV file, except this time, I created a 50,000 record CSV file. This file has 10,750,000 characters in it and comes out at well over 10 megabytes. I didn't even want to bother running this in the old solution. However, I am very pleased to say that all 50,000 records were successfully parsed in just 57,000 ms (57 seconds)! That's pretty sweet-ass-sweet.

Let's stop for a second and just look at the performance scaling. The first run went 10,000 records in 11,000 ms. That comes out to be 909 records per second. The second run went 50,000 records in 57,000 ms. That comes out to be 877 records per second. As the data set grows, this thing is just gonna EAT UP MEMORY (creating this massive array in RAM), but as you can see, the efficiency of the algorithm is fairly constant. This is some awesome news.


 
 
 

 
Is It Hot In Here? Or Is It Just Your Use Of Regular Expressions?  
 
 
 



Reader Comments

Holy Pseudo-Random Strings Of Characters, Batman! :)

Firstoff, very cool.

Secondly, I have a question that will wither mean I'm starting to understand Regular Expressions or... well, not.

It looks to me like you've got two groupings that create unused backreferences, namely ("(?:[^"]|"")*"|[^",\r\n]*) and (,|\r\n?|\n) If I understand this correctly, wouldn't (?:"(?:[^"]|"")*"|[^",\r\n]*)(?:,|\r\n?|\n) give better performance because you're not creating unused backreferences?

Or am I missing something?

Reply to this Comment

@Matt,

True, the groups do create back references that can use within the current regular expression. However, it makes the matched pattern much more convenient when grabbing the matched text when you capture the values.

If none of the groups were capturing, I could still grab the whole matching using:

objMatcher.Group()

However, it would not be as easy to figure out where the field value ends and the !optional! delimiter begins. By allowing the field and the delimiter to be captured via a group, then it makes it much more convenient to use:

objMatcher.Group( 1 )

... to get the field value, and

objMatcher.Group( 2 )

... to get the optional delimiter.

So yes, it might be faster in the pattern to not capture, but overall, capturing the groups make the algorithm easier to code and to understand.

Reply to this Comment

Here, we use the java Ostermiller CSV parser
http://ostermiller.org/utils/CSV.html

On my test server, your code running at around 4000 - 5000ms
Ostermiller's is under 300ms

We do A LOT of data importing and exporting, so I've spent tons and tons of time looking for the best way to parse files.

The only CF I have to deal with is to call a function:

<cfscript>
function parseCSV(csvPath) {
jFIS = createObject( "java", "java.io.FileInputStream" );
jFIS.init("#ARGUMENTS.csvPath#");

csvParser = createObject( "java", "com.Ostermiller.util.CSVParser" );
csvParser.init( jFIS );
csvAsArray = csvParser.getAllValues();
csvParser.close();

jFIS.close();

return csvAsArray;
}
</cfscript>

<cfset csvArray = parseCSV(strCSVPath)>

Reply to this Comment

@Jason,

Yeah, a real Java package is probably going to outperform anything that I write. But keep in mind that mine runs basic ColdFusion without any additions. I assume that for the CSVParser to work, I would have to add a Java package to the ColdFusion install. This might not be a problem for most people, but I do not have access to do that sort of stuff.

But regardless, thanks for testing it up against the "big boys". Good to see where I fall in line :)

Reply to this Comment

Hello Chaps,

I do a fair amount, well, in fact, I do a large amount of CSV data parsing through a bunch of web services and things like that, and this is a very interesting topic for me.

Untill now I've pretty much cut CF out of the loop entirely as I dont really need to do any real 'manipulation' on the data, I just want to get it parsed into my database, so in the past I've always used a DTS package for SQL Server to do this for me, and more recently now I'm running 2k5 just opted for a stored proc that recieves a string for the file path of the CSV, it then does a BULK LOAD on the file, which seems to work very nicely.

Am I still handling this in the best fasion? or would one of these JAVA classes mixed up with an insert type query be my best bet?

I'm still betting that SQL handles this stuff infinatly faster than CF.

Rob

Reply to this Comment

Ben, since at http://www.bennadel.com/index.cfm?dax=blog:410.view you asked me to see if I had any suggestions for your CSV parser here, in Java I would use something like the following:

\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))

While the above is no longer ColdFusion-compatible, I assume that's not a big deal since you're using Java's regex engine anyway. The somewhat more advanced features it uses include possessive quantifiers, an atomic group, Jeffrey Friedl's "unrolling the loop" pattern, and Java's "\G" (just as an extra assurance that the regex's bump-along mechanism always puts us where we want to be). I've also used two different sets of parentheses to avoid having to remove the outer quotes from quoted values in post-processing. Instead, we can just check if a value was captured to backreference 2, and if so, use it, otherwise use backreference 3.

One other thing I would recommend after a brief lookover is changing the replacement of (""){2} with $1 to simply replacing "" with ". Also, if you're going to use a regex to do that, compile the regex outside the loop.

Reply to this Comment

Dude - You're a star.. Lovin' this code. I added The following code to convert the array to a query:-

<!--- Convert to a query --->
<cfset columnList = arrayToList(arrData[1])>
<cfset qryNew = queryNew(columnList)>
<cfloop from="2" to="#arrayLen(arrData)#" index="ptr">
<cfset itemArray = arrData[ptr]>
<cfset queryAddRow(qryNew)>
<cfloop from="1" to="#arrayLen(itemArray)#" index="ptr2">
<cfset querySetCell(qryNew, "#listGetAt(columnList, ptr2)#", itemArray[ptr2], qryNew.recordcount)>
</cfloop>
</cfloop>

Reply to this Comment

@Martin,

Glad you like it. Just watch out with your array to query algorithm as the array is not guaranteed to have all the sub-arrays the same length. Just be careful of the query dimensions.

Reply to this Comment

Ben,

Thank you for forwarding me to this article. Works even faster then the last and with less code!!!

one quick question, why can't i assign vars to each row in the cfloop output? Usually the following works...

Can you throw in an example how i can set a var to each row so that i can insert into the db? Thanks in advance

This isn't working:

<cfloop from="1" to="#arrayLen(arrData)#" index="i">
<cfoutput>#arrData[i]#</cfoutput> <br />
</cfloop>

Reply to this Comment

Ben,

I downloaded the UDF and the array dump of my CSV is working great. I am just having a problem setting variables to each column in each row in the csv.

For my particular CSV file, I have 19 columns and each column is shown in the dump but what var name do i use to output that data in a loop?

see below. this is the copy/paste of the dump. I'm sure it's easy but how do what syntax do i use to set a var to each row in the array? The normal methods i use are not working right... thanks.

array
1 501
2 9938
3 2002
4 BMW
5 128i
6 Sedan
7 1B7HL48X62S557215
8 0
9 24449
10 22000
11 Red
12 Tan, blue
13 Automatic
14 [empty string]
15 Description 1
16 Car
17 V6
18 Front Wheel Drive
19 Gasoline

Reply to this Comment

@Rob,

The result of the UDF is an array of arrays. Therefore, you need to refer to both indexes:

#arrData[ intRowIndex ][ intFieldIndex ]#

I think the problem is that you are only referring to the row index. That gives you another array which requires a second index.

Reply to this Comment

Yep! That's it. Thanks, i'm used to using structures within arrays and that seemed tricky for me at first

<cfoutput>#arrData[1][1]#</cfoutput>

etc...

Great! I can slee well tonight now :)

You da man.

Reply to this Comment

Great piece of code Ben. I searched and experimented with all kinds of methods found online; this was a godsend and by for the easiest solution!

Reply to this Comment

I needed a fast solution to import a csv list of countries and found your UDF. But did not manage to implement it.

Maybe it´s very simple to call a UDF. I placed it in the same folder from where I called it but no luck. Searched Adobe Knowledge base to found out exactly where to put it, but run out of time so I skiped it and instead used: http://sourceforge.net/projects/csvtosql, then I also had a sql dump file for mySQL server.

I agree that regular expressions can be realy handy sometimes!

Reply to this Comment

Thanks Ben,

This bit of code saved me a lot of work while battling with a 68000 line CSV file. I did modify it slightly as I was reading in the file line by line, so only needed the one dimensional array that would be replaced come the next line of the file. Certainly saved on memory usage.

I blogged about using it while battling a memory leak issue relating to processing the data from that CSV.
http://misterdai.wordpress.com/2009/06/24/when-not-to-use-cfqueryparam/

Cheers!
Dave (Mister Dai)

Reply to this Comment

How do you handle/detect bad data, or where different rows, may not have the right number of fields?

Or does it just assume all incoming data is correct, after parsing it through regex?

Reply to this Comment

@Craig,

That's an interesting question. CSV is a "standard". If someone where to deviate from the standard, I am not sure what to do. The regex might just not match, or find things in the incorrect rows. Really, this depends on valid data (just as your compiler depends on valid code).

Reply to this Comment

The reason being, is that many csv projects, have where you can't control the validity of the output, which is why we have to do some validity checking...

I am working on a project of importing csv files, and i am trying to discover ways, to help clarify what is correct and incorrect rows of data.

Because sadly you can't always assume that the incoming data is pristine and so i am just trying to take precautions...

I am also worried about memory leaks, and performance issues, and so forth...

Has there been any studies done on both the different ways to parse csv, and what methods more speedily import the data?

I am most likely going to be doing most of my cleaning and validating of data, and breaking the files into smaller chunks using your really nice code...

But once i have valid and clean and pristine csv data, i will most likely have to use sql server stored procedure to automate that part...

Was hoping to hear more best practices from other cfers that do lots of csv importing.

As well as your own experiences...

Thank You.

Reply to this Comment

@Craig,

I know there are some very powerful database-oriented ways to import the CSV data, but I don't have much experience with that. I use this kind of method and it seems to work well, but I do more exporting than importing.

Sorry I don't have more "best practice" type advice to offer.

Reply to this Comment

@Craig,

You may want to have a look at my blog post on my battles with extremely large CSV file and memory leakage (all due to compnents, cfquery and cfqueryparam usage, not Ben's CSV parsing routine). http://bit.ly/ZknIx

As far as validation is concerned, I have to take an all or nothing approach. Everything was wrapped in a cftransaction / cftry with a rollback if there were any errors. Any data which wasn't valid (e.g. date, barcode) would throw an error.

For minimum memory usage, use a stored procedure, read the CSV line-by-line (you can easily modify the CSV regex to do so), avoid CFQueryParam if possible, use CFQuery for calling the stored proc, also try to buffer your SQL into an array and execute it every 50 CSV rows (tweak this value for best performance).

As for best practices, I'm not sure of many but I'd say you're better off using the database to do the CSV importing if possible. Maybe handling just validation in CF and then passing it over to the Database for the actual import process.

Reply to this Comment

@David,

How does a database-csv-import work? Do you basically set up a datasource to a given file name? And then have CF place the new CSV in that file... which is how I think using an Excel file as a datasource works (although I haven't done that in years).

Reply to this Comment

@Ben,

Well for MS SQL Server 2000+ there is "BULK INSERT". To use it you'd have to validate the CSV file with ColdFusion first, copy it to a drive on the database server and then via CF just use the CFQuery tag to execute the bulk insert SQL.

http://msdn.microsoft.com/en-us/library/aa225968%28SQL.80%29.aspx

The only problem I can see with this is that it doesn't comply to CSV standards and only works with a single delimiter, no support for double quotes and extra commas. So it might need the CSV pre-processed to replace the comma delimiter with a control character and remove the double quotes where applicable.

Although I have seen something mentioned about a tool called bcp and providing it with file format information. But that would have to use a CFExecute :(

Hope this makes sense :)

Reply to this Comment

How would you modify this code if your delimiter was the hat (^) character? I ran the CSV function defining the delimiter and it worked great. I would like to try this one as well.

Thanks for all your hard work and sharing.

Reply to this Comment

@Dwayne,

You would need to add the "^" character to the list of delimiters. The only thing to be careful of is that "^" is a special character in Regular Expressions. As such, you would need to escape it:

... In character groups (since it is, by default, a "negation" selector):

[\^]

... In standard patterns (since it is, by default, a "string start" delimiter):

\^

Reply to this Comment

Exactly what I have looking for. I've ported this to ActionScript3 for using it in Flex Applications and it works great! Thanks a lot!

Reply to this Comment

@Artur,

Oh awesome! That's so cool to hear this works well in action script as well. Badass.

Reply to this Comment

I get a java heap error on a 4000 line x 8 col csv file. (Processing only, not doing anything with the result yet.)

If I cut the file in half it works.

Why does it use so much memory? Can anything be done to the function to improve that?

Reply to this Comment

@Ziggy,

4000 lines doesn't seem like too much. I guess string parsing takes up more memory that I assume it does. I am not sure what to tell you about that. I suppose you could try using a buffered input stream and parse the file a bit at a time. Definitely more complicated that way though.

Reply to this Comment

Thanks. Before I try that, this is my local computer, Railo on Tomcat - can anyone explain how to check and increase the memory for it? Maybe it's too small but I'm not familiar with managing either or tweaking any Java settings.

Reply to this Comment

Btw, you might want to change the big comments at the top of your code samples to cf comments so it can all be pasted in as-is.

Reply to this Comment

I found the settings for Tomcat and seems to be working now. Set it to 64mb, 256mb, 256kb. (No idea what it was before, the lines in the Windows gui were blank.)

Reply to this Comment

@Ziggy,

Glad the RAM increase worked; sorry that the algorithm was chewing up so much memory.

As far as the comments at the top of the copy-n-paste file, the problem right now is that not all code files are actual ColdFusion (some HTML, some Javascript, some SQL, etc.) As such, I went with a "pseduo comment" since there is nothing in the code right now that differentiates between types of code samples.

Hopefully, one day, I'll make my code snippet stuff more robust.

Reply to this Comment

I'm back, sorry.

I have this working locally on Railo but it doesn't work onsite with BD JX7 (7, not 7.01 or 7.1). The array is outputs is very odd, almost entirely blank then 1 value in an array after 46 rows, and some other odd ones lower down.

(Btw, had to move the comments just after line 123 outside the function or BD says badly formatted.)

Any workarounds or some unresolvable problem with BD7?

Reply to this Comment

If it helps, it outputs an array for every row correctly, and the first 16 have the correct number of columns, then it drops to one column short and later correct again, but no discernible reason why as columns on the csv are missing values throughout, for example.

And throughout they are almost entirely all blank, "[empty string". Even the first row of titles shows up as blanks. Then oddly there is a value in the 2nd column every 40-50+ rows or so.

Seems like a regex issue, no?

Reply to this Comment

@Ziggy,

Hmm, that sounds odd. Regex is something that should be pretty universal, at least amongst the major "flavors". Unfortunately, I know nothing about BlueDragon, so I can't offer any insight on why it would work in Railo and not in BD. Is it possible that BD uses a .NET regex library and Railo / CF use the Java regx library? I don't really know. Sorry.

Reply to this Comment

I couldn't figure it out but your older csv parser version worked on BD7 after again moving some comments out of the middle of a function. BD chokes on that.

Thanks!

Reply to this Comment

@Ziggy,

That is interesting. I am not sure why some of the comments would be breaking BD. I used to put comments IN actual methods calls. I think that even killed ColdFusion *some* of the time. But, I stopped doing that.

Reply to this Comment

Hey, Ben.

I refer to this article so often, I figured I ought to at least let you know that there's a free beer waiting for you if you ever make it to Indy. :)

I do a pretty decent amount of .csv parsing these days, and although I have to modify your RegEx on occasion I tend to use this as a starting point.

Reply to this Comment

@Matt,

That's awesome to hear. Sorry for the long response delay - some of these things just get lost. You ever go to BFusion / BFLEX? Not sure if I am gonna be able to make it this year (probably depends on how much Simon Free will hassle me ;)).

Reply to this Comment

Great article, but I'm shocked at how many smart people don't know how to spell lightning. Nothing hit you like a bolt of lightening. Lightening is what you do when you want to make your teeth brighter... you lighten them. The electricity in the sky during a thunderstorm is lightning. No "E" in the word.

Reply to this Comment

Hello. I would like to use ColdFusion to run the java parsing routine on this web site:

http://www.amugofjava.me.uk/2010/02/parsing-sharepoint-metadata.html

How do I do this in ColdFusion? I tried plugging this code directly into cfscript but no go. Looking for ideas on how to parse SharePoint ows_MetaInfo data into discreet name/value pairs. Here is an example of some meta data I need to parse:

vti_parserversion:SR|12.0.0.6535 ImgValidState:IW|1 Order:DW|198400.000000000 ImgYResolution:DW|600.000000000000 XapPhotographer:SW|Courtesy of B. Bagatto, Department of Biology, University of Akron vti_author:SR|Some Person ImgXResolution:DW|600.000000000000 ImgWidth:IW|7735 vti_lastwidth:IX|7735 ImgType:SW|JPEG ImgFormat:SW|RGB-24 IptcTitle_WriteBack:SW|Armadillo (Dasypus novemcinctus) IptcCaption_WriteBack:SW|Nine-banded armadillo. Researchers have sequenced this animal's genome. vti_modifiedby:SR|some person ImgHeight:IW|5119 ImgContentHash:SW|66723748029444162 XapRightsWebStatement:SW|Public ContentType:SW|Photos ContentTypeId:SW|0x00F8830F3CFD01084E8081B3978B01DDCD vti_lastheight:IX|5119 ImgResolutionUnits:SW|pixels/inch Props:IW|14704

All this gets spit out into one gigantic field. SharePoint web service does not provide discreet name/value pairs for this data. I'm using CFSHAREPOINT a lot lately. Thanks.

Reply to this Comment

As usual, I finally figured it out on my own, but this thread provided some key info. Anyway, for anyone who needs to parse through SharePoint's "ows_MetaInfo" field when you're calling getlistitems, here is the code that gave me the name/value pairs:

<cfsavecontent variable="metaText">vti_parserversion:SR|12.0.0.6535 ImgValidState:IW|1 Order:DW|198400.000000000 ImgYResolution:DW|600.000000000000 XapPhotographer:SW|Courtesy of B. Bagatto, Department of Biology, University of Akron vti_author:SR|Some Person ImgXResolution:DW|600.000000000000 ImgWidth:IW|7735 vti_lastwidth:IX|7735 ImgType:SW|JPEG ImgFormat:SW|RGB-24 IptcTitle_WriteBack:SW|Armadillo (Dasypus novemcinctus) IptcCaption_WriteBack:SW|Nine-banded armadillo. Researchers have sequenced this animal's genome. vti_modifiedby:SR|some person ImgHeight:IW|5119 ImgContentHash:SW|66723748029444162 XapRightsWebStatement:SW|Public ContentType:SW|Photos ContentTypeId:SW|0x00F8830F3CFD01084E8081B3978B01DDCD vti_lastheight:IX|5119 ImgResolutionUnits:SW|pixels/inch Props:IW|14704
</cfsavecontent>

<cfset test = #REMatchNoCase("(\w*):\w{2}\|", metaText)#>

<!--- I got the regular express from http://www.amugofjava.me.uk/2010/02/parsing-sharepoint-metadata.html There was no way I would have ever figured that out! --->

<cfoutput>
<cfset count = 0>
<cfloop array="#test#" index="i">
<cfset count = count+1>
<cfset nameLen = #Len(i)#>
<cfset startAt = #Find(i, metatext)#+nameLen>
<cfif ArrayIsDefined(test, count+1)>
<cfset endAt = #Find(test[count+1], metatext)#>
<cfelse>
<cfset endAt = #len(metaText)#>
</cfif>
<cfset valueLen = endAt-startAt>
<cfset string = #Mid(metaText, startAt, valueLen)#>
#Left(i, nameLen-4)# = #string#
<br>
</cfloop>

Hope this helps somebody.

Reply to this Comment

Here's my final version where I throw the name/value pairs into a structure and get to it that way. So easy!

<cfset splitter = #REMatchNoCase("(\w*):\w{2}\|", metainfo)#>

<cfset count = 0>
<cfloop array="#splitter#" index="i">
<cfset count = count+1>
<cfset nameLen = #Len(i)#>
<cfset startAt = #Find(i, metainfo)#+nameLen>
<cfif ArrayIsDefined(splitter, count+1)>
<cfset endAt = #Find(splitter[count+1], metainfo)#>
<cfelse>
<cfset endAt = #len(metainfo)#>
</cfif>
<cfset valueLen = endAt-startAt>
<cfset name = "#Left(i, nameLen-4)#">
<cfset value = "#Mid(metainfo, startAt, valueLen)#">
<cfset meta["#name#"] = #value#>
</cfloop>

Then you just access like so:

<cfparam name="meta.IptcTitle_WriteBack" default="0">

<cfif meta.IptcTitle_WriteBack neq 0>#meta.IptcTitle_WriteBack#<cfelse></cfif>

Reply to this Comment

I tried using your RegEx in my C# program, but it was matching an extra empty-string at the end and so I would end up with an extra field that doesn't exist, so I changed it to this:

(^|,)("(?:[^"]|"")*"|[^,\r\n]*)

It finds beginning string or comma first, then finds the quoted text or finds non-quoted text ending with comma or newline.

What do you think? (I'm new at this)

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.