Ben Nadel

Ask Ben: Splitting And Joining Large Binary Files Using Buffers In ColdFusion

Posted by Ben Nadel

I saw a tutorial on your site about splitting and joining files using CF [ColdFusion]. Is there any way to do this without reading in the entire contents of the file part(s) using ReadBinary? On large files this obviously becomes very memory intensive. If you are simply joining parts, why can't you join without having to read and do the checksum at the end to verify?

I know you are asking for a way to split and join files without using ReadBinary at all; and, while I think this is a good question, I wanted to try and address it in a simpler form first. As long as we are splitting files up into smaller parts, why don't we make those parts small enough to be managed by single binary reads? If we have a 2 gig file that we are trying to work with, don't split it up into two 1-gig files (a gig is still quite huge); split it up into many smaller parts. We don't lose anything from having more files. From personal experience, I see that most people split their large files using either 15 meg or 50 meg part sizes.
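To put numbers on that: the part count is just the file size divided by the part size, rounded up. Here is a quick sketch using the 2 gig file and 50 meg parts mentioned above (the variable names are made up for illustration):

```cfm
<!--- Illustrative sizes only: a 2 gig file split into 50 meg parts. --->
<cfset intFileSize = (2 * 1024 * 1024 * 1024) />
<cfset intPartSize = (50 * 1024 * 1024) />

<!--- Round up so that a trailing, partially-filled part is counted. --->
<cfset intPartCount = Ceiling( intFileSize / intPartSize ) />

<!--- Outputs: Parts needed: 41 --->
<cfoutput>Parts needed: #intPartCount#</cfoutput>
```

With 15 meg parts, the same 2 gig file works out to 137 parts.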

That being said, I am going to demo splitting a huge file using a small buffer, such that the contents of the original file are read in only a bit at a time, with each bit written to a smaller part file. Then, to rejoin these smaller files, we are going to assume they are each small enough to be read in via a CFFile [ action = readbinary ]:

<!---
    Get the large target binary file that we want to split
    up into several parts. Our demo file is only about
    5 megabytes, but this should be sufficient to demo.
--->
<cfset strTargetFile = ExpandPath( "crazy_girl.jpg" ) />

<!---
    Set the name of the re-compiled target file. This will
    be the path to which we recombine all the individual
    data chunks.
--->
<cfset strNewTargetFile = ExpandPath( "crazy_girl_new.jpg" ) />


<!---
    Set the number of bytes that we want to use for our
    file chunking size. For our demo purposes (since we don't
    have a huge file), let's use about a megabyte.
--->
<cfset intBufferSize = (1024 * 1024) />


<!---
    Create a file input stream to read in the binary file
    one chunk at a time so that we can split it up.
--->
<cfset objInputStream = CreateObject(
    "java",
    "java.io.FileInputStream"
    ).Init(
        JavaCast( "string", strTargetFile )
        )
    />

<!---
    Create a byte buffer into which we will read the file
    parts. This byte buffer will determine how large the
    file chunks are. We are going to use the underlying
    byte array of a string to create our byte array buffer.

    Let's make our byte buffer so that it is about a
    megabyte in size (1024 * 1024 bytes).
--->
<cfset arrBuffer = RepeatString( " ", intBufferSize ).GetBytes() />


<!---
    Now, we want to keep looping over the input stream and
    reading data until we can no longer read any more.
    We are going to use an index loop with a huge max just
    to use the counter aspect of it.
--->
<cfloop
    index="intFileIndex"
    from="1"
    to="99999"
    step="1">

    <!--- Read from the input stream. --->
    <cfset intBytesRead = objInputStream.Read(
        arrBuffer,
        JavaCast( "int", 0 ),
        JavaCast( "int", ArrayLen( arrBuffer ) )
        ) />

    <!---
        Check to see if we read any bytes into the buffer.
        If so, then we want to write those to file. If not,
        then we are done reading data.
    --->
    <cfif (intBytesRead GT 0)>

        <!---
            Our buffer contains a certain amount of data. We
            cannot simply write this buffer to disk in whole
            because it might not be completely full. Therefore,
            we cannot use a plain CFFile. Let's create a file
            output stream so that we can leverage its
            buffer-using Write() method.

            When choosing the file name for this file chunk,
            use the index value of the current read iteration
            to create a "part" file.
        --->
        <cfset objOutputStream = CreateObject(
            "java",
            "java.io.FileOutputStream"
            ).Init(
                JavaCast(
                    "string",
                    "#strTargetFile#.part#intFileIndex#"
                    )
                )
            />

        <!--- Write our buffer to that file output stream. --->
        <cfset objOutputStream.Write(
            arrBuffer,
            JavaCast( "int", 0 ),
            JavaCast( "int", intBytesRead )
            ) />

        <!--- Close the file output stream. --->
        <cfset objOutputStream.Close() />

    <cfelse>

        <!---
            We are done reading data. Close the file input
            stream to free it up as a system resource.
        --->
        <cfset objInputStream.Close() />

        <!--- Break out of read loop. --->
        <cfbreak />

    </cfif>

</cfloop>



<!--- END: Split ----------------------------------- --->



<!---
    We have now split our large binary file into several
    smaller files. Let's see if we can put it back together
    again in a new binary file.
--->

<!---
    Again, we don't want to be reading the whole file into
    memory, so let's create a file output stream to which
    we can write out smaller file data.
--->
<cfset objOutputStream = CreateObject(
    "java",
    "java.io.FileOutputStream"
    ).Init(
        JavaCast( "string", strNewTargetFile )
        )
    />


<!---
    Now, we want to loop until we can no longer find any
    smaller chunk files. Sure, we could just use the file
    index found above, but let's do this assuming we don't
    have any of the data from above.
--->
<cfloop
    index="intFileIndex"
    from="1"
    to="99999"
    step="1">

    <!--- Get the file name of the next chunk file. --->
    <cfset strFileName = "#strTargetFile#.part#intFileIndex#" />

    <!--- Check to see if this file part exists. --->
    <cfif FileExists( strFileName )>

        <!---
            Since we know that these smaller files are not
            too big, we can simply do a binary read of the
            complete chunk files into memory.
        --->
        <cffile
            action="readbinary"
            file="#strFileName#"
            variable="binFileData"
            />

        <!---
            Write that file data to our output stream. We are
            going to pretend that the binary data read we just
            did was actually a byte buffer.
        --->
        <cfset objOutputStream.Write(
            binFileData,
            JavaCast( "int", 0 ),
            JavaCast( "int", ArrayLen( binFileData ) )
            ) />

    <cfelse>

        <!---
            We have finished reading in smaller chunk files.
            We can now close the new target file output stream
            which will finalize this process.
        --->
        <cfset objOutputStream.Close() />

        <!--- Break out of this loop. --->
        <cfbreak />

    </cfif>

</cfloop>

As you can see, for the demo, we are reading in the target file one megabyte at a time using a file input stream. That entire megabyte buffer is then written as a part file. Once this process is complete, we then read each part file using CFFile and write it to a file output stream. This way, no more than a megabyte is ever read into memory at any given time. So, while this might produce a lot of files (only 4 in my demo scenario), they are relatively small. Of course, you can increase the buffer size to reduce the number of files, but don't make it too large or you will eat up your memory.
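If the part files were ever too big for a comfortable binary read, the CFFile step in the join could be swapped for the same buffer loop used in the split. What follows is only a rough sketch under that assumption, not a tested drop-in: it reuses the strTargetFile and strNewTargetFile paths from the demo, and the one megabyte buffer size is an arbitrary choice.

```cfm
<cfset strTargetFile = ExpandPath( "crazy_girl.jpg" ) />
<cfset strNewTargetFile = ExpandPath( "crazy_girl_new.jpg" ) />

<!--- One shared buffer (size is an assumption) for every part file. --->
<cfset arrBuffer = RepeatString( " ", (1024 * 1024) ).GetBytes() />

<cfset objOutputStream = CreateObject( "java", "java.io.FileOutputStream" ).Init(
    JavaCast( "string", strNewTargetFile )
    ) />

<cfloop index="intFileIndex" from="1" to="99999" step="1">

    <cfset strFileName = "#strTargetFile#.part#intFileIndex#" />

    <!--- Stop at the first missing part file. --->
    <cfif NOT FileExists( strFileName )>
        <cfbreak />
    </cfif>

    <cfset objInputStream = CreateObject( "java", "java.io.FileInputStream" ).Init(
        JavaCast( "string", strFileName )
        ) />

    <!--- Copy this part across one buffer-load at a time. --->
    <cfloop condition="true">

        <cfset intBytesRead = objInputStream.Read(
            arrBuffer,
            JavaCast( "int", 0 ),
            JavaCast( "int", ArrayLen( arrBuffer ) )
            ) />

        <!--- A non-positive count means this part is exhausted. --->
        <cfif (intBytesRead LTE 0)>
            <cfbreak />
        </cfif>

        <cfset objOutputStream.Write(
            arrBuffer,
            JavaCast( "int", 0 ),
            JavaCast( "int", intBytesRead )
            ) />

    </cfloop>

    <cfset objInputStream.Close() />

</cfloop>

<cfset objOutputStream.Close() />
```

With this variation, memory use is bounded by the buffer size rather than by the size of the largest part file.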

Again, I realize that this doesn't exactly address your issue, but maybe this will help. Let me know if you want to see a demo of a buffered, large-chunk file solution. You can definitely use buffers to create large, 1-gig chunk files, but it is a bit more complicated. But let me know, and I can show you.
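For the curious, here is a minimal sketch of what a buffered, large-chunk split could look like. It is an illustration under stated assumptions rather than a finished solution: the 50 meg part size is arbitrary, and intPartSize, intBytesInPart, and blnPartOpen are names invented for this sketch. The idea is to keep the small read buffer but leave each part file open across several reads until it reaches the target size.

```cfm
<!--- Assumed sizes: a 1 meg read buffer feeding 50 meg part files. --->
<cfset intBufferSize = (1024 * 1024) />
<cfset intPartSize = (50 * 1024 * 1024) />

<cfset strTargetFile = ExpandPath( "crazy_girl.jpg" ) />
<cfset arrBuffer = RepeatString( " ", intBufferSize ).GetBytes() />

<cfset objInputStream = CreateObject( "java", "java.io.FileInputStream" ).Init(
    JavaCast( "string", strTargetFile )
    ) />

<cfset intFileIndex = 0 />
<cfset intBytesInPart = 0 />
<cfset blnPartOpen = false />

<cfloop condition="true">

    <!--- Never ask for more bytes than the current part has room for. --->
    <cfset intBytesRead = objInputStream.Read(
        arrBuffer,
        JavaCast( "int", 0 ),
        JavaCast( "int", Min( intBufferSize, (intPartSize - intBytesInPart) ) )
        ) />

    <!--- A non-positive count means the input file is exhausted. --->
    <cfif (intBytesRead LTE 0)>
        <cfbreak />
    </cfif>

    <!--- Open the next part file only once there is data for it. --->
    <cfif NOT blnPartOpen>
        <cfset intFileIndex = (intFileIndex + 1) />
        <cfset objOutputStream = CreateObject( "java", "java.io.FileOutputStream" ).Init(
            JavaCast( "string", "#strTargetFile#.part#intFileIndex#" )
            ) />
        <cfset blnPartOpen = true />
    </cfif>

    <cfset objOutputStream.Write(
        arrBuffer,
        JavaCast( "int", 0 ),
        JavaCast( "int", intBytesRead )
        ) />
    <cfset intBytesInPart = (intBytesInPart + intBytesRead) />

    <!--- Once the part is full, close it and start counting again. --->
    <cfif (intBytesInPart GTE intPartSize)>
        <cfset objOutputStream.Close() />
        <cfset blnPartOpen = false />
        <cfset intBytesInPart = 0 />
    </cfif>

</cfloop>

<!--- Close whatever is still open. --->
<cfif blnPartOpen>
    <cfset objOutputStream.Close() />
</cfif>
<cfset objInputStream.Close() />
```

Because the part size and the buffer size are now independent, the part size can grow to a gig without the buffer (and therefore the memory footprint) growing with it.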




Reader Comments

Thank you for this; it's really helping a lot. I have a large text file that I need to split into two files of roughly the same size, but I need to make sure that the split happens immediately after a linefeed/carriage return. Is there a way to ensure that the split happens after a certain character in the content? I can't just split it by looping through the file as if it were a list of linefeed-delimited items; it's too large.

@Jen,

I guess you'd have to read in the file using some sort of buffered reader; then, when you are at around the right split size, start checking for the line break?

Thanks for replying, Ben. I figured it out - I split the file in half using your code, and then I read the second part, stripped off the text before the first linefeed, and appended that to the end of the first part. Worked like a charm!

Ben,

This was a good read and got me started down the right path. I had a real-world application for this, so I thought I'd post it so that others might be able to take advantage of streaming large binary data elements from a column.

<!--- block size increment --->
<cfset variables.blockSize="64000" />

<!--- get an initial read of the data elements needed --->
<cfquery name="getAssetMeta" datasource="[ds]">
select asset, asset_title, asset_size, asset_type, asset_content_type, asset_content_subtype
from denver_assets.[dbo].assets
where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
</cfquery>

<!--- stream the binary from the table and append to the buffer obj --->
<cfset objByteBuffer = CreateObject("java","java.nio.ByteBuffer") />
<cfset objBuffer = objByteBuffer.Allocate(JavaCast( "int", getAssetMeta.asset_size )) />

<cfloop index="variables.offset" from="1" to="#getAssetMeta.asset_size#" step="#variables.blockSize#">

<cfquery name="getAsset" datasource="[ds]">
select
substring(asset,#VARIABLES.offset#,#VARIABLES.blockSize#) as [asset]
from assets
where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
</cfquery>

<cfset byteObj = objBuffer.Put(getAsset.asset) />

</cfloop>

<!--- reset the asset column with the value of the new asset to pass back to calling object --->
<cfset QuerySetCell(getAssetMeta, "asset", byteObj.Array()) />

Hope this helps someone else!
Jason

@Jason,

I'm having a little trouble following your queries. Why are you pulling out the assets a chunk at a time .... Ohhh, is this to get around the query MAX size? I know that a query (in ColdFusion) has limits imposed on it based on some CFAdmin settings; chunking the file would allow you to move more data from the Database into the ColdFusion query memory space without getting "fake" truncation issues.

Is that correct?

@Ben

Sorry for the delay.

Your assumption is correct. I store all of our documents in the database. Some of these documents are as large as 20 MB or so. When you increase the query buffer size in CFAdmin, you start to take on tremendous performance hits.

I will say, to each their own, but I like having my document metadata in the same place as the physical document itself. I hate storing a document in the file system and managing the metadata in the database. In the past, it's always led to maintenance issues and disparate data.

@Jason,

No worries my man at all. I think what you're doing is pretty cool; once I figured out why you were splitting the file, I thought it was intriguing. I have been told that storing files in a database can lead to some performance problems, but it's not something I've tried. If it's working for you, it's a cool solution.

Hi Ben,

I have designed a web application to upload files from a local machine to a server and download them anywhere in my network.

I have used the Apache Commons upload jar file for this. My application works fine when the file size is 1 GB, but when the file size exceeds that, it throws an error:
"The field dataFile exceeds its maximum permitted size of 1073741824 bytes."

I think there is a limitation on uploads using the above jar.

But I want to upload large files to the server. Can I achieve this by splitting the file and joining it? (I want to upload any kind of file, e.g. .exe, .zip, etc.)

Is there any other way to achieve the same?

Regards,
Anshuman
