Ask Ben: Splitting And Joining Large Binary Files Using Buffers In ColdFusion

By Ben Nadel on June 3, 2008

I saw a tutorial on your site about splitting and joining files using CF [ColdFusion]. Is there any way to do this without reading in the entire contents of the file part(s) using ReadBinary? On large files this obviously becomes very memory intensive. If you are simply joining parts, why can't you join without having to read and do the checksum at the end to verify?

I know you are asking for a way to split and join files without using ReadBinary at all; and, while I think this is a good question, I wanted to try to address it in a simpler form first. As long as we are splitting files up into smaller parts, why don't we make those parts small enough that each one can be handled by a single binary read? If we have a 2 gig file that we are trying to work with, don't split it up into two 1-gig files (a gig is still quite huge); split it up into many smaller parts. We don't lose anything from having more files. From personal experience, I see that most people split their large files using either 15 meg or 50 meg file sizes.

That being said, I am going to demo the splitting of a huge file using a small buffer, such that the original file is only ever read into memory a bit at a time, with each bit written to a smaller part file. Then, to rejoin these smaller files, we are going to assume they are each small enough to be read in via a CFFile [ action = readbinary ]:

<!---
	Get the large target binary file that we want to split
	up into several parts. Our demo file is only about
	5 megabytes, but this should be sufficient to demo.
--->
<cfset strTargetFile = ExpandPath( "crazy_insane.jpg" ) />

<!---
	Set the name of the re-compiled target file. This will
	be the path to which we recombine all the individual
	data chunks.
--->
<cfset strNewTargetFile = ExpandPath( "crazy_insane_new.jpg" ) />


<!---
	Set the number of bytes that we want to use for our
	file chunking size. For our demo purposes (since we don't
	have a huge file), let's use about a megabyte.
--->
<cfset intBufferSize = (1024 * 1024) />


<!---
	Create a file input stream to read in the binary file
	a chunk at a time so that we can split it up.
--->
<cfset objInputStream = CreateObject(
	"java",
	"java.io.FileInputStream"
	).Init(
		JavaCast( "string", strTargetFile )
		)
	/>

<!---
	Create a byte buffer into which we will read the file
	parts. This byte buffer will determine how large the
	file chunks are. We are going to use the underlying
	byte array of a string to create our byte array buffer.

	Let's make our byte buffer so that it is about a
	megabyte in size (1024 * 1024 bytes).
--->
<cfset arrBuffer = RepeatString( " ", intBufferSize ).GetBytes() />


<!---
	Now, we want to keep looping over the input stream and
	reading chunks until we can no longer read any more data.
	We are going to use an index loop with a huge max just
	to use the counter aspect of it.
--->
<cfloop
	index="intFileIndex"
	from="1"
	to="99999"
	step="1">

	<!--- Read from the input stream. --->
	<cfset intBytesRead = objInputStream.Read(
		arrBuffer,
		JavaCast( "int", 0 ),
		JavaCast( "int", ArrayLen( arrBuffer ) )
		) />

	<!---
		Check to see if we read any bytes into the buffer.
		If so, then we want to write those to file. If not
		(Read() returns -1 at the end of the stream), then
		we are done reading data.
	--->
	<cfif (intBytesRead GT 0)>

		<!---
			Our buffer contains a certain amount of data. We
			cannot simply write this buffer to disk in whole
			because it might not be completely full. Therefore,
			we cannot use a plain CFFile. Let's create a file
			output stream so that we can leverage its Write()
			method, which takes an offset and a length.

			When choosing the file name for this file chunk,
			use the index value of the current read iteration
			to create a "part" file.
		--->
		<cfset objOutputStream = CreateObject(
			"java",
			"java.io.FileOutputStream"
			).Init(
				JavaCast(
					"string",
					"#strTargetFile#.part#intFileIndex#"
					)
				)
			/>

		<!--- Write our buffer to that file output stream. --->
		<cfset objOutputStream.Write(
			arrBuffer,
			JavaCast( "int", 0 ),
			JavaCast( "int", intBytesRead )
			) />

		<!--- Close the file output stream. --->
		<cfset objOutputStream.Close() />

	<cfelse>

		<!---
			We are done reading data. Close the file input
			stream to free it up as a system resource.
		--->
		<cfset objInputStream.Close() />

		<!--- Break out of read loop. --->
		<cfbreak />

	</cfif>

</cfloop>



<!--- END: Split ----------------------------------- --->



<!---
	We have now split our large binary file into several
	smaller files. Let's see if we can put it back together
	again in a new binary file.
--->

<!---
	Again, we don't want to be reading the whole file into
	memory, so let's create a file output stream to which
	we can write out smaller file data.
--->
<cfset objOutputStream = CreateObject(
	"java",
	"java.io.FileOutputStream"
	).Init(
		JavaCast( "string", strNewTargetFile )
		)
	/>


<!---
	Now, we want to loop until we no longer can find any
	smaller chunk files. Sure we could just use the file
	index found above, but let's do this assuming we don't
	have any of the data from above.
--->
<cfloop
	index="intFileIndex"
	from="1"
	to="99999"
	step="1">

	<!--- Get the file name of the next chunk file. --->
	<cfset strFileName = "#strTargetFile#.part#intFileIndex#" />

	<!--- Check to see if this file part exists. --->
	<cfif FileExists( strFileName )>

		<!---
			Since we know that these smaller files are not
			too big, we can simply do a binary read of the
			complete chunk files into memory.
		--->
		<cffile
			action="readbinary"
			file="#strFileName#"
			variable="binFileData"
			/>

		<!---
			Write that file data to our output stream. The
			binary data we just read is a byte array under the
			hood, so we can pass it to Write() as if it were
			a byte buffer.
		--->
		<cfset objOutputStream.Write(
			binFileData,
			JavaCast( "int", 0 ),
			JavaCast( "int", ArrayLen( binFileData ) )
			) />

	<cfelse>

		<!---
			We have finished reading in smaller chunk files.
			We can now close the new target file output stream
			which will finalize this process.
		--->
		<cfset objOutputStream.Close() />

		<!--- Break out of this loop. --->
		<cfbreak />

	</cfif>

</cfloop>

As you can see, for the demo, we are reading in the target file one megabyte at a time using a file input stream. Whatever landed in the buffer on each read is then written out as a part file. Once this process is complete, we read each part file using CFFile and write it to a file output stream. This way, no more than about a megabyte of file data is held in memory at any given time. So, while this might produce a lot of files (only 4 in my demo scenario), they are relatively small. Of course, you can increase the buffer size to reduce the number of files, but don't make it too large or you will eat up your memory.
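
For comparison, here is a minimal sketch of the same split-and-join in plain Java, using the same java.io stream classes that the ColdFusion code instantiates through CreateObject(). The class name PartFileDemo and its method names are made up for illustration, and readNBytes() assumes Java 9 or later; treat it as a sketch of the technique under those assumptions, not as the code behind this post.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
final class PartFileDemo {

    // Split source into source.part1, source.part2, ... where each part
    // holds at most bufferSize bytes. Returns the number of parts written.
    static int split(Path source, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int partCount = 0;
        try (InputStream in = new FileInputStream(source.toFile())) {
            int bytesRead;
            // readNBytes() keeps reading until the buffer is full or the
            // stream ends, so only the last part can be short.
            while ((bytesRead = in.readNBytes(buffer, 0, buffer.length)) > 0) {
                partCount++;
                Path part = Paths.get(source + ".part" + partCount);
                try (OutputStream out = new FileOutputStream(part.toFile())) {
                    // Write only the bytes that were actually read.
                    out.write(buffer, 0, bytesRead);
                }
            }
        }
        return partCount;
    }

    // Append source.part1, source.part2, ... to target until a part file
    // is missing. Returns the number of parts that were joined.
    static int join(Path source, Path target) throws IOException {
        int partCount = 0;
        try (OutputStream out = new FileOutputStream(target.toFile())) {
            while (true) {
                Path part = Paths.get(source + ".part" + (partCount + 1));
                if (!Files.exists(part)) {
                    break;
                }
                // Each part is small, so reading it whole is acceptable here.
                out.write(Files.readAllBytes(part));
                partCount++;
            }
        }
        return partCount;
    }
}
```

Calling PartFileDemo.split( path, 1024 * 1024 ) and then PartFileDemo.join( path, newPath ) mirrors the demo above: only one buffer's worth (or one part's worth) of data is in memory at a time.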

Again, I realize that this doesn't exactly address your issue, but maybe it will help. You can definitely use buffers to create large, 1-gig chunk files, but it is a bit more complicated. Let me know if you want to see a demo of that buffered, large-chunk approach.
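
One way the buffered, large-part variant might look, again sketched in plain Java rather than ColdFusion: the part size and the buffer size are decoupled, so a part can be a gigabyte while only a small buffer sits in memory, and the join streams each part through that same small buffer instead of reading it whole. BufferedPartSplitter, partPath(), and the parameter names are hypothetical, and the sketch assumes partSize and bufferSize are both greater than zero.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
final class BufferedPartSplitter {

    // Split source into parts of up to partSize bytes each while holding
    // at most bufferSize bytes in memory. Returns the number of parts.
    static int split(Path source, long partSize, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int partCount = 0;
        long roomInPart = 0; // bytes the currently open part can still take
        OutputStream out = null;
        try (InputStream in = Files.newInputStream(source)) {
            while (true) {
                // Never read more than the current (or next) part can hold.
                long room = (roomInPart > 0) ? roomInPart : partSize;
                int wanted = (int) Math.min(buffer.length, room);
                int bytesRead = in.read(buffer, 0, wanted);
                if (bytesRead <= 0) {
                    break; // end of the source file
                }
                if (roomInPart == 0) {
                    // The previous part is full (or none is open yet), so
                    // start the next part only now that data is in hand.
                    if (out != null) {
                        out.close();
                    }
                    partCount++;
                    out = Files.newOutputStream(partPath(source, partCount));
                    roomInPart = partSize;
                }
                out.write(buffer, 0, bytesRead);
                roomInPart -= bytesRead;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return partCount;
    }

    // Stream every existing part through the buffer into target, so no
    // part is ever read into memory as a whole.
    static int join(Path source, Path target, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int partCount = 0;
        try (OutputStream out = Files.newOutputStream(target)) {
            while (Files.exists(partPath(source, partCount + 1))) {
                partCount++;
                try (InputStream in = Files.newInputStream(partPath(source, partCount))) {
                    int bytesRead;
                    while ((bytesRead = in.read(buffer)) > 0) {
                        out.write(buffer, 0, bytesRead);
                    }
                }
            }
        }
        return partCount;
    }

    // Same naming scheme as the ColdFusion demo: <file>.part<index>.
    static Path partPath(Path source, int index) {
        return Paths.get(source + ".part" + index);
    }
}
```

With this shape, neither the split nor the join needs a whole-file read such as ReadBinary; memory use is bounded by bufferSize regardless of how large the parts are.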

Short link: https://bennadel.com/1251

Reader Comments

Jen McVicker Aug 26, 2009 at 7:22 PM

Thank you for this; it's really helping a lot. I have a large text file that I need to split into two files of roughly the same size, but I need to make sure that the split happens immediately after a linefeed/carriage return. Is there a way to ensure that the split happens after a certain character in the content? I can't just split it by looping through the file as if it were a list of linefeed-delimited items; it's too large.

Ben Nadel Sep 2, 2009 at 10:03 AM

@Jen,

I guess you'd have to read in the file using some sort of buffered reader; then, when you are at around the right split size, start checking for the line break?
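
A rough sketch of that idea in plain Java, working on bytes through buffered streams: copy into the first part until the target size is reached, keep going until a line feed has been written, and send everything after that to the second part. LineBoundarySplitter and its parameters are hypothetical. The sketch assumes Java 9 or later (for transferTo()) and an encoding in which a line feed is always the single byte 0x0A, such as ASCII or UTF-8; a CR+LF pair stays together because the split happens after the LF.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
final class LineBoundarySplitter {

    // Copy source into firstPart until at least targetSize bytes have been
    // written AND the last byte written was a line feed; everything after
    // that goes to secondPart. Returns the size of firstPart in bytes.
    static long splitAtLine(Path source, Path firstPart, Path secondPart, long targetSize)
            throws IOException {
        long firstSize = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source));
             OutputStream first = new BufferedOutputStream(Files.newOutputStream(firstPart));
             OutputStream second = new BufferedOutputStream(Files.newOutputStream(secondPart))) {
            int b;
            while ((b = in.read()) != -1) {
                first.write(b);
                firstSize++;
                // Once past the target size, stop right after a line feed.
                if (firstSize >= targetSize && b == '\n') {
                    break;
                }
            }
            // Whatever is left in the stream belongs to the second part.
            in.transferTo(second);
        }
        return firstSize;
    }
}
```

Passing half of the file size as targetSize gives two parts of roughly equal size, with the first part always ending on a complete line.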

Jen McVicker Sep 2, 2009 at 11:31 AM

Thanks for replying, Ben. I figured it out - I split the file in half using your code, and then I read the second part, stripped off the text before the first linefeed, and appended that to the end of the first part. Worked like a charm!
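
The fix-up described here could look something like the following plain Java sketch: after a byte-based split, the start of the second part (up to and including its first line feed) is appended to the first part, and the remainder replaces the second part. PartBoundaryFixer, moveLeadingLine(), and the ".tmp" scratch file are hypothetical names; the sketch assumes Java 9 or later and an encoding where a line feed is the single byte 0x0A.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class; the names are illustrative only.
final class PartBoundaryFixer {

    // Move the leading partial line of secondPart (through its first line
    // feed) onto the end of firstPart. Returns the number of bytes moved.
    static long moveLeadingLine(Path firstPart, Path secondPart) throws IOException {
        // Scratch file that receives the rest of the second part.
        Path rest = Paths.get(secondPart + ".tmp");
        long moved = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(secondPart));
             OutputStream first = new BufferedOutputStream(
                     Files.newOutputStream(firstPart, StandardOpenOption.APPEND));
             OutputStream out = Files.newOutputStream(rest)) {
            int b;
            while ((b = in.read()) != -1) {
                first.write(b);
                moved++;
                if (b == '\n') {
                    break; // the first part now ends on a complete line
                }
            }
            // Stream the remainder without loading it into memory.
            in.transferTo(out);
        }
        Files.move(rest, secondPart, StandardCopyOption.REPLACE_EXISTING);
        return moved;
    }
}
```

Because the remainder is streamed rather than read whole, this also works when the second part is itself too large for a single read.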

Ben Nadel Sep 6, 2009 at 11:58 AM

@Jen,

Ok cool - glad you got it working.

Jason Minnick Sep 8, 2010 at 8:35 AM

Ben,

This was a good read and got me started down the right path. I had a real-world application for this, so I thought I'd post it so that others might be able to take advantage of streaming large binary data elements from a column.

<cfset variables.blockSize="64000" />

<cfquery name="getAssetMeta" datasource="[ds]">
	select asset, asset_title, asset_size, asset_type, asset_content_type, asset_content_subtype
	from denver_assets.[dbo].assets
	where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
</cfquery>

<cfset objByteBuffer = CreateObject("java","java.nio.ByteBuffer") />
<cfset objBuffer = objByteBuffer.Allocate(JavaCast( "int", getAssetMeta.asset_size )) />

<cfloop index="variables.offset" from="1" to="#getAssetMeta.asset_size#" step="#variables.blockSize#">

	<cfquery name="getAsset" datasource="[ds]">
		select
			substring(asset,#VARIABLES.offset#,#VARIABLES.blockSize#) as [asset]
		from assets
		where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
	</cfquery>

	<cfset byteObj = objBuffer.Put(getAsset.asset) />

</cfloop>

<cfset QuerySetCell(getAssetMeta, "asset", byteObj.Array()) />

Hope this helps someone else!
Jason
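
The loop in this comment can be read as the following plain Java sketch, which fetches a large value in fixed-size blocks by 1-based offset and collects the blocks in a java.nio.ByteBuffer sized to the full length. ChunkSource is a hypothetical stand-in for the SUBSTRING( asset, offset, length ) query, and inMemory() exists only so the loop can be exercised without a database; neither comes from the comment itself.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper class; the names are illustrative only.
final class ChunkedBlobReader {

    // Stand-in for the SUBSTRING(asset, offset, length) query: returns up
    // to length bytes starting at the 1-based offset.
    interface ChunkSource {
        byte[] fetch(int offset, int length);
    }

    // Pull totalSize bytes from the source, blockSize bytes per request,
    // and assemble them into a single byte array.
    static byte[] readAll(ChunkSource source, int totalSize, int blockSize) {
        ByteBuffer buffer = ByteBuffer.allocate(totalSize);
        // Offsets are 1-based to match SQL SUBSTRING semantics.
        for (int offset = 1; offset <= totalSize; offset += blockSize) {
            buffer.put(source.fetch(offset, blockSize));
        }
        return buffer.array();
    }

    // In-memory ChunkSource used only to try the loop without a database.
    static ChunkSource inMemory(byte[] data) {
        return (offset, length) -> {
            int from = offset - 1;
            int to = Math.min(data.length, from + length);
            return Arrays.copyOfRange(data, from, to);
        };
    }
}
```

Note that the assembled value still ends up fully in memory; the chunking only keeps each individual query result small.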

Ben Nadel Sep 10, 2010 at 6:36 PM

@Jason,

I'm having a little trouble following your queries. Why are you pulling out the assets a chunk at a time .... Ohhh, is this to get around the query MAX size? I know that a query (in ColdFusion) has limits imposed on it based on some CFAdmin settings; chunking the file would allow you to move more data from the Database into the ColdFusion query memory space without getting "fake" truncation issues.

Is that correct?

Jason Minnick Oct 28, 2010 at 12:07 PM

@Ben

Sorry for the delay.

Your assumption is correct. I store all of our documents in the database. Some of these documents are as large as 20 MB or so. When you increase the query buffer size in CFAdmin, you start to take on tremendous performance hits.

To each their own, but I like having my document metadata in the same place as the physical document itself. I hate storing a document in the file system and managing the metadata in the database. In the past, it's always led to maintenance issues and disparate data.

Ben Nadel Nov 1, 2010 at 9:37 PM

@Jason,

No worries my man at all. I think what you're doing is pretty cool; once I figured out why you were splitting the file, I thought it was intriguing. I have been told that storing files in a database can lead to some performance problems, but it's not something I've tried. If it's working for you, it's a cool solution.

Anshuman Mar 25, 2013 at 5:58 AM

Hi Ben,

I have designed a web application to upload files from a local machine to a server and download them anywhere on my network.

I have used the Apache Commons upload jar file for this. My application works fine when the file size is up to 1 GB, but when the file size exceeds that, it throws an error:
"The field dataFile exceeds its maximum permitted size of 1073741824 bytes."

I think there is a limitation on uploads when using the above jar.

But I want to upload large files to the server. Can I achieve this by splitting the file and joining it? (I want to upload any kind of file, e.g. .exe, .zip, etc.)

Is there any other way to achieve the same?

Regards,
Anshuman
