Ask Ben: Splitting And Joining Large Binary Files Using Buffers In ColdFusion
I saw a tutorial on your site about splitting and joining files using CF [ColdFusion]. Is there any way to do this without reading in the entire contents of the file part(s) using ReadBinary? On large files, this obviously becomes very memory intensive. If you are simply joining parts, why can't you join without having to read and do the checksum at the end to verify?
I know you are asking for a way to split and join files without using ReadBinary at all; and, while I think this is a good question, I wanted to try and address it in a simpler form first. As long as we are splitting files up into smaller parts, why don't we make those parts small enough to be managed by single binary reads. If we have a 2 gig file that we are trying to work with, don't split it up into two 1-gig files - a gig is still quite huge - split it up into many smaller parts. We don't lose anything from having more files. From personal experience, I see that most people split their large files using either 15 meg or 50 meg file sizes.
That being said, I am going to demo the splitting of a huge file using a small buffer such that the entire contents of the original file only get read in a bit at a time and written to a smaller part file. Then, to rejoin these smaller files, we are going to assume they are each small enough to be read in via a CFFile [ action = readbinary ]:
<!---
	Get the large target binary file that we want to split up
	into several parts. Our demo file is only about 5 megabytes,
	but this should be sufficient to demo.
--->
<cfset strTargetFile = ExpandPath( "crazy_insane.jpg" ) />

<!---
	Set the name of the re-compiled target file. This will be the
	path to which we recombine all the individual data chunks.
--->
<cfset strNewTargetFile = ExpandPath( "crazy_insane_new.jpg" ) />

<!---
	Set the number of bytes that we want to use for our file
	chunking size. For our demo purposes (since we don't have a
	huge file), let's use about a megabyte.
--->
<cfset intBufferSize = (1024 * 1024) />

<!---
	Create a file input stream to read in the chunks of the binary
	file at a time so that we can split it up.
--->
<cfset objInputStream = CreateObject( "java", "java.io.FileInputStream" ).Init( JavaCast( "string", strTargetFile ) ) />

<!---
	Create a byte buffer into which we will read the file parts.
	This byte buffer will determine how large the file chunks are.
	We are going to use the underlying byte array of a string to
	create our byte array buffer. Let's make our byte buffer so
	that it is about a megabyte in size (1024 * 1024 bytes).
--->
<cfset arrBuffer = RepeatString( " ", intBufferSize ).GetBytes() />

<!---
	Now, we want to keep looping over the input stream and reading
	files until we no longer can read any more data. We are going
	to use an index loop with a huge max just to use the counter
	aspect of it.
--->
<cfloop index="intFileIndex" from="1" to="99999" step="1">

	<!--- Read from the input stream. --->
	<cfset intBytesRead = objInputStream.Read( arrBuffer, JavaCast( "int", 0 ), JavaCast( "int", ArrayLen( arrBuffer ) ) ) />

	<!---
		Check to see if we read any bytes from the buffer. If so,
		then we want to write those to file. If not, then we are
		done reading data.
	--->
	<cfif (intBytesRead GT 0)>

		<!---
			Our buffer contains a certain amount of data. We cannot
			simply write this buffer to disk in whole because it
			might not be completely full. Therefore, we cannot use
			a plain CFFile. Let's create a file output stream so
			that we can leverage its buffer-using Write() method.
			When choosing the file name for this file chunk, use
			the index value of the current read iteration to create
			a "part" file.
		--->
		<cfset objOutputStream = CreateObject( "java", "java.io.FileOutputStream" ).Init( JavaCast( "string", "#strTargetFile#.part#intFileIndex#" ) ) />

		<!--- Write our buffer to that file output stream. --->
		<cfset objOutputStream.Write( arrBuffer, JavaCast( "int", 0 ), JavaCast( "int", intBytesRead ) ) />

		<!--- Close the file output stream. --->
		<cfset objOutputStream.Close() />

	<cfelse>

		<!---
			We are done reading data. Close the file input stream
			to free it up as a system resource.
		--->
		<cfset objInputStream.Close() />

		<!--- Break out of read loop. --->
		<cfbreak />

	</cfif>

</cfloop>

<!--- END: Split ----------------------------------- --->

<!---
	We have now split our large binary file into several smaller
	files. Let's see if we can put it back together again in a new
	binary file.
--->

<!---
	Again, we don't want to be reading the whole file into memory,
	so let's create a file output stream to which we can write out
	smaller file data.
--->
<cfset objOutputStream = CreateObject( "java", "java.io.FileOutputStream" ).Init( JavaCast( "string", strNewTargetFile ) ) />

<!---
	Now, we want to loop until we no longer can find any smaller
	chunk files. Sure, we could just use the file index found
	above, but let's do this assuming we don't have any of the
	data from above.
--->
<cfloop index="intFileIndex" from="1" to="99999" step="1">

	<!--- Get the file name of the next chunk file. --->
	<cfset strFileName = "#strTargetFile#.part#intFileIndex#" />

	<!--- Check to see if this file part exists. --->
	<cfif FileExists( strFileName )>

		<!---
			Since we know that these smaller files are not too
			big, we can simply do a binary read of the complete
			chunk files into memory.
		--->
		<cffile action="readbinary" file="#strFileName#" variable="binFileData" />

		<!---
			Write that file data to our output stream. We are going
			to pretend that the binary data read we just did was
			actually a byte buffer.
		--->
		<cfset objOutputStream.Write( binFileData, JavaCast( "int", 0 ), JavaCast( "int", ArrayLen( binFileData ) ) ) />

	<cfelse>

		<!---
			We have finished reading in smaller chunk files. We can
			now close the new target file output stream which will
			finalize this process.
		--->
		<cfset objOutputStream.Close() />

		<!--- Break out of this loop. --->
		<cfbreak />

	</cfif>

</cfloop>
As you can see, for the demo, we are reading in the target file one megabyte at a time using a file input stream. That entire megabyte buffer is then written as a part file. Once this process is complete, we then read each part file using CFFile and write it to a file output stream. This way, no more than a megabyte is ever read into memory at any given time. So, while this might produce a lot of files (only 4 in my demo scenario), they are relatively small. Of course, you can increase the buffer size to reduce the number of files, but don't make it too large or you will eat up your memory.
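On the checksum part of the question: verifying a rejoined file does not require holding it in memory either. Here is a minimal sketch in plain Java (the same kind of stream classes the ColdFusion code above reaches through CreateObject) that feeds a file through a small buffer into a digest. The class name, the method name, and the choice of MD5 are assumptions for illustration; the question does not say which checksum the original tutorial used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: names and the MD5 choice are assumptions.
final class StreamingChecksum {

    // Returns the lowercase hex MD5 of the file, reading at most
    // bufferSize bytes into memory at a time.
    static String md5Hex(Path file, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[bufferSize];

        try (InputStream in = Files.newInputStream(file)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                // Only the bytes actually read are fed to the digest.
                digest.update(buffer, 0, bytesRead);
            }
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Comparing the value computed for the original file with the value computed for the rejoined file would confirm that the two are byte-for-byte identical, with memory use bounded by the buffer size.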
Again, I realize that this doesn't exactly address your issue, but maybe this will help. Let me know if you want to see a demo of a buffered, large-chunk file solution. You can definitely use buffers to create large, 1-gig chunk files, but it is a bit more complicated. But let me know, and I can show you.
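For reference, here is a rough sketch of what a buffered, large-chunk version could look like, written in plain Java since the ColdFusion demo is already leaning on java.io streams. The part size and the buffer size are decoupled: each part can be as large as you like, while only one small buffer is ever held in memory, for both the split and the join. The class, method, and parameter names are hypothetical, and the ".partN" naming simply mirrors the demo above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: all names here are assumptions.
final class BufferedSplitJoin {

    // Splits source into "<source>.part1", "<source>.part2", ... where each
    // part holds at most partSize bytes. Returns the number of parts written.
    static int split(Path source, long partSize, int bufferSize) throws IOException {
        if (partSize <= 0 || bufferSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        int partCount = 0;

        try (InputStream in = Files.newInputStream(source)) {
            int bytesRead = in.read(buffer, 0, (int) Math.min(bufferSize, partSize));
            while (bytesRead != -1) {
                partCount++;
                long written = 0;
                try (OutputStream out = Files.newOutputStream(partPath(source, partCount))) {
                    while (bytesRead != -1) {
                        out.write(buffer, 0, bytesRead);
                        written += bytesRead;
                        if (written == partSize) {
                            break; // This part is full.
                        }
                        // Never read past the end of the current part.
                        bytesRead = in.read(buffer, 0,
                                (int) Math.min(bufferSize, partSize - written));
                    }
                }
                if (bytesRead != -1) {
                    // Read ahead so that a part file is only created when
                    // there is data to put in it.
                    bytesRead = in.read(buffer, 0, (int) Math.min(bufferSize, partSize));
                }
            }
        }
        return partCount;
    }

    // Appends part1, part2, ... to target until a part file is missing.
    // Returns the total number of bytes written.
    static long join(Path source, Path target, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;

        try (OutputStream out = Files.newOutputStream(target)) {
            for (int i = 1; Files.exists(partPath(source, i)); i++) {
                try (InputStream in = Files.newInputStream(partPath(source, i))) {
                    int bytesRead;
                    while ((bytesRead = in.read(buffer)) != -1) {
                        out.write(buffer, 0, bytesRead);
                        total += bytesRead;
                    }
                }
            }
        }
        return total;
    }

    static Path partPath(Path source, int index) {
        return Paths.get(source.toString() + ".part" + index);
    }
}
```

A call such as `BufferedSplitJoin.split(file, 1024L * 1024 * 1024, 1024 * 1024)` would then aim for 1-gig parts while holding only a megabyte in memory at a time. The extra complication compared with the demo is the bookkeeping: the code has to track how much of the current part has been written and cap each read so it never spills into the next part.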
Thank you for this; it's really helping a lot. I have a large text file that I need to split into two files of roughly the same size, but I need to make sure that the split happens immediately after a linefeed/carriage return. Is there a way to ensure that the split happens after a certain character in the content? I can't just split it by looping through the file as if it were a list of linefeed-delimited items; it's too large.
I guess you'd have to read in the file using some sort of buffered reader; then, when you are at around the right split size, start checking for the line break?
Thanks for replying, Ben. I figured it out - I split the file in half using your code, and then I read the second part, stripped off the text before the first linefeed, and appended that to the end of the first part. Worked like a charm!
Ok cool - glad you got it working.
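The buffered-reader idea from this exchange (copy until you are around the right split size, then keep going until the next line break) might look something like the following in plain Java. This is only a sketch: the names are hypothetical, and it assumes that lines end in a single-byte linefeed (`\n`, which also covers `\r\n`), as they do in ASCII or UTF-8 text.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: all names here are assumptions.
final class LineAwareSplit {

    // Copies source into two files. The first file receives at least
    // targetSize bytes and ends right after the next '\n'; everything
    // after that goes to the second file. If no '\n' is found past
    // targetSize, the whole source ends up in the first file.
    // Returns the number of bytes written to the first file.
    static long splitAfterLinefeed(Path source, Path first, Path second, long targetSize)
            throws IOException {
        long firstSize = 0;

        try (InputStream in = new BufferedInputStream(Files.newInputStream(source));
             OutputStream outFirst = new BufferedOutputStream(Files.newOutputStream(first));
             OutputStream outSecond = new BufferedOutputStream(Files.newOutputStream(second))) {

            boolean cutMade = false;
            int b;
            while ((b = in.read()) != -1) {
                if (cutMade) {
                    outSecond.write(b);
                    continue;
                }
                outFirst.write(b);
                firstSize++;
                // Once we are at or past the target, cut after a linefeed.
                if (firstSize >= targetSize && b == '\n') {
                    cutMade = true;
                }
            }
        }
        return firstSize;
    }
}
```

Passing half of the source file's size as `targetSize` gives two files of roughly the same size, and the buffered streams keep the byte-at-a-time loop from hitting the disk on every call.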
This was a good read and got me started down the right path. I had a real-world application for this, so I thought I'd post so others might be able to take advantage of streaming large binary data elements from a column.
<!--- block size increment --->
<cfset variables.blockSize="64000" />
<!--- get an initial read of the data elements needed --->
<cfquery name="getAssetMeta" datasource="[ds]">
select asset, asset_title, asset_size, asset_type, asset_content_type, asset_content_subtype
where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
</cfquery>
<!--- stream the binary from the table and append to the buffer obj --->
<cfset objByteBuffer = CreateObject("java","java.nio.ByteBuffer") />
<cfset objBuffer = objByteBuffer.Allocate(JavaCast( "int", getAssetMeta.asset_size )) />
<cfloop index="variables.offset" from="1" to="#getAssetMeta.asset_size#" step="#variables.blockSize#">
<cfquery name="getAsset" datasource="[ds]">
select substring(asset,#VARIABLES.offset#,#VARIABLES.blockSize#) as [asset]
where pk_asset_id = <cfqueryparam cfsqltype="cf_sql_integer" value="#arguments.doc_id#" />
</cfquery>
<cfset byteObj = objBuffer.Put(getAsset.asset) />
</cfloop>
<!--- reset the asset column with the value of the new asset to pass back to calling object --->
<cfset QuerySetCell(getAssetMeta, "asset", byteObj.Array()) />
Hope this helps someone else!
I'm having a little trouble following your queries. Why are you pulling out the assets a chunk at a time .... Ohhh, is this to get around the query MAX size? I know that a query (in ColdFusion) has limits imposed on it based on some CFAdmin settings; chunking the file would allow you to move more data from the Database into the ColdFusion query memory space without getting "fake" truncation issues.
Is that correct?
Sorry for the delay.
Your assumption is correct. I store all of our documents in the database. Some of these documents are as large as 20mb or so. When you increase the query buffer size in cfadmin you start to take on tremendous performance hits.
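The offset arithmetic in that loop can be sketched in plain Java with the database taken out of the picture. Here, `BlockSource` is a hypothetical stand-in for the per-block query, and `inMemory` mimics a 1-based substring over a byte array; neither is part of the code posted above. The sketch only illustrates how fixed-size blocks, requested at offsets 1, 1 + blockSize, and so on, fill a pre-allocated java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch: BlockSource and inMemory are assumptions that
// stand in for the per-block database query.
final class BlockAssembler {

    // Returns up to `length` bytes starting at the 1-based `offset`.
    interface BlockSource {
        byte[] read(int offset, int length);
    }

    // Requests blocks at offsets 1, 1 + blockSize, ... and appends each
    // one to a buffer that was allocated for the full size up front.
    static byte[] assemble(BlockSource source, int totalSize, int blockSize) {
        ByteBuffer buffer = ByteBuffer.allocate(totalSize);
        for (int offset = 1; offset <= totalSize; offset += blockSize) {
            buffer.put(source.read(offset, blockSize));
        }
        return buffer.array();
    }

    // In-memory stand-in: 1-based offset, with the final block truncated
    // at the end of the data.
    static BlockSource inMemory(byte[] data) {
        return (offset, length) -> {
            int from = offset - 1;
            int to = Math.min(data.length, from + length);
            return Arrays.copyOfRange(data, from, to);
        };
    }
}
```

With a 150,000-byte asset and a block size of 64,000, the loop asks for offsets 1, 64001, and 128001, and the last request comes back short. Note that the fully assembled array still sits in memory at the end; the chunking keeps each individual query result small, not the final document.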
To each their own, but I like having my document metadata in the same place as the physical document itself. I hate storing a document in the file system and managing the metadata in the database. In the past, it has always led to maintenance issues and disparate data.
No worries my man at all. I think what you're doing is pretty cool; once I figured out why you were splitting the file, I thought it was intriguing. I have been told that storing files in a database can lead to some performance problems, but it's not something I've tried. If it's working for you, it's a cool solution.
I have designed a web application to upload files from a local machine to a server and download them anywhere in my network.
I have used the Apache Commons upload jar file for this. My application works fine when the file size is up to 1GB, but when the file size exceeds that, it throws an error.
"The field dataFile exceeds its maximum permitted size of 1073741824 bytes."
I think there is a limitation on uploading with the above jar.
But I want to upload large files to the server. Can I achieve this by splitting the file and joining it? (I want to upload any kind of file, e.g. .exe, .zip, etc.)
Is there any other way to achieve the same?