Parsing And Serializing Large Datasets Using Newline-Delimited JSON In Lucee 5.3.2.77

By Ben Nadel

Published 2019-07-27 in ColdFusion — Comments (5)

A couple of years ago, Adam DiCarlo taught me about the Newline-Delimited JSON format (NDJSON) for data storage. At the time, I took a look at using NDJSON in Node.js; and, since then, it's a format with which I've become quite enamored. Now that I'm starting to dig into Lucee CFML, I wanted to take a quick look at how easy it is to work with NDJSON data using Lucee's File and Compression functions. This is especially timely for me as I am considering ways to export large amounts of data from a Product in a way that can be easily consumed by end-users.

The Newline-Delimited JSON (NDJSON) format is exactly what it sounds like: it's a series of JSON (JavaScript Object Notation) payloads that are all stored in a single file, delimited by a newline character. The beauty of this format is that you can store a massive amount of data in a single file; and then, safely read it back, one line at a time, without running into memory or performance issues.
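
To make the format concrete, here is a minimal sketch using two hypothetical payloads that are not part of the demo below. It joins the payloads into an NDJSON string by hand and then parses that string back, one line at a time. The variable names are only for illustration:

```cfml
<cfscript>

	// Two hypothetical JSON payloads (illustration only). Each one is a complete,
	// valid JSON document that fits on a single line.
	payloads = [
		'{"id":1,"name":"Kim"}',
		'{"id":2,"name":"Sarah"}'
	];

	// NDJSON is just those payloads joined by a newline character.
	ndjson = arrayToList( payloads, chr( 10 ) );

	// Reading it back means splitting on the newline and parsing each line on its own.
	for ( line in listToArray( ndjson, chr( 10 ) ) ) {

		record = deserializeJson( line );
		echo( record.name & "<br />" );

	}

</cfscript>
```

This only works because newline characters inside JSON string values have to be escaped, so a compact (non-pretty-printed) payload never contains a raw newline of its own.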

To see this in action, I'm going to loop over a collection of in-memory records (keeping it simple); and, for each record, I'm going to serialize the data as JSON and then append the JSON to a single export file. Each record will be appended as a new line in the same file.

And, to make this demo a bit more exciting, I'm then going to use Lucee's compress() function to ZIP / archive the .ndjson file as .ndjson.zip:

  
	<cfscript>

		// This is the data we are exporting in NDJSON format.
		records = [
			{ type: "contact", data: { id: 1, name: "Kim" } },
			{ type: "phone", data: { id: 101, contactID: 1, value: "2125551111" } },
			{ type: "phone", data: { id: 102, contactID: 1, value: "2125551118" } },
			{ type: "contact", data: { id: 2, name: "Sarah" } },
			{ type: "contact", data: { id: 3, name: "Arnold" } },
			{ type: "phone", data: { id: 321, contactID: 3, value: "9175558811" } },
			{ type: "phone", data: { id: 208, contactID: 2, value: "9175559928" } }
		];

		// ------------------------------------------------------------------------------- //
		// ------------------------------------------------------------------------------- //

		// The export data file is going to be a Newline-Delimited JSON file (NDJSON). This
		// means that each record is going to be stored as its own JSON payload within the
		// single file, where each payload is delimited by a newline character.
		exportPath = "./data.ndjson";
		exportFile = fileOpen( exportPath, "write" );

		for ( record in records ) {

			// Serialize each record as JSON and write it to the export file on its own line.
			// --
			// NOTE: Looking at the Lucee Java source code, we can see that this function
			// appends "\n" when writing the data to the target file stream wrapper.
			fileWriteLine( exportFile, serializeJson( record ) );

		}

		fileClose( exportFile );

		// Since this is text-data, it should be highly compressible. Let's ZIP this baby up!
		compress( "zip", exportPath, "#exportPath#.zip" );

	</cfscript>

As you can see, each record is serialized as JSON and then appended to the output file. And, when we run the above Lucee CFML code, we get the following file-output:

[Image: Newline-Delimited JSON data exported from Lucee CFML 5.3.2.77.]

As you can see, each line of the export file contains an entire JSON payload.
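
One thing the export demo doesn't cover is what happens if serialization throws part-way through the loop: the file handle would be left open. As a hedged sketch (with a small stand-in records array, and assuming UTF-8 is the charset you want, which the demo does not specify), the write loop can be wrapped in a try / finally block:

```cfml
<cfscript>

	// Stand-in data for this sketch; in the demo, this is the full records array.
	records = [
		{ type: "contact", data: { id: 1, name: "Kim" } },
		{ type: "phone", data: { id: 101, contactID: 1, value: "2125551111" } }
	];

	exportPath = "./data.ndjson";

	// ASSUMPTION: the "utf-8" charset argument is my choice for this sketch.
	exportFile = fileOpen( exportPath, "write", "utf-8" );

	try {

		for ( record in records ) {

			fileWriteLine( exportFile, serializeJson( record ) );

		}

	// Close the file handle whether or not the loop completed.
	} finally {

		fileClose( exportFile );

	}

</cfscript>
```

This mirrors the try / finally that the import side of the demo uses to clean up its temp directory.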

Now, on the import side, instead of reading the entire file into memory and then parsing it - running the risk of performance and Out of Memory (OOM) issues - we can simply iterate over the file, reading it in one line at a time, parsing the smaller, isolated JSON payloads.

Lucee ColdFusion makes this particularly easy using the CFLoop tag. We can use CFLoop to iterate over a file, one line at a time, without having to worry about managing the mechanics of opening and closing the file reference:

  
	<cfscript>

		// When unzipping / extracting an archive using extract(), we can't give it a "target
		// file" - we can only give it a "target directory". As such, I'm going to extract
		// the zip file to a temp directory where we can extract its contents without
		// colliding with the original input file (from the first part of this demo).
		tempDirectory = "./extract-#createUniqueId()#";
		directoryCreate( tempDirectory, true );

		try {

			extract( "zip", "./data.ndjson.zip", tempDirectory );

			records = [];
			// Each line of the extracted NDJSON file contains a separate JSON payload. As
			// such, we can just loop over the contents of the file and deserialize each
			// line in order to access the original data.
			loop
				item = "recordJson"
				file = "#tempDirectory#/data.ndjson"
				{

				records.append( deserializeJson( recordJson ) );

			}

			dump( label = "Imported NDJSON", var = records );

		// No matter what happens, let's delete the temp directory we created.
		} finally {

			directoryDelete( tempDirectory, true );

		}

	</cfscript>

For the sake of this particular demo, I'm extracting the ZIP archive to a temp directory so that I don't overwrite the original file from the "export" portion of the demo. And, when we run the above code, we get the following browser output:

[Image: Newline-Delimited JSON data imported into Lucee CFML 5.3.2.77.]

As you can see, using Lucee ColdFusion's file-loop construct, we were able to easily iterate over the export file, reading and processing one JSON payload at a time.
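
The demo collects every record into an array, which is fine for seven rows but would give back the memory savings on a genuinely large file. A minimal sketch of the alternative is to handle each record as it is read and keep only a running tally. It assumes a hypothetical, already-extracted ./data.ndjson file in which every record has a "type" key, as in the demo data. It also uses an explicit file handle with fileReadLine() instead of the loop construct, and it skips blank lines; both are choices made for this sketch and not something the demo requires:

```cfml
<cfscript>

	// ASSUMPTION: hypothetical path to an already-extracted NDJSON file.
	importPath = "./data.ndjson";
	importFile = fileOpen( importPath, "read", "utf-8" );

	// Running count of records per type; the records themselves are not retained.
	countsByType = {};

	try {

		while ( ! fileIsEOF( importFile ) ) {

			recordJson = trim( fileReadLine( importFile ) );

			// Defensive: skip any blank lines in the file.
			if ( ! len( recordJson ) ) {

				continue;

			}

			record = deserializeJson( recordJson );

			if ( ! structKeyExists( countsByType, record.type ) ) {

				countsByType[ record.type ] = 0;

			}

			countsByType[ record.type ] = ( countsByType[ record.type ] + 1 );

		}

	// Close the file handle whether or not the loop completed.
	} finally {

		fileClose( importFile );

	}

	dump( label = "Record counts by type", var = countsByType );

</cfscript>
```

Run against the seven demo records, the tally should come out to three contacts and four phones, while only one parsed record is held in memory at any given moment.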

There's something so appealing to me about the NDJSON (Newline-Delimited JSON) file format. It holds the promise of storing a tremendous amount of data without the overhead of intensive CPU and Memory processing. It's basically the "divide and conquer" mindset applied to flat-file persistence. This, combined with Lucee's file and compression functions, is going to make it even easier to create export and import functionality for my ColdFusion applications.

Short link: https://bennadel.com/3668

Reader Comments

Charles Robertson Jul 27, 2019 at 10:30 AM

This is why I love reading your articles. I always learn something new. Especially, since you have embraced 'cfscript':

compress()
extract()
loop{}

Are all new to me. I guess I haven't tried to zip anything, using 'cfscript'!
And, I always use:

for(){}

When looping over stuff, so it's interesting to understand how to tackle looping over a file's content in 'cfscript'.

In the past I have used:

cfloop(file=filePath, item="line"){ 
}

But your version looks much more concise.

As for the NDJSON format stuff, this is totally ingenious. It's such a simple idea, but it completely eliminates the 'out of memory' issue. I always squirm at the thought of copying large chunks of file data into memory, especially when it's not always apparent, how much data needs to be copied.

I am now itching to try this out, in one of my projects.

Ben Nadel Jul 28, 2019 at 9:44 AM

@Charles,

Ha ha, I'm glad that I can uncover little nooks and crannies of the language. Though, as far as using loop vs. cfloop, that's an interesting point. These days, all tags are supported in CFScript. But, in the older days, there was really only partial and duct-tape support for certain concepts.

For example, we had things like savecontent, loop, and new Http(). But, now that we have solid tag-support in CFScript, I wonder if it would be better to have more consistency with the tags. Meaning, I think it might be better to use cfloop() instead of loop. That way, all the script-based constructs are "Tag constructs."

Though, I'm just thinking out loud. I am not sure how strongly I feel. Remember, I am moving from ColdFusion 10 to Lucee 5, so I'm taking a massive jump in functionality -- I'm still trying to figure out how to do all the things.

Charles Robertson Jul 29, 2019 at 6:52 AM

Yes. You may have a point. After all, it is only a 2 letter prefix difference and I guess it enforces the CF brand...

Jeff Jones Jul 29, 2019 at 9:45 AM

I prefer not having the cf prefix on the commands as it is redundant.

I know it is cf code and anybody that is viewing it should know it is CF code.

I am waiting for an equivalent of executeQuery for stored procedures.

Ben Nadel Jul 29, 2019 at 11:05 AM

@Jeff,

I do like it without the cf prefix because the syntax is simpler. But, I don't feel very strongly. Also, having the parens makes it feel like a function call, which it is not.

