Deleting Temporary Upload Files In Our K8 Operational Readiness Probe In Lucee CFML 5.3.6.61

By Ben Nadel

Published 2020-09-10 in ColdFusion

Over the weekend, I talked about the fact that some of our Lucee CFML containers at InVision were running out of disk space. It seems that the current release of Lucee will generate a duplicate temporary .upload file per CFThread; and, that it won't delete these files automatically after the parent request has completed. As such, I had to step in and try to resolve the matter programmatically. What I ended up doing was augmenting our Kubernetes (K8) operational readiness probe to monitor the getTempDirectory() folder and delete .upload files that were older than 15-minutes (which was just an arbitrary amount of time). So far, it seems to be working, so I wanted to share my approach in Lucee CFML 5.3.6.61.

Trying To Understand Lucee CFML Temporary File Management

As I've been digging into this stuff, I've learned some things from people like Zac Spitzer and from looking at the Lucee CFML source code. It turns out that the Lucee server actually has a background process that monitors the getTempDirectory() contents and deletes temporary files until one of two conditions is true:

There are less than 100,000 files.
There are less than 1Gb worth of files in aggregate.

The evidence of this background process shows up in log entries that look like this (breaking it onto multiple lines for readability):

"INFO",
"Thread-12351",
"09/10/2020",
"05:20:24",
"lucee.runtime.engine.Controler",
"check size of directory [/opt/lucee/web/temp];
current size [5528820];
max size [1073741824]"

And, if I look at my production log aggregation, I am seeing these entries about once per hour per Lucee server. But, I am also seeing some evidence that this background task is failing. Those entries look like this (breaking it onto multiple lines for readability):

ERROR",
"Thread-103972",
"09/09/2020",
"03:56:08",
"lucee.runtime.engine.Controler",
"cannot remove resource /opt/lucee/web/temp/temp-signature-21.tmp

Now, the part that confuses me is that I am not seeing these two different types of log entry showing up in lock-step. If the process were continually failing, I would expect to see a cannot remove entry following every check size entry. However, this is not the case. When I drill down to one particular pod-ID, I'll see 24 entries for check size and maybe only 2 or 3 entries for cannot remove. As such, it's not clear why the overall process is failing to thrive.

It could be that some logs are getting dropped? I've had issues with missing log entries in the past.

But, regardless of why I'm not seeing the expected log entries, I do have a theory as to why the background task isn't working:

The background task only fires once-an-hour, which is comparatively infrequent with regard to how many .upload files are being generated on some Lucee servers.
The background task bails-out of its process the moment it hits a file-IO error.
It is not using any "intelligence" when deleting old files. Meaning, it just loops over the files in the file-list and deletes them in turn, regardless of age.
It attempts to delete new .upload files that are currently in use by another Lucee thread (because it doesn't sort the files in any way before attempting the delete).

With all of those points together, I do believe Lucee's background task is running and is deleting some files; I just don't believe it is running effectively enough in this case to keep up with the large volume and size of temporary files that some of our containers are creating.

Explicitly Deleting Our Temporary Upload Files In The Background

With all that said, our Platform team was continuing to get paged with Disk Consumption warnings; so, I had to step in and start deleting files programmatically. Here's what I did.

InVision runs its production environments on Kubernetes (K8). And, part of the way K8 works is that each container gets a liveness check, to see if the server is responsive, and a readiness check, to see if the server should receive traffic. The readiness check for my team's ColdFusion services is called every 10-seconds per container. This gives me a hook into executing code on regular basis. This is where I ended up sticking my file-deletion logic.

First, I created helper ColdFusion component to create an abstraction about the getTempDirectory() folder, TempDirectoryHelper.cfc. It provides two methods for gather file statistics and deleting old files:

component
	output = false
	hint = "I provide helper methods for Lucee's temp-directory."
	{

	/**
	* I initialize the temp-directory helper using the given directory as the target
	* directory.
	* 
	* @tempDirectoryPath I am the temp-directory for the server.
	*/
	public void function init( string tempDirectoryPath = getTempDirectory() ) {

		variables.tempDirectoryPath = arguments.tempDirectoryPath;

		// Ensure that the temp-directory ends with a slash - we'll need this when we
		// programmatically generate file-paths later on.
		if ( ! variables.tempDirectoryPath.reFind( "[\\/]$" ) ) {

			variables.tempDirectoryPath &= "/";

		}

	}

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I delete files from the temp-directory that match the given filter and are older
	* than the given age in minutes. I return the number of files deleted.
	* 
	* CAUTION: If a File IO error is thrown, it halts the current process and the error
	* bubbles up to the calling context.
	* 
	* @filter I am the filter to pass to directoryList().
	* @ageInMinutes I am the positive age in minutes after which a file will be deleted.
	*/
	public numeric function deleteOldTempFiles(
		required string filter,
		required numeric ageInMinutes
		) {

		var filesQuery = directoryList(
			path = tempDirectoryPath,
			listInfo = "query",
			filter = filter,
			sort = "dateLastModified ASC",
			type = "file"
		);
		var cutoffAt = now().add( "n", -ageInMinutes );
		var deletedCount = 0;

		loop query = filesQuery {

			if ( filesQuery.dateLastModified < cutoffAt ) {

				fileDelete( tempDirectoryPath & filesQuery.name );
				deletedCount++;

			} else {

				// Since we sorted the file-query by date-descending, we know that once
				// we hit a file that is relatively "new", we've exhausted all of the old
				// files. Everything after this is going to be "new" and can be ignored.
				break;

			}

		}

		return( deletedCount );

	}


	/**
	* I return the size and count stats for the temp-directory.
	* 
	* CAUTION: Any sub-directories that are part of the temp-directory will be counted
	* as a single item of size zero bytes. Right now, I think our problem is related to
	* temporary files, but we can re-evaluate later.
	*/
	public struct function getStats() {

		var totalCount = 0;
		var totalSize = 0;

		if ( directoryExists( tempDirectoryPath ) ) {

			var filesQuery = directoryList(
				path = tempDirectoryPath,
				listInfo = "query"
			);

			totalCount = filesQuery.recordCount;

			loop query = filesQuery {

				totalSize += size;

			}

		}

		return({
			count: totalCount,
			size: totalSize
		});

	}

}

My .deleteOldTempFiles() method is an attempt to do what Lucee's background task is already doing; but, with some added intelligence around the sorting of files by date that should help stave off file-IO conflicts. The thing to remember is that the getTempDirectory() folder is constantly in use, especially on our servers which deal with image processing. As such, we have to take special care when trying to delete files by prioritizing the files are likely to be no longer in play.

With this TempDirectoryHelper.cfc component read to go, I then created two operations feature flags in LaunchDarkly:

OPERATIONS--cfprojectsapi--measure-temp-directory-in-health-check
OPERATIONS--cfprojectsapi--delete-old-temp-files

The former will determine whether or not the K8 readiness probe will gather (and graph) stats about the size of the getTempDirectory() folder; and, the latter will determine whether or not the K8 readiness probe will attempt to delete old files in the getTempDirectory() folder. I wanted to put them under two different feature flags because they can operate independently; and, I wasn't sure what kind of load these two different algorithms would put on the server's CPU. As with all things feature flag related, better safe than sorry.

I then took these feature flags and integrated them into my server's K8 readiness probe. Here is a truncated version of this ColdFusion component showing the parts that are relevant to this blog-post. Essentially, the readiness probe has one method - logProbeRequest() - that spawns a number of background processes, two of which will be our temp-directory code:

component
	output = false
	hint = "I help log JVM system data for the health-probe."
	{

	// ... truncated code ...

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I log the operations probe / health check for the server. The return value
	* indicates whether or not the probe should be kept online (true = online); or, if it
	* should be failed out of rotation.
	*/
	public boolean function logProbeRequest() {

		// NOTE: All of the runSafelyAsync() calls are spawned in a CFThread and
		// synchronized internally such that overlapping health-probe calls will not
		// trigger competing system measurements.

		// ... truncated code ...

		if ( shouldMeasureTempDirectory() ) {

			runSafelyAsync( "measureTempDirectory" );

		}

		// CAUTION: This is a "hack" that we're putting in place because Lucee's native
		// background-task seems to be failing and we don't know why. As such, we're
		// going to see if we can fill in the gap while we continue to debug the issue.
		if ( shouldDeleteOldTempFiles() ) {

			runSafelyAsync( "deleteOldTempFiles" );

		}

		// ... truncated code ...

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I delete temporary upload files from the server's temp directory.
	*/
	private void function deleteOldTempFiles() {

		var fileCount = tempDirectoryHelper.deleteOldTempFiles(
			filter = "tmp-*.upload",
			ageInMinutes = 15
		);

		datadog.increment(
			key = getMetricName( "temp-directory.files-deleted" ),
			magnitude = fileCount,
			tags = tags
		);

	}

	// ... truncated code ...

	/**
	* I record the current Lucee CFML temp-directory usage for this pod.
	*/
	private void function measureTempDirectory() {

		var directoryStats = tempDirectoryHelper.getStats();

		datadog.gauge(
			key = getMetricName( "temp-directory.total-size" ),
			value = directoryStats.size,
			tags = tags
		);
		datadog.gauge(
			key = getMetricName( "temp-directory.total-count" ),
			value = directoryStats.count,
			tags = tags
		);

	}

	// ... truncated code ...

	/**
	* I execute and synchronize access to the given measurement method. This ensures
	* that the act of running the given method doesn't, in and of itself, cause the
	* health probe to fail.
	*/
	private void function runSafelyAsync( required string methodName ) {

		try {

			thread
				name = "probe.#methodName#-thread"
				action = "run"
				methodName = methodName
				{

				try {

					// Now that this method is being invoked asynchronously to the probe
					// request, there's a chance that this measurement may overlap with
					// a subsequent measurement. As such, let's synchronize the
					// measurement operation(s) so that only one-of-a-type can be
					// executing at a time.
					lock
						name = "probe.#methodName#-lock"
						type = "exclusive"
						timeout = 1
						throwOnTimeout = false
						{

						invoke( variables, methodName );

					} // END: Lock.

				} catch ( any error ) {

					debugLogger.error( error, "Error calling [#methodName#()] in the operations probe." );

				}

			} // END: Thread.

		} catch ( any error ) {

			debugLogger.error( error, "Error spawning async thread in the operations probe." );

		}

	}

	// ... truncated code ...

	/**
	* I check to see if the current machine should attempt to delete old files in the
	* server's temp-directory.
	*/
	private boolean function shouldDeleteOldTempFiles() {

		return( featureFlagService.getFeatureByUserID( hostName, "OPERATIONS--cfprojectsapi--delete-old-temp-files" ) );

	}

	// ... truncated code ...
	
	/**
	* I check to see if the current machine should be measuring temp-directory stats as
	* part of the metrics that is emits on each health-check.
	*/
	private boolean function shouldMeasureTempDirectory() {

		return( featureFlagService.getFeatureByUserID( hostName, "OPERATIONS--cfprojectsapi--measure-temp-directory-in-health-check" ) );

	}

	// ... truncated code ...

}

The .logProbeRequest() conditionally invokes two methods in individual background threads:

measureTempDirectory()
deleteOldTempFiles()

These record StatsD / Datadog metrics around files size and delete old files, respectively, using the TempDirectoryHelper.cfc. And, as you can see from the .deleteOldTempFiles() invocation, I'm attempting to delete files that match the given criteria:

Match the file pattern, tmp-*.upload.
Are older than 15-minutes in age.

It's the "older than" part of this that I believe will counteract the failures of Lucee's native background file management.

Once I deployed this and turned it on (gradually and safely using a percentage-based roll-out as I watched the JVM performance), I got to see it in action in DataDog. The following three graphs in order are:

The total size in MB of getTempDirectory(), not accounting for nested folders.
The total number of files in getTempDirectory(), not account for nested folders.
The total number of files that were deleted by my K8 readiness probe background task.

Each of these metrics is being graphed by container, which is why there are so many lines - we have a lot of Lucee CFML servers (for the win)!

StatsD metrics around getTempDirectory() file management in Lucee CFML.

Click here for a slightly larger image

These graphs show a 15-minute time-span in which each point on that graph represents a 10-second aggregation of metrics. This means that every 10-seconds, the background task is deleting as much as 135-files in the getTempDirectory(). Considering that the native Lucee CFML file-cleanup runs only once per hour (and will short-circuit on file IO errors), it's no wonder that it's not keeping up with the demands of the server.

With this K8 readiness probe integration in place on my Lucee CFML 5.3.6.61 servers, we seem to be keeping the problem at bay. The getTempDirectory() folders hover around 4K files taking up less than 1Gb in space. I am sure this is adding additional load to the CPU; but, it's also preventing my Platform engineers from getting paged in the middle of the night. All things are a trade-off.

A Future Without This Hack

I view this K8 readiness probe integration as a "temporary hack" that I will one-day be able to remove when / if Lucee CFML makes the following changes:

They stop duplicating .upload files for each CFThread being spawned. A big part of the problem here is that the volume of temporary files is out-pacing the native clean-up task. There are some bug-tickets around this:
- Parallel processing array.each() duplicates tmp-xxx.upload files on every loop, crashing the server - was already filed.
- Temporary .upload files persisted indefinitely when using CFThread - is the one I filed over the weekend.
It sounds like the "duplication" of temporary files has already been solved in the latest Lucee CFML snapshot; so, I look forward to upgrading my servers in the near(ish) future.
They update Lucee's background file-cleanup to sort files by age in order to avoid file-IO conflicts.

If those happen, I'll be happy to remove my code.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/3891

Reader Comments

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.