
Using Amazon S3 As Temporary Storage In Lucee CFML 5.3.6.61


At InVision, we have several workflows that create "temporary files" that we have to make available to users for a brief period of time. Right now, we do that with a few fixed locations which we have to keep track of and then subsequently delete for both security and GDPR (General Data Protection Regulation) compliance reasons. This is a pain. As such, I wanted to noodle on ways in which I could use Amazon AWS S3 (Simple Storage Service) to store temporary files that remain secure but require less coordination and overhead in Lucee CFML 5.3.6.61.

The Best Way Would Be To Use S3 Object Expiration Rules

To be clear, Amazon S3 already provides a means to do this automatically using "Object Expiration" rules. Right now, these rules can be applied:

  • At the bucket level.
  • At the object-prefix level.

This means that any object stored in a given bucket or path-prefix would be automatically deleted after a given period of time without our application having to do any additional work.
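
For context, here is a rough sketch of what such a lifecycle rule looks like. This is the JSON shape accepted by the aws s3api put-bucket-lifecycle-configuration CLI command, using the ttl/ prefix that I'll introduce below; the rule ID and the one-day expiration window are just made-up placeholders:

{
	"Rules": [
		{
			"ID": "expire-ttl-objects",
			"Filter": {
				"Prefix": "ttl/"
			},
			"Status": "Enabled",
			"Expiration": {
				"Days": 1
			}
		}
	]
}

With a rule like that applied to the bucket, AWS would delete anything stored under the ttl/ prefix roughly a day after it was created - no application code required.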

In an ideal world, I would just use this approach. However, this would have to be configured across hundreds of different environments; and, while our Platform Team provides some outstanding automation, I'm told that performing this level of configuration would be a bit tricky - and would not be a high priority for a team whose road-map is already chock-full of feature requests.

As such, while the S3 object expiration rules are the obvious right choice, they are not an approach I can leverage at this time.

The Brute-Force Way Given My Constraints

Given the fact that I can't use the "correct approach" of Object Expiration rules, I have to come up with something that requires more explicit logic. And, the approach that I've been noodling on is to use an Amazon AWS S3 path-prefix kind of like a queue of objects. I got this idea while looking through the AWS Java SDK documentation, where I saw that listing objects uses alphabetical ordering:

public ObjectListing listObjects() - Returns a list of summary information about the objects in the specified buckets. List results are always returned in lexicographic (alphabetical) order.

Given this behavior, if I can store objects using a resource path that starts with a fixed-width Date/Time stamp, then it means that subsequent list operations for said objects will always return the oldest objects first.
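
And, just to make the "fixed-width" part of that concrete, here's a quick illustration using CFML's plain text sort as a stand-in for S3's lexicographic listing - notice how the ordering breaks down the moment the zero-padding goes away:

<cfscript>

	// Zero-padded, fixed-width time tokens sort in chronological order.
	padded = [ "09-05-01-aaa.txt", "10-22-45-bbb.txt" ];
	padded.sort( "text" );
	dump( padded ); // [ 09-05-01-aaa.txt, 10-22-45-bbb.txt ] - oldest first.

	// Without the zero-padding, "9" sorts AFTER "10" and the chronology breaks.
	unpadded = [ "9-05-01-aaa.txt", "10-22-45-bbb.txt" ];
	unpadded.sort( "text" );
	dump( unpadded ); // [ 10-22-45-bbb.txt, 9-05-01-aaa.txt ] - newest first!

</cfscript>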

For the sake of this exploration, I'm going to use the path prefix ttl/ (TTL stands for "Time To Live"). Now imagine that I start storing objects using resource paths that look like this:

ttl / { yyyy-mm-dd } / { HH-mm-ss }-{ some unique token }

To make this more concrete, here are a few made-up examples of this path in which we interpolate values into the above placeholders:

  • ttl/2020-09-28/06-52-13-11111.txt
  • ttl/2020-09-28/15-03-48-22222.txt
  • ttl/2020-09-29/11-48-27-33333.txt
  • ttl/2020-09-29/23-34-38-44444.txt

As you can start to see, the fixed-width date/time tokens cause the objects to be stored in ascending date-based order. Which, again, means that when we go to list objects under the ttl/ prefix, the objects are going to be returned in that same ascending date-based order - in other words, with the oldest objects first.

Using this path scheme, I can then start storing objects as needed as long as I have a process in the background which periodically inspects the ttl/ prefix and deletes objects over a certain age. And, since objects are listed lexicographically, it means that the oldest objects - and the objects most likely in need of deletion - are going to be the first ones returned in the listing.

To see this exploration in action, I've created two ColdFusion pages: one that creates a bunch of files using a fixed-width date/time prefix; and, one that then starts deleting files under the ttl/ prefix that are older than a calculated cut-off period.

I've also created a simple S3-Client wrapper in Lucee CFML; but, we'll look at that later since it's more an implementation detail than a pertinent part of this exploration. First, let's look at the ColdFusion page that stores the objects under the ttl/ prefix:

<cfscript>

	jarPaths = directoryList(
		path = expandPath( "../aws-s3-sdk/" ),
		recurse = true,
		listInfo = "path",
		type = "file",
		filter = "*.jar"
	);

	s3Client = new S3Client(
		awsAccessID = server.aws.accessID,
		awsSecretKey = server.aws.secretKey,
		awsRegion = server.aws.region,
		awsBucket = server.aws.bucket,
		jarPaths = jarPaths
	);

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// Let's create a bunch of test files that we can subsequently delete, yay!
	for ( i = 1 ; i <= 25 ; i++ ) {

		moment = dateConvert( "local2utc", now() );
		// When we later "list" the objects in the "/ttl/" prefix, AWS will always return
		// the objects in lexicographic (alphabetical) order. As such, we want to store
		// the objects using a fixed-width date/time prefix. This way, the "oldest"
		// objects will always be the first ones returned during list-pagination.
		dateToken = moment.dateFormat( "yyyy-mm-dd" );
		timeToken = moment.timeFormat( "HH-mm-ss" );
		uuidToken = createUUID().lcase();

		// NOTE: We're adding a path-delimiter between the Date and Time just to make the
		// AWS console easier to navigate (since it will group the common date-prefixes
		// into a "folder like" structures in which each date is its own folder).
		resourcePath = "/ttl/#dateToken#/#timeToken#-uuid-#uuidToken#.txt";

		s3Client.putObject(
			resourcePath,
			charsetDecode( "Hello world, testing with #resourcePath#.", "utf-8" ),
			"text/plain"
		);

	}

</cfscript>

As you can see, there's very little logic in this page - I'm just using two date/time masks (yyyy-mm-dd and HH-mm-ss) to create a fixed-width, lexicographically-sortable prefix for each object. And then, of course, I am including a UUID (Universally Unique ID) to make sure that two objects stored within the same second don't collide with each other.

Once these objects start collecting on Amazon AWS S3, I then need a secondary process that scans the ttl/ prefix looking for objects of a certain age. This process wouldn't have to run very often - even running this weekly or monthly would probably be sufficient:

<cfscript>

	jarPaths = directoryList(
		path = expandPath( "../aws-s3-sdk/" ),
		recurse = true,
		listInfo = "path",
		type = "file",
		filter = "*.jar"
	);

	s3Client = new S3Client(
		awsAccessID = server.aws.accessID,
		awsSecretKey = server.aws.secretKey,
		awsRegion = server.aws.region,
		awsBucket = server.aws.bucket,
		jarPaths = jarPaths
	);

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// The objects located under the "TTL" path-prefix are stored with a fixed-width
	// date/time prefix. As such, the lexicographic (alphabetical) order of the Listing
	// will ensure that the oldest files are returned first. Therefore, we are going to
	// keep getting (and deleting) the first page of objects until we hit one that is too
	// young to delete.
	do {

		// NOTE: For this exploration, I'm using low numbers (MaxKeys and Cutoffs). But,
		// in a production environment, I'd use larger values.

		// NOTE: When listing the objects, we don't need to use a "continuation key"
		// because we are DELETING the oldest objects within each do-while iteration. As
		// such, every "next" iteration of this loop just has to get the oldest files
		// that STILL EXIST (which, lexicographically, are going to be the first ones
		// returned in the list operation).
		results = s3Client.listObjects(
			resourcePathPrefix = "/ttl/",
			maxKeys = 5
		);

		// The AWS API returns the "lastModified" date in the UTC timezone. As such, we
		// have to convert the local date to UTC when calculating the cutoff.
		// --
		// NOTE: In a production environment, the server would likely be running in UTC as
		// well and this step would be superfluous. However, I like to leave my local
		// development environment running in EST so that I can more easily spot date-
		// handling errors in my logic. Meaning, I don't want my code to ASSUME that it
		// is running in UTC - I want it to always convert to UTC explicitly.
		cutoffAt = now().add( "n", -3 );
		cutoffAtUTC = dateConvert( "local2utc", cutoffAt );

		pathsToDelete = results.objects
			// Limit to objects older than the cut-off.
			.filter(
				( summary ) => {

					return( summary.lastModified < cutoffAtUTC );

				}
			)
			// Map the object summaries to raw S3 paths (which we will delete).
			.map(
				( summary ) => {

					return( summary.key );

				}
			)
		;

		// If this batch of object listings resulted in any S3 objects to delete, we can
		// perform a multi-object delete of up to 1,000 keys (the max number of keys
		// allowed by AWS at the time of this writing).
		if ( pathsToDelete.len() ) {

			echo( "PATHS: <br />" );
			echo( pathsToDelete.toList( "<br />" ) );
			echo( "<br /><br />" );

			s3Client.deleteObjects( pathsToDelete );

		}

		// Since we are going to continue looping until we hit an object that is too
		// young to delete, we know that we can short-circuit the process if the filtered
		// paths don't match the raw results. The moment we have more results than we
		// have paths to delete, we know that the current iteration came across an object
		// that crossed-over the cutoff-line (ie, was too young to delete).
		doContinueDeleting = (
			results.isTruncated &&
			( pathsToDelete.len() == results.objects.len() )
		);

	} while ( doContinueDeleting );
	
	echo( "Done deleting." );

</cfscript>

As you can see, this background process is using the ttl/ prefix like a queue in which the oldest objects are at the head of the queue. Then, our do-while loop is essentially "popping objects" off of that queue and deleting them. And, it keeps popping, deleting, and looping until it hits an object that is "too young" to delete. We can use the ttl/ prefix like a queue because the fixed-width date/time prefix forces the lexicographic behavior of object-listing to return the oldest objects first.

If I now run this ColdFusion deletion code in the browser, I get the following output:

PATHS:
ttl/2020-09-29/10-27-23-uuid-367d4ae6-fae3-42e6-92b153f2354dacbb.txt
ttl/2020-09-29/10-27-23-uuid-c2b8635a-9f52-4a31-b7fa2f4d53de69ba.txt
ttl/2020-09-29/10-27-23-uuid-f1ccc14a-03c4-4a46-95ff16575559b9a0.txt
ttl/2020-09-29/10-27-24-uuid-01fd5091-2468-4c26-a30bcb8d92a96ce9.txt
ttl/2020-09-29/10-27-24-uuid-30d9ab0b-9626-4165-a4829bc2d0d637ef.txt

PATHS:
ttl/2020-09-29/10-27-24-uuid-31a6a97c-5dd5-4a0f-92dd2b5e7295eee6.txt
ttl/2020-09-29/10-27-24-uuid-4172872d-e7d9-4886-b608ed21e265c121.txt
ttl/2020-09-29/10-27-24-uuid-429430c9-6734-4303-88dfab3fcb276877.txt
ttl/2020-09-29/10-27-24-uuid-4c032e8f-f43d-4d62-8253900740c3aa36.txt
ttl/2020-09-29/10-27-24-uuid-541a2ffa-3322-4538-89c34583dfa7a91b.txt

PATHS:
ttl/2020-09-29/10-27-24-uuid-565a34b0-fb6d-4fcf-b77600e26712abbf.txt
ttl/2020-09-29/10-27-24-uuid-899eff4a-dc22-489d-a96ed7c91258ad8a.txt
ttl/2020-09-29/10-27-24-uuid-dc2d8ce5-bd6a-42ab-861b03e1b2143aa9.txt
ttl/2020-09-29/10-27-24-uuid-efac7411-6b41-4c51-be39bc79710acc80.txt
ttl/2020-09-29/10-27-25-uuid-19643375-fa32-4b3a-a5e462f31a4d6771.txt

PATHS:
ttl/2020-09-29/10-27-25-uuid-48a9b0ac-5abd-4734-bd6639d4b37c0a8d.txt
ttl/2020-09-29/10-27-25-uuid-51a510f5-143b-49f6-bfbb038ef36b2a50.txt
ttl/2020-09-29/10-27-25-uuid-757ccd26-0c63-44e5-97dc8c8978cd4101.txt
ttl/2020-09-29/10-27-25-uuid-8bc2f069-e742-4d88-bdfdf80e528ccede.txt
ttl/2020-09-29/10-27-25-uuid-8f3647a7-93c8-43a9-9b99a7081a14be7b.txt

PATHS:
ttl/2020-09-29/10-27-25-uuid-b051903c-d253-4f8a-a44b8017ac3b0e75.txt
ttl/2020-09-29/10-27-25-uuid-b59b1b51-4697-49f5-9a1464c1287acd71.txt
ttl/2020-09-29/10-27-25-uuid-d550cc96-de96-41dd-8b1cb782deb068c8.txt
ttl/2020-09-29/10-27-25-uuid-f385159d-0b82-4558-aa6aba6e6307f7c0.txt
ttl/2020-09-29/10-27-26-uuid-c39834f9-5dc7-4a18-9babd84bdbca1361.txt

Done deleting.

All of my test S3 objects were created "today", so the date-portion is all the same. But, as you can see from the HH-mm-ss timestamp portion, the oldest objects were returned first, again, treating the ttl/ prefix like a FIFO (First In, First Out) queue.

To be clear, if I could use the "Object Expiration" rules in Amazon AWS S3, this would be a no-brainer - I wouldn't have to do any work within my Lucee CFML application code. However, given some of our platform constraints, that's not an option. That said, I feel like the "prefix as queue" approach could really work. And, one thing our platform does have is the ability to run background tasks on a regular basis.
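
To make that last point a bit more concrete, here's a hedged sketch of what wiring the clean-up page into a recurring job might look like using Lucee's built-in task scheduler. To be clear, our platform has its own task runner; and, the task name and URL below are completely made-up placeholders:

<cfscript>

	// Register (or update) a scheduled task that requests the clean-up page once a day.
	// NOTE: The task name and URL are made-up placeholders for this sketch.
	schedule
		action = "update"
		task = "s3-ttl-cleanup"
		operation = "HTTPRequest"
		url = "https://my-app.example.com/s3-ttl/delete-expired-objects.cfm"
		startDate = "2020-09-29"
		startTime = "00:00:00"
		interval = "daily"
	;

</cfscript>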

As part of this exploration, I had to create a small Amazon AWS S3 client-wrapper. And, to do that, I used Lucee CFML's ability to load Java classes from a set of JAR files; which is one of the best things since sliced bread! I still don't have a good way to download files from Maven; so, I manually went and grabbed the following files:

  • avalon-framework-4.1.5.jar
  • aws-java-sdk-core-1.11.870.jar
  • aws-java-sdk-kms-1.11.870.jar
  • aws-java-sdk-s3-1.11.870.jar
  • commons-codec-1.11.jar
  • commons-logging-1.1.3.jar
  • commons-logging-1.2.jar
  • httpclient-4.5.9.jar
  • httpcore-4.4.11.jar
  • ion-java-1.0.2.jar
  • jackson-annotations-2.6.0.jar
  • jackson-core-2.6.7.jar
  • jackson-databind-2.6.7.3.jar
  • jackson-dataformat-cbor-2.6.7.jar
  • jmespath-java-1.11.870.jar
  • joda-time-2.8.1.jar

These are the JAR files that I am passing to my S3Client.cfc ColdFusion component:

component
	output = false
	hint = "I provide a simple wrapper for the S3 Client JDK classes."
	{

	/**
	* I initialize the S3 Client wrapper with the given configuration.
	*/
	public void function init(
		required string awsAccessID,
		required string awsSecretKey,
		required string awsRegion,
		required string awsBucket,
		required array jarPaths
		) {

		variables.awsAccessID = arguments.awsAccessID;
		variables.awsSecretKey = arguments.awsSecretKey;
		variables.awsRegion = arguments.awsRegion;
		variables.awsBucket = arguments.awsBucket;
		variables.jarPaths = arguments.jarPaths;

		variables.s3Client = createS3Client();

	}

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I delete the objects located at the given paths.
	* 
	* CAUTION: At the time of this writing, a maximum of 1,000 paths can be deleted at
	* one time.
	* 
	* @resourcePaths I am the collection of keys to delete.
	*/
	public struct function deleteObjects( required array resourcePaths ) {

		var normalizedPaths = resourcePaths.map(
			( resourcePath ) => {

				return( normalizeResourcePath( resourcePath ) );

			}
		);

		var deleteConfig = loadClass( "com.amazonaws.services.s3.model.DeleteObjectsRequest" )
			.init( awsBucket )
			.withQuiet( true )
			.withKeys( javaCast( "string[]", normalizedPaths ) )
		;

		var awsResponse = s3Client.deleteObjects( deleteConfig );

		return({
			awsResponse: awsResponse
		});

	}

	/**
	* I return a paginated list of objects located under the given prefix.
	* 
	* @resourcePathPrefix I am the key-prefix as which to search.
	* @continuationToken I am the continuation token for an existing pagination.
	* @maxKeys I am the maximum number of objects to return in a given page.
	*/
	public struct function listObjects(
		required string resourcePathPrefix,
		string continuationToken = "",
		numeric maxKeys = 1000
		) {

		var listConfig = loadClass( "com.amazonaws.services.s3.model.ListObjectsV2Request" )
			.init()
			.withBucketName( awsBucket )
			.withPrefix( normalizeResourcePath( resourcePathPrefix ) )
			.withMaxKeys( maxKeys )
		;

		// If a list of results has been truncated (ie, there are more objects at the
		// given prefix than there are maxKeys), then we can provide a "cursor" at which
		// to get the next set of objects.
		if ( continuationToken.len() ) {

			listConfig.withContinuationToken( continuationToken );

		}

		var awsResponse = s3Client.listObjectsV2( listConfig );

		var objects = awsResponse.getObjectSummaries().map(
			( summary ) => {

				return({
					bucket: summary.getBucketName(),
					key: summary.getKey(),
					size: summary.getSize(),
					// NOTE: The AWS API returns the dates in UTC/GMT timezone. However,
					// the AWS SDK appears to parse them into the server's local
					// timezone. As such, let's convert them back to UTC for the sake of
					// consistency with our date-handling.
					lastModified: dateConvert( "local2utc", summary.getLastModified() )
				});

			}
		);

		return({
			objects: objects,
			maxKeys: awsResponse.getMaxKeys(),
			isTruncated: awsResponse.isTruncated(),
			continuationToken: awsResponse.getNextContinuationToken(),
			awsResponse: awsResponse
		});

	}


	/**
	* I store the given object / binary payload at the given path.
	* 
	* @resourcePath I am the key at which to store the object.
	* @resourceContent I am the object data being stored.
	* @resourceContentType I am the content/mime-type associated with the object.
	*/
	public struct function putObject(
		required string resourcePath,
		required binary resourceContent,
		required string resourceContentType
		) {

		var objectMetadata = loadClass( "com.amazonaws.services.s3.model.ObjectMetadata" )
			.init()
		;
		objectMetadata.setSSEAlgorithm( objectMetadata.AES_256_SERVER_SIDE_ENCRYPTION );
		objectMetadata.setContentLength( arrayLen( resourceContent ) );
		objectMetadata.setContentType( resourceContentType );

		var normalizedPath = normalizeResourcePath( resourcePath );
		var resourceStream = asByteArrayInputStream( resourceContent );
		var awsResponse = s3Client.putObject( awsBucket, normalizedPath, resourceStream, objectMetadata );

		return({
			awsResponse: awsResponse
		});

	}


	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I create a byte-array input stream from the given binary value.
	* 
	* @value I am the binary payload being wrapped.
	*/
	private any function asByteArrayInputStream( required binary value ) {

		return( createObject( "java", "java.io.ByteArrayInputStream" ).init( value ) );

	}


	/**
	* I create the S3 Client instance that this component is wrapping.
	*/
	private any function createS3Client() {

		// Load some class that we'll use for static references.
		var ProtocolClass = loadClass( "com.amazonaws.Protocol" );
		var RegionClass = loadClass( "com.amazonaws.regions.Region" );
		var RegionsClass = loadClass( "com.amazonaws.regions.Regions" );

		var clientConfig = loadClass( "com.amazonaws.ClientConfiguration" ).init();
		clientConfig.setProtocol( ProtocolClass.HTTPS );
		clientConfig.setMaxErrorRetry( 4 );

		var clientCredentials = loadClass( "com.amazonaws.auth.BasicAWSCredentials" )
			.init( awsAccessID, awsSecretKey )
		;

		var clientOptions = loadClass( "com.amazonaws.services.s3.S3ClientOptions" )
			.init()
			.withPathStyleAccess( true )
		;

		var clientRegion = RegionClass.getRegion( RegionsClass.fromName( awsRegion ) );

		var s3Client = loadClass( "com.amazonaws.services.s3.AmazonS3Client" )
			.init( clientCredentials, clientConfig )
		;
		s3Client.setS3ClientOptions( clientOptions );
		s3Client.setRegion( clientRegion );

		return( s3Client );

	}


	/**
	* I load the Java class with the given name (using the AWS SDK JAR files).
	* 
	* @className I am the class to load.
	*/
	private any function loadClass( required string className ) {

		return( createObject( "java", className, jarPaths ) );

	}


	/**
	* I normalize the given path, stripping off any leading slashes.
	* 
	* @resourcePath I am the path being normalized.
	*/
	private string function normalizeResourcePath( required string resourcePath ) {

		return( resourcePath.reReplace( "^[\\/]+", "", "one" ) );

	}

}
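
One caveat on the deleteObjects() method above: it passes the keys straight through to AWS, which caps a single multi-object delete at 1,000 keys. My demo never comes close to that limit since each do-while iteration only lists maxKeys objects; but, if a caller ever had a larger set of paths, a simple batching loop - just a hedged sketch - might look like this:

<cfscript>

	// ASSUMPTION: "pathsToDelete" may contain more than 1,000 keys. Since AWS limits a
	// single multi-object delete to 1,000 keys, slice the array into batches first.
	batchSize = 1000;

	for ( offset = 1 ; offset <= pathsToDelete.len() ; offset += batchSize ) {

		batch = pathsToDelete.slice( offset, min( batchSize, pathsToDelete.len() - offset + 1 ) );
		s3Client.deleteObjects( batch );

	}

</cfscript>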

I love being able to load JAR files on-the-fly in Lucee CFML. Just one of the many language features that I couldn't live without at this point. Lucee CFML is so darn wonderful!
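
If you haven't seen that feature before, the heart of it is just the optional third argument to createObject(), which accepts an array of JAR paths. Here's the pattern in isolation - a minimal sketch that mirrors what the loadClass() method above is doing:

<cfscript>

	// Gather the physical paths of all the AWS SDK JAR files we downloaded.
	jarPaths = directoryList(
		path = expandPath( "../aws-s3-sdk/" ),
		recurse = true,
		listInfo = "path",
		type = "file",
		filter = "*.jar"
	);

	// The third argument tells Lucee to load the class from the given JAR files
	// instead of from the server's global class-path.
	metadataClass = createObject(
		"java",
		"com.amazonaws.services.s3.model.ObjectMetadata",
		jarPaths
	);

	dump( metadataClass.init() );

</cfscript>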

Epilogue On Storing Transient S3 Paths In A Database

When I first started noodling on this approach, one idea that I had was to store S3 paths in a database table. Then, I could have my background task scan the database table for "expired" paths which it would then delete from S3. The benefit of this approach is that I wouldn't have to arbitrarily "list" objects on S3 - I would own the "source of truth" for which files had to be deleted. It would also allow me to potentially set different expiration dates for different paths.

The more I thought about this approach, however, the more I came to see it as unnecessary complexity. Having to keep a remote system (S3) in sync with a local system (the database) introduces points of failure that my putting and deleting algorithms would then have to manage. Treating the S3 bucket as the "source of truth" just removes all that complexity. And, if I don't run the background task all that often, I can't imagine there will be much of a cost or performance penalty.
