
Recording Datadog / StatsD Gauges For Database Key Utilization In Lucee CFML 5.3.7.47


As a fast-follow to yesterday's post on using information_schema to inspect primary and secondary index key utilization in MySQL and Lucee CFML, I wanted to demonstrate how to then take that utilization information and persist it to a StatsD consumer, like Datadog, such that monitoring and alerting can then be applied. Unfortunately, this code isn't even live yet, so I don't have a graph to showcase. But, I thought it would be worth sharing, regardless.

To recap the contextual issue, when you have a numeric column in a database, there's a maximum value that said column can contain. The max depends on the data-type of the column. For example, in MySQL 5.7, an INT can contain a max value of 2,147,483,647 whereas an INT UNSIGNED can contain a max value of 4,294,967,295. If the column in question is an AUTO_INCREMENT column, this value will creep up over time. Most likely, this growth will be slow enough to never become an issue. However, one day, given enough time and enough growth, you may run into the max capacity of said column. At which point, stuff just breaks!
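
To make that risk a bit more tangible, MySQL's information_schema exposes each table's current AUTO_INCREMENT counter, which gives a rough sense of how much runway a table has left. Here's a quick, back-of-the-envelope query (the schema name is hypothetical, and it only considers the signed INT ceiling); the approach below goes further and reads the actual last key value stored in each indexed column:

SELECT
	t.table_name,
	t.auto_increment,
	/* Compare the counter to the signed INT max (2,147,483,647). */
	( t.auto_increment / 2147483647 * 100 ) AS percent_of_signed_int
FROM
	information_schema.TABLES t
WHERE
	t.table_schema = 'my_schema' /* Hypothetical schema name - use your own. */
AND
	t.auto_increment IS NOT NULL
ORDER BY
	t.auto_increment DESC
;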

Wouldn't it be nice to know when that's going to happen? And, wouldn't it be nice to know with enough heads-up time to actually prepare for it in advance? To do this, we can look at how much relative space we've used in our numeric database columns. We can then take those values and record them as StatsD metrics such that we can watch the growth rate of the columns over time. And, we can use a SaaS (Software as a Service) provider like Datadog to then apply monitoring and alerting over these metrics such that we can be alerted when columns start to reach a certain threshold.

After yesterday's post on MySQL's information_schema, I simplified the SQL logic and then broke it up into various ColdFusion components that represented a cleaner separation of concerns:

  • DatabaseKeyGateway.cfc - This ColdFusion component contains the data-access methods that interact with the information_schema and allow each numeric key column to be inspected for its current max value.

  • DatabaseKeyService.cfc - This ColdFusion component contains a method for gathering and aggregating statistical information about all the numeric key columns in the database.

  • WatchDatabaseKeysTask.cfc - This ColdFusion component is intended to be invoked as a scheduled task and translates the database stats into Datadog / StatsD gauge recordings.

In yesterday's post, I had two separate SQL queries for schema reflection: one for primary indices and one for secondary indices. In the end, I didn't need that separation: they're both just indexes and they can be inspected in a uniform way. As such, my DatabaseKeyGateway.cfc only has a single method for gathering key-column information.

Also, in yesterday's post, I knew which environment I was running in. As such, I knew the name of the database schema to look up in the information_schema database. However, in a production context, I (as a developer) may not know the name of the database that I'm inspecting, since it may be getting auto-provisioned by the platform. As such, I had to make the schema logic dynamic.

Luckily, MySQL provides us with DATABASE(), a built-in function that reports the name of the schema that is currently being used. When a ColdFusion datasource is configured, it is configured for a specific schema. As such, if I invoke the DATABASE() function within a <cfquery> tag, it will implicitly give me the schema name of the production database. I can then use that name to drive my information_schema look-ups.
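
As a quick sanity-check of that idea, you can see exactly which schema a datasource is bound to with a one-line query. This is just a small sketch; like the components below, it assumes a default datasource is already configured (hence no datasource attribute):

<cfquery name="schemaCheck">
	SELECT
		DATABASE() AS schemaName
	;
</cfquery>

<!--- Reports the schema that the current ColdFusion datasource is connected to. --->
<cfdump var="#schemaCheck.schemaName#" />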

That said, here is my data-access layer for inspecting / reflecting database key utilization:

component
	output = false
	hint = "I provide data-access methods for monitoring the database key-space."
	{

	/**
	* I get the columns that represent numeric keys in the database.
	*/
	public query function getColumns() {

		```
		<cfquery name="local.results">
			/*
				Since the database schema may not have the same name in every environment
				we'll use the contextual database used by the ColdFusion data-source.
			*/
			SET @schemaName = DATABASE();

			/* DEBUG: databaseKeyGateway.getColumns(). */
			SELECT
				t.table_name,
				s.column_name,
				s.index_name,
				c.column_type
			FROM
				information_schema.TABLES t
			INNER JOIN
				information_schema.STATISTICS s
			ON
				(
						t.table_schema = @schemaName
					AND
						s.table_schema = t.table_schema
					AND
						s.table_name = t.table_name
					AND
						/*
							Some tables have compound keys - that is, a non-synthetic
							primary key that is composed of multiple columns. For the
							sake of our monitoring, we're going to assume that the first
							column is the one that is likely to run out of space. As
							such, we're going to limit the results to columns that are
							the FIRST column in the sequence of indexed-columns.
						*/
						s.seq_in_index = 1
				)
			INNER JOIN
				information_schema.COLUMNS c
			ON
				(
						c.table_schema = t.table_schema
					AND
						c.table_name = t.table_name
					AND
						c.column_name = s.column_name
					AND
						/*
							For the sake of our monitoring, we only care about INT vs.
							UNSIGNED INT. While there are other types of numeric values,
							those are less likely to cause a problem for us.
						*/
						c.data_type = 'int'
				)
			ORDER BY
				c.table_name ASC,
				c.column_name ASC
			;
		</cfquery>
		```

		return( results );

	}


	/**
	* I get the last key value stored in the given column.
	* 
	* CAUTION: This query is SAFE to run because - and ONLY BECAUSE - we are expecting
	* the given column to be the first column in an index on the given table. This should,
	* therefore, use a COVERING INDEX to pull this value right out of memory.
	*/
	public numeric function getLastColumnValue(
		required string tableName,
		required string columnName
		) {

		```
		<cfquery name="local.results">
			/* DEBUG: databaseKeyGateway.getLastColumnValue(). */
			SELECT
				( #columnName# ) AS value
			FROM
				#tableName#
			ORDER BY
				#columnName# DESC
			LIMIT
				1
			;
		</cfquery>
		```

		return( val( results.value ) );

	}

}

As you can see, we use a single SQL query to gather the column meta-data and a single SQL query to gather the max value of a given column. For our monitoring purposes, I'm only worrying about INT columns. There are other types of numeric columns with theoretical maximums; however, the only ones we're likely to run into are INT.

The next layer up - DatabaseKeyService.cfc - then uses the above gateway to aggregate the information into something that can be easily consumed by our scheduled task:

NOTE: At work, we use a dependency-injection (DI) framework that uses the <cfproperty> tag as a way to define dependencies. As such, assume that the databaseKeyGateway component is being injected behind the scenes using one of the generated accessor methods.
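
If you're not using such a framework, a rough manual equivalent (hypothetical wiring, shown purely for illustration) would be to call the setter that accessors = true generates for the property:

// Hypothetical manual wiring - at work, the DI framework does this implicitly
// based on the property declarations and the generated accessor methods.
databaseKeyGateway = new DatabaseKeyGateway();

databaseKeyService = new DatabaseKeyService();
databaseKeyService.setDatabaseKeyGateway( databaseKeyGateway );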

component
	accessors = true
	output = false
	hint = "I provide service methods for monitoring the database key-space."
	{

	// Define properties for dependency-injection.
	property databaseKeyGateway;

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I get the database key usage stats.
	*/
	public array function getKeyStats() {

		var columnsQuery = databaseKeyGateway.getColumns();
		var stats = [];

		loop query = columnsQuery {

			var indexType = ( columnsQuery.index_name == "PRIMARY" )
				? "Primary"
				: "Secondary"
			;
			var maxIntValue = columnsQuery.column_type.findNoCase( "unsigned" )
				? 4294967295
				: 2147483647
			;
			var lastColumnValue = databaseKeyGateway.getLastColumnValue(
				columnsQuery.table_name,
				columnsQuery.column_name
			);

			// Calculate the relative percentage of key-space that we used for this
			// column (based on the current data-type).
			var percentUsed = ceiling( lastColumnValue / maxIntValue * 100 );

			stats.append({
				table: columnsQuery.table_name,
				column: columnsQuery.column_name,
				indexType: indexType,
				currentValue: lastColumnValue,
				maxValue: maxIntValue,
				percentUsed: percentUsed
			});

		}

		return( stats );

	}

}

As you can see, for each column in the introspection query, we go back to the data-access layer and look up the current maximum value in said column. Using that current value and the column data-type, we can then calculate what percentage of the INT space we've used up. It's this percentage value that we want to graph over time so we don't get caught unawares.
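
To make the math concrete, here's a throwaway calculation (the numbers are hypothetical) for an INT UNSIGNED column whose last stored key is about 1.29 billion:

// Hypothetical example: an INT UNSIGNED column whose last key is ~1.29 billion.
lastColumnValue = 1288490188;
maxIntValue = 4294967295;

// ( 1288490188 / 4294967295 ) is roughly 0.2999; so, ceiling( ~29.99 ) gives us 30.
percentUsed = ceiling( lastColumnValue / maxIntValue * 100 );

dump( percentUsed ); // 30 (percent of the available key-space).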

This graphing - or rather, the translation of these percentages into StatsD gauges - is performed by our last ColdFusion component, WatchDatabaseKeysTask.cfc:

component
	accessors = true
	output = false
	hint = "I watch the database key-space to make sure it isn't running out of space!"
	{

	// Define properties for dependency-injection.
	property databaseKeyService;
	property datadog;

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I inspect the numeric key-columns in the database and record stats regarding how
	* much of their INT-space has been used.
	*/
	public void function execute() {

		// Since there are HUNDREDS of numeric keys in the database, let's limit our
		// results to columns that reach a threshold of interest. This should give us
		// plenty of time to react without as much noise.
		var stats = databaseKeyService.getKeyStats().filter(
			( stat ) => {

				return( stat.percentUsed >= 20 );

			}
		);

		for ( var stat in stats ) {

			datadog.gauge(
				key = "database.key_space",
				value = stat.percentUsed,
				tags = [ lcase( "column:#stat.table#.#stat.column#" ) ]
			);
		}

	}

}

Since our database has hundreds of numeric keys, it would be quite noisy to graph all of them all of the time. Instead, I'm waiting until a particular column hits 20% key-space utilization (that's roughly 859M on an INT UNSIGNED column) before recording it as a gauge. This way, we'll limit our graph to columns that actually stand a chance of hitting their maximum storage capacity.

Now, for this particular StatsD gauge, we're using a special feature provided by Datadog: tagging. Notice that the name of the metric is the same for all columns. Instead of differentiating based on metric name, we're providing the table/column data as a tag on the metric. This allows us to create a graph in Datadog using a static metric name that is then broken down by tag. With such a graph, new tags will automatically show up (and be implicitly added to any existing monitoring and alerting) the moment we add a new numeric key column to the database!
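
Under the hood, tagged gauges like this reach the DogStatsD agent as small plain-text datagrams. Assuming our datadog component emits the standard DogStatsD format, the recording for a hypothetical users.id column sitting at 42% utilization would look roughly like this on the wire:

database.key_space:42|g|#column:users.id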

Datadog is the bomb-diggity! I really wish they had a free tier so that I could use them on personal projects.

Unfortunately, I haven't had a chance to deploy this code yet thanks to the massive AWS outage yesterday (Dec 7, 2021). When it's live and I have some data, I'll update this post with a screenshot of Datadog. That said, I'm pretty excited about this! Running out of key-space in the database is something that I've been concerned about for years! Getting monitoring and alerting in place is going to allow me to rest much easier.

Epilogue on Liquibase / Percona Toolkit Migrations

As I'm writing this post, something just occurred to me. When we run database migrations using Liquibase / Percona Toolkit, temporary tables get generated in the database (for online migrations using shadow tables). All of these tables start with an underscore (_). As such, it would probably make sense for me to add some filtering to my information_schema.TABLES query to exclude these temporary tables. Something like:

SELECT
	t.table_name,
	s.column_name,
	s.index_name,
	c.column_type
FROM
	information_schema.TABLES t
INNER JOIN
	information_schema.STATISTICS s
ON
	(
			t.table_schema = DATABASE()
		AND
			t.table_name NOT LIKE '\_%'
		AND
			s.table_schema = t.table_schema
		AND
			s.table_name = t.table_name
		AND
			s.seq_in_index = 1
	)
/* .... truncated .... */

Here, I'm using the LIKE operator to filter out results where the table name starts with an underscore (_). Note that, in the LIKE operator, the underscore is a special character that matches exactly one arbitrary character. As such, we have to use the escape sequence \_ to make sure that we're matching on a literal underscore character.
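
If you want to sanity-check that escaping behavior, a quick throwaway SELECT (the table names here are hypothetical) demonstrates the difference between the wild-card underscore and the escaped, literal underscore:

SELECT
	( 'users' LIKE '_%' ) AS unescaped_match,    /* 1 - the underscore matches ANY single leading character. */
	( 'users' LIKE '\_%' ) AS escaped_match,     /* 0 - 'users' does not start with a literal underscore. */
	( '_users_new' LIKE '\_%' ) AS shadow_match  /* 1 - a Percona-style shadow table does. */
;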

