Tracking Delta-Only Data Audits In ColdFusion Using XML
The other day, I explored the idea of an audit table that copies the entire record in question, creating a full-record snap-shot in time. In the comments to that post, there was some great discussion as to how to create audits and why we might want to do it a certain way. As we mostly agreed on, the type of auditing you do is highly dependent on the business strategy and what the audit is being used for. In an application that I am working on right now, the audit data is being used as a page element on many high level, administrative pages for quality control. As such, it was very important for a staff member to be able to pop in and see exactly what was changed and who changed it. Because of this, I found that I needed the audit trail to be very accessible from a human-readable stand point; to aide this effort, I tracked audits using XML strings, recording ONLY what changed from form action to form action.
As you can see from the video, with every audit, I am only capturing the information that was changed via a FROM and a TO field. This data is actually being stored in a single XML string. Because of this XML structure, my audit infrastructure changes. I no longer need a bunch of audit tables, I only need one:
- id [PKEY]
- crud [CHAR]
- data [TEXT]
- date_created [DATE/TIME]
Here, the crud field contains a single character: C,R,U,D for Create, Read, Update, and Delete. The data field contains the XML audit data. Then, any table that we are auditing needs a join table that relates it the audit records. For our contact demo, we created an audit_contact_jn table:
So, the table structure has change dramatically, but it is not really any better than before - we still need an additional table for ever table that we are auditing. The bonus is that it is only two primary keys, which is a bit nicer than a full-on duplicate of the original table.
Let's take a look at the code that makes this possible:
<!--- Param form data. ---> <cfparam name="FORM.id" type="numeric" default="0" /> <cfparam name="FORM.first_name" type="string" default="" /> <cfparam name="FORM.last_name" type="string" default="" /> <cfparam name="FORM.title" type="string" default="" /> <cfparam name="FORM.company" type="string" default="" /> <cfparam name="FORM.submitted" type="boolean" default="false" /> <!--- Check to see if the form as been submitted. ---> <cfif FORM.submitted> <!--- Normally, this is where we would validate data, but for this demo, we are not going to worry about valid data. ---> <!--- Query for the current record. We need to get this before we update / insert so that we can compare the previous and new values before the previous values are updated. ---> <cfquery name="qContact" datasource="#REQUEST.DSN#"> SELECT c.first_name, c.last_name, c.title, c.company FROM contact c WHERE c.id = <cfqueryparam value="#FORM.id#" cfsqltype="cf_sql_integer" /> </cfquery> <!--- Check to see if are inserting or updating data. ---> <cfif FORM.id> <!--- Update the record. ---> <cfquery name="qInsert" datasource="#REQUEST.DSN#"> UPDATE contact SET first_name = <cfqueryparam value="#FORM.first_name#" cfsqltype="cf_sql_varchar" />, last_name = <cfqueryparam value="#FORM.last_name#" cfsqltype="cf_sql_varchar" />, title = <cfqueryparam value="#FORM.title#" cfsqltype="cf_sql_varchar" />, company = <cfqueryparam value="#FORM.company#" cfsqltype="cf_sql_varchar" /> WHERE id = <cfqueryparam value="#FORM.id#" cfsqltype="cf_sql_integer" /> </cfquery> <!--- Set audit type. ---> <cfset strAuditType = "U" /> <cfelse> <!--- Insert the new record. ---> <cfquery name="qInsert" datasource="#REQUEST.DSN#"> INSERT INTO contact ( first_name, last_name, title, company ) VALUES ( <cfqueryparam value="#FORM.first_name#" cfsqltype="cf_sql_varchar" />, <cfqueryparam value="#FORM.last_name#" cfsqltype="cf_sql_varchar" />, <cfqueryparam value="#FORM.title#" cfsqltype="cf_sql_varchar" />, <cfqueryparam value="#FORM.company#" cfsqltype="cf_sql_varchar" /> ); <!--- Return the new ID. ---> SELECT ( LAST_INSERT_ID() ) AS id ; </cfquery> <!--- Store the new id into our form variable. ---> <cfset FORM.id = qInsert.id /> <!--- Set audit type. ---> <cfset strAuditType = "C" /> </cfif> <!--- ASSERT: At this point, we have either created or updated the given record and the ID of that record is stored in FORM.id. ---> <!--- Now that we have our new contact data in place, let's create our audit trail via an XML document. Initially, this XML value will hold every field; we will then filter it based on what has changed. ---> <cfxml variable="xmlAuditData"> <cfoutput> <audit> <field label="First Name"> <from>#XmlFormat( qContact.first_name )#</from> <to>#XmlFormat( FORM.first_name )#</to> </field> <field label="Last Name"> <from>#XmlFormat( qContact.last_name )#</from> <to>#XmlFormat( FORM.last_name )#</to> </field> <field label="Title"> <from>#XmlFormat( qContact.title )#</from> <to>#XmlFormat( FORM.title )#</to> </field> <field label="Company"> <from>#XmlFormat( qContact.company )#</from> <to>#XmlFormat( FORM.company )#</to> </field> </audit> </cfoutput> </cfxml> <!--- Now that all of our XML data is in place, we want to filter it down to only the data that has changed value during the current form action. Therefore, let's delete any node where the FROM and TO values are the same. ---> <cfloop index="intIndex" from="#ArrayLen( xmlAuditData.audit.XmlChildren )#" to="1" step="-1"> <!--- Get short hands to nodes. ---> <cfset xmlFrom = xmlAuditData.audit.XmlChildren[ intIndex ].from /> <cfset xmlTo = xmlAuditData.audit.XmlChildren[ intIndex ].to /> <!--- Check to see if the XML text values for the FROM and TO nodes are different. ---> <cfif (xmlFrom.XmlText EQ xmlTo.XmlText)> <!--- These two text value are the same, so no changes were made. Delete this node from the audit data. ---> <cfset ArrayDeleteAt( xmlAuditData.audit.XmlChildren, intIndex ) /> </cfif> </cfloop> <!--- Insert the new audit record. We should not only have the fields that have changed in this form action. ---> <cfquery name="qInsertAudit" datasource="#REQUEST.DSN#"> INSERT INTO audit ( crud, data, date_created ) VALUES ( <cfqueryparam value="#strAuditType#" cfsqltype="cf_sql_char" />, <cfqueryparam value="#xmlAuditData#" cfsqltype="cf_sql_varchar" />, <cfqueryparam value="#Now()#" cfsqltype="cf_sql_timestamp" /> ); <!--- Get new audit id. ---> SET @id = LAST_INSERT_ID(); <!--- Insert the join to the contact. ---> INSERT INTO audit_contact_jn ( audit_id, contact_id ) VALUES ( @id, <cfqueryparam value="#FORM.id#" cfsqltype="cf_sql_integer" /> ); </cfquery> </cfif> <!--- Query for all record audits for this contact. ---> <cfquery name="qAudit" datasource="#REQUEST.DSN#"> SELECT a.id, a.crud, a.data, a.date_created FROM audit a INNER JOIN audit_contact_jn acjn ON ( a.id = acjn.audit_id AND acjn.contact_id = <cfqueryparam value="#FORM.id#" cfsqltype="cf_sql_integer" /> ) ORDER BY a.id DESC </cfquery> <cfoutput> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html> <head> <title>ColdFusion Audit Table Demo</title> </head> <body> <h1> ColdFusion Audit Table Demo </h1> <form action="#CGI.script_name#" method="post"> <!--- Send submission flag. ---> <input type="hidden" name="submitted" value="true" /> <!--- Send back the ID of the record we are working with. This will be zero the first time. ---> <input type="hidden" name="id" value="#FORM.id#" /> <p> Name:<br /> <input type="text" name="first_name" value="#FORM.first_name#" size="16" maxlength="20" /> <input type="text" name="last_name" value="#FORM.last_name#" size="19" maxlength="30" /> </p> <p> Title:<br /> <input type="text" name="title" value="#FORM.title#" size="40" maxlength="50" /> </p> <p> Company:<br /> <input type="text" name="company" value="#FORM.company#" size="40" maxlength="50" /> </p> <p> <input type="submit" value="Submit" /> </p> </form> <br /> <h2> Contact Audits </h2> <table cellspacing="2" cellpadding="4" border="1"> <thead> <tr> <th> Audit Type </th> <th> Date </th> </tr> </thead> <tbody> <!--- Loop over audit records. ---> <cfloop query="qAudit"> <tr> <td> <cfswitch expression="#qAudit.crud#"> <cfcase value="U"> Updated </cfcase> <cfcase value="C"> Created </cfcase> </cfswitch> </td> <td> #DateFormat( qAudit.date_created, "mm/dd/yy" )# #TimeFormat( qAudit.date_created, "hh:mm TT" )# </td> </tr> <tr> <td colspan="2" style="padding: 20px ; background-color: ##F0F0F0 ;"> <!--- Get the XML for delta in this audit record (the delta being only the data fields that changed in this form action). ---> <cfset xmlAudit = XmlParse( qAudit.data ) /> <table cellspacing="2" cellpadding="4" border="1"> <tr> <th> Field </th> <th> To </th> <th> From </th> </tr> <cfloop index="xmlField" array="#xmlAudit.audit.XmlChildren#"> <tr> <td> #xmlField.XmlAttributes.label# </td> <td> #xmlField.to.XmlText#<br /> </td> <td> #xmlField.from.XmlText#<br /> </td> </tr> </cfloop> </table> </td> </tr> </cfloop> </tbody> </table> </body> </html> </cfoutput>
No doubt about this, when compared to my previous example, there is a lot more overhead to preparing the audit data. Because of this, it is certainly something that cannot be performed via a trigger. Now, despite the additional overhead of this method, I think it does have some advantages. For starters, we only store the data that was altered in a given form action. I know these days hard drive space is so cheap, it shouldn't matter, but I think there is something elegant about only storing what has changed. Along with this, we also get the added benefit of being able to tell at a quick glance what changed in a given form action; since we are only storing the values that changed, we no longer have to look at the previous audit record to see what changes were made. In my application, where that is one of the primary gestures of the Staff members, I have found this to be very efficient.
Storing the audit as an XML string also has the enormous benefit of zero coupling to the original table. Other than the ID in the join table, the audit data is in no way tied to the original data entity. Because we are storing the data points as well as the LABEL (which is very different than the database field name), the original data table can change structure without corrupting old audit records. In my video, I give the example of changing the "title" field to "position"; even with a change such as that, the old audit records would be able to output "Title" rather than "Position" since part of the audit is the user-friendly label we are going to display.
One huge benefit of storing Label vs. database field name might not be apparent at first; where it becomes hugely advantageous is when the field that you are recording is not a standard input field, but rather a foreign key field. To an end user, changing "company_id" from 45 to 904 has zero meaning. However, changing the label "Company" from "Xerox" to "Apple" is much more meaningful with no display overhead. Also, rather than storing foreign keys, we store foreign key values - another way in which our XML style auditing has zero coupling to the rest of the system, making it a true snapshot in time. I am not explaining that very well, but I mean to say that we are storing the "human intent" of the audit rather than the database implementation. Subtle but very powerful difference.
Also, because of this low coupling to the original table, the output of the audit data for the administrative user can be very generic; all you need to do is pass in the name of the join table and the foreign key and you can output the fields and data audits using just the XML data. No information about the primary data entity needs to be known at display time.
Now, there are some down sides to this as well. Because we only have one audit table, it is, potentially, going to get huge. And, because the XML data is primarily-table-agnostic, it would be harder to rebuild the current table based on the audit trail. Xml is also hard to, but not impossible, to report on.
So, like I said before, the method of auditing you chose to do is dependent on the way in which the audit data needs to be used. Since this application is use the audit trail in a very hands-on, checks-and-balances type of way, it was important for me to be able to very quickly display the data that was changed, who changed it, when it was changed, and what elements where changed in a very human readable way.
NOTE: In my actual implementation of this, the XML data is minimized before it is inserted. For simplicity's sake, I am not including this functionality in the demo.
Want to use code from this post? Check out the license.
At the risk of opening this can of worms again (I really am doing it to help you learn/think, not just to poke holes), I don't agree with your statement that "No doubt about this, when compared to my previous example, there is a lot more overhead to preparing the audit data. Because of this, it is certainly something that cannot be performed via a trigger."
While a trigger might not be the best approach (or outside of your personal comfort zone), a trigger could be used to accomplish this same thing. For the purposes of this explanation, consider a db trigger to be similar to a Directory Watcher event gateway in CF. The trigger can be attached to certain types of events (similar to your onAdd, onDelete, etc type methods in your CFC) - so that would cover populating your CRUD field. Within the scope of the trigger, you have access to what has changed (in the case of an add it is all new data and deletes mean all old data - but in the case of a change/update, you can see both the old and the new values, to know which fields have changed). This would allow you at the very least to build up your xml string within the trigger (see ms-sql pseudo code below).
-- update trigger
create trigger on originalTableToWatch
declare @oldValue, @newValue
select @oldValue = del.myFieldToWatch
from deleted as del
select @newValue = ins.myFieldToWatch
from inserted as ins
insert into auditTable (crud,data,datecreated)
'<audit><oldValue>' + @oldValue + '</oldValue><newValue>' + @newValue + '</newValue></audit>'
While obviously this example gets a bit more complex as you look at a table with more than one field to keep track of (and you format your xml exactly as you want it), I just wanted to point out the possibilities with a trigger...
YMMV based on db engine, business/project requirements, and level of sql comfort...
I appreciate what you are saying, but realize that I am not simply storing the values; I am also storing the user-friendly "Label" rather than the column name as defined by the User Interface, not the database structure. In addition to that, I am also escaping the values to make them XML-Safe and putting in the "name" values of foreign keys rather than the IDs (not shown in the demo, but in the explanation).
Perhaps this can all be done via Triggers, but at that point, you would be using a trigger I assume only because you *had* to regardless of how large, non-readable, and cumbersome it got.
But then couldn't you just handle the user friendly "Label" on the display end of things? Your database doesn't need to know what the label is, just what column was accessed and changed. The UI is what needs to know what the user-friendly version of that label is (which has to be contained on the display as the column name is not the same as the "Label"), so your dB would insert the values of what was changed and which column, then the display would show that Address1 = Mailing Address. You could even have a separate table with tableName, columnName, and label for the columns so you can match everything up and not have to worry about passing a label to your dB. Trigger happy again :)
That is true - but then your Audit display becomes interface-specific. If you keep the label in the XML, then you have one generic Audit display for all the auditing in your system.
I am not saying that is worth it, I am just saying that it is one trade off depending on what you are trying to do.
Plus, anytime you change the column name (ex. Title - > Position), you have to put additional CFIF or CFCase statements in your output code.
Again, not the best or worst way to handle this. I just feel that triggers seem like more overhead than standard CF code at this point.
Again, I am not trying push a specific agenda or say that any one solution is ALWAYS any better than another - I was specifically taking issue with the statement that I quoted in my initial reply. Only you can be the final judge of what is best for your situation, I was just trying to encourage you to view things in terms of the best tool for the situation, rather than the tool that you were most comfortable (think back to projects that you did before discovering jQuery).
As always, excellent and very detailed writeup - notice that all of the comments on both this post and the last were about alternative solutions, not about anything wrong with how you implemented your approach!
I completely agree. I am definitely guilty of not know enough about advanced SQL stuff to make educated decisions about such areas. I think its important to remember that all of these posts and comments are really just a discussion, and even when we get heated or misunderstand each other, we are all really just searching for better ways of doing stuff. Good stuff.
Great follow-up to the previous post, love the reasons and the way you're using XML here. That is, to actually improve the screen usability of the audit data (FK conversion, Label, etc).
One question, though: why create the separate audit_contact_in table, rather than adding the table name or table ID as a column in your audit table? Seems like that would remove an entire set of DB management (creating a new mapping table for every entity to audit), without adding any additional coupling, since it really functions as a little more than a filter for your audit SELECT query in either case. (Saves a JOIN while SELECT'ing, too, although that's a less a big deal.)
I have gone back and forth on this issue. Sometimes I do store a "type" or a "table" column; other times, that really makes me feel uncomfortable and I use the join table. Sometimes, I just feel like having the name of a table in a column names is weird.... this is not powered by any rational though, just a gut feeling. And like I said, I have done both.
In this case, i think you are right, having the additional column would make this even more simple. And to be honest, using:
FROM #join_table# INNER JOIN on audit
... is no less weird than just putting the table in a column... either way, it feels funny, so why not go the way that creates less work.
Good point my man.
I agree on the discomfort ... feels like duplicating a sys_objects reference or something, but I like having the table_name reference in there, I think, because it's kinda like data encapsulation. The audit table is solely responsible for holding audit data, no more, no less, and no dependencies.
May not be relevant or true ... just another of those 'feel' things ;-)
I think you may be right. And, for the amount of overhead it cuts down in the join table creation, I think just going with the flow would be good.
In reading this and the last post, I don't see why you can't have both combined into one? Honestly, I think the more comfortable you become with databases, the more likely the trigger solution will feel like the "right" one for you. I generally try to keep to fairly generic SQL because it makes it easier to maintain but the threat of switching database platforms is a) in reality rare for 98% of applications and b) not as big of a hurdle as it might seem. I've worked on Sybase, MSSQL, MySQL, SAP and Postgres and migrated from one to the other and the good news is that it's *really easy* to see when your database queries don't work and thus, track down and resolve. More so if you have unit tests.
Now, as for having the best of both worlds, most mainstream databases have very good internal procedural languages that are quite robust. MSSQL and Sybase have Transact-SQL, Oracle has PL/SQL and PostgreSQL will let you use PL/pgSQL, PL/TCL, PL/Perl, PL/PHP, PL/Ruby, PL/Scheme or PL/Java. You can imagine with all of that at your fingertips you could do some pretty sophisticated database-side operations.
These days I'm most familiar with PG, so here is an example article that shows how to build a changed-value-only single audit table using triggers. This approach guarantees you capture the changes regardless of how they are made (like the occasional sneak-onto-the-prod server and run a manual SQL query):
There are also a couple of plug and play projects on pgfoundry.org as well but I haven't evaluated them yet. I'm sure similar solutions exist for the other database servers as well.
Some food for thought...
this statement echoes my main reason behind feeling that the audit belongs in a trigger
"This approach guarantees you capture the changes regardless of how they are made (like the occasional sneak-onto-the-prod server and run a manual SQL query):"
our occasional is a bit more frequent than I'd like to say :)
Is there a way to capture the user's IP address when you use a trigger-based audit? I am just curious as to how people handle security-based concerns when using triggers (such as capturing IP addresses or the ID of the user that is logged in).
I have not used triggers much, but there is no direct way for the trigger to pick up the client info. Any data available to the RDBMS would have to be passed from CF, since the only connection that exists to the database server is the CF server, not the user's browser. So, if the user info (username, ID, IP, whatever) isn't passed into the original table, I'm not sure how the trigger could get that info to fire off its own INSERT.
That would be passed in in the original query (that the trigger is initiated from).
Can you expand on that a bit more?
Hey Ben, a few months ago I was right where you are now... I had been down the path of storing xml data in the DB using application code, but recently when really studying SQL and deciding I wanted to go to the next level, I started really examining the AdventureWorks DB (among other things of course). One thing of interest to you is the DatabaseLog table. It does what you want via a global trigger.
HOWEVER, it does not provide a way to pass in a username, sessionID, IP address, etc (The login name the table refers to is the connection login, like the one CF uses.
The explanation for the advwks DB is here (It is the MSSQL sample DB):
Their audit is controlled by a global trigger that runs on each table.
Now, further along the lines of what web programmers need... I read something from Robert Vieira once that said we could treat SQL code as reusable the same way we treat our application code as reusable.
That being said, in all my code now I never update, insert, or delete directly from a table. I always use sprocs which if written correctly will always be just about as fast if not faster than a updating tables directly. Then you get the added bonus of SQL servers error catching/rollback/etc features. I'll send a code snippet if you want.
Anyway, in my sprocs I simply started passing in the UserId, sessionId, and IP, whether needed during the update or not. Then my global trigger updates the audit table like msft does adventureWorks.. Now most would think this is a pain in the ass, but it really has its advantages... I just think of it as a framework for DB calls. Now when someone new works on the code, we tell them to pass those 3 queryParams through and we're worry free the rest of the way. Actually I actually usually write the DB interfaces and never let others work on it... Kind of a "too man cooks" scenario...
One final thing.. as for your ''friendly' labels, maybe you can translate the tables on the way out instead of the way in? It would be a case statement that would have the same amount of entries as the amount of tables I guess. Once again, the small price to pay for nice, reusable code...
we have a fair amount of luck with something we are using based on this http://www.nigelrivett.net/AuditTrailTrigger.html (on sql server 2k)
and in cases where you need to know who, beyond the db user, for like ip or username, just include those as fields in the table and update/create/delete their value as well.
What they said :)
"we have a fair amount of luck with something we are using based on this http://www.nigelrivett.net/AuditTrailTrigger.html (on sql server 2k)"
This is a great approach, but let's say take the following example of data in the audit table:
So, in the above example, a Book was updated and the author and category were changed, to illustrate a simple example.
Now, when you want to create a report of the changes, you can't just query the audit table and dump the data, as those Id's in the old and new value mean nothing to the user.
So naturally what is required is a JOIN to the author and category table to get the descriptions linked to the ID, but how are you going to do this using the above mentioned approach?