I was looking at someone's SQL error the other day when it turned out that the dude was trying to use a TEXT data field in a GROUP BY clause. This is not allowed (at least in SQL Server). I am not exactly sure why (either it's too costly or the database is not built to grab text blobs on the fly?), but it got me thinking about comparing very large strings (such as the comparison a GROUP BY would have to make).
What happens when you compare a string of any length? Does the cost of the comparisons relate to the lengths of the strings in question? Let's see:
<!---
	Store a large string which will then be used to build an even
	bigger string (FYI: I wrote this as a 300 word count assignment
	in Creative Writing at Tufts).
--->
<cfsavecontent variable="strTextA">
She looked peaceful; eyes closed, head tilted down, breathing - controlled and soft. Areas of fabric, turned dark with sweat, stuck to her body revealing the grizzly figure below it. Cut-off army fatigues could do little to hide her massive and striated legs. Femininity with a warrior's touch. Exhaling sharply, she opened her eyes and mounted the machine next to her. With pads resting on her shoulders and white- knuckled fists grabbing at the handles, she took one deep breath, held it, then lifted the weight from its stack. Her face, once soft and peaceful, was now plagued with pain. She tried to control her body, which began to quake violently beneath the load of five hundred extra pounds. At best, she was able to stop her knees from buckling. She began to slowly lower the heel of her foot down beyond the level of her toes. Resting at the bottom for no more than a split second, she groaned loudly and exploded upwards, flexing her engorged calves as hard as she could. Back down and up, and again and again. With every rep came the surfacing of a new vein, a new ripple, new growth. Then, one the last rep, she held it at the top, biting down, trying to fight the pain. And when she could not hold it anymore, she collapsed. The iron collided with its cradle as she collided with the floor. There she lay, chest heaving, desperately trying to fill her lungs. This time however, she did not try to control her twitching legs. This time she smiled.
</cfsavecontent>

<!---
	Repeat the above string 40 times. This will generate a string
	that is 61,161 characters long, certainly a hard string to
	compare???
--->
<cfset strTextA = RepeatString( strTextA, 40 ) />

<!--- Store a copy of A into B. --->
<cfset strTextB = strTextA />

<!---
	Compare the two strings. Does this have to compare every
	character? If so, it would be tens of thousands of characters
	to compare.
--->
<cftimer label="EQ Operator" type="outline">
	Equals: #(strTextA EQ strTextB)#
</cftimer>
Running the above code, we get:
... and it runs in 0ms. That's instantaneous! So, what's going on? While I have not been educated in this formally, my guess is that this is what the hash code on every Java object is for. A Java string caches its hash code once it has been computed, so when one string is compared to another, it could check that cached value (instantaneous) rather than comparing every character.
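For what it's worth, a full character-by-character comparison of a string this size is not as scary as it sounds. Here is a minimal plain-Java sketch (not ColdFusion's actual EQ implementation, which I have not looked at) that forces two distinct String objects holding the same ~60K characters and times `equals()`. The class name `BigStringCompare` and the helper `buildText` are made up for illustration, and the seed text is just a stand-in for the passage above.

```java
class BigStringCompare {

    // Build a large string by repeating a short seed, much like RepeatString().
    static String buildText(String seed, int times) {
        StringBuilder sb = new StringBuilder(seed.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(seed);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = buildText("She looked peaceful; eyes closed, head tilted down. ", 1200);

        // new String(...) forces a separate object, so the comparison
        // cannot succeed just because both variables point at one object.
        String b = new String(a);

        long start = System.nanoTime();
        boolean same = a.equals(b);
        long elapsed = System.nanoTime() - start;

        System.out.println("Length: " + a.length());
        System.out.println("Same object: " + (a == b));
        System.out.println("Equals: " + same);
        System.out.println("Elapsed (ns): " + elapsed);
    }
}
```

On a typical JVM I would expect the elapsed time to be a small fraction of a millisecond, which a millisecond-resolution timer would happily report as 0ms. So the speed alone does not prove that a hash code shortcut is involved.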
To look into this a bit more, let's dump out the hash code before and after:
Pre Alteration:<br />
#strTextA.HashCode()#<br />
#strTextB.HashCode()#<br />

<!---
	Alter both variables in such a way that the two strings cannot
	be the same. I am using a RandRange() here to ensure that the
	speed is NOT due to compilation optimization of a static string.
--->
<cfset strTextA = ( strTextA & ListGetAt( "A,a", RandRange( 1, 2 ) ) ) />
<cfset strTextB = ( strTextB & ListGetAt( "B,b", RandRange( 1, 2 ) ) ) />

Post Alteration:<br />
#strTextA.HashCode()#<br />
#strTextB.HashCode()#<br />
This gives us the following output:
As you can see, prior to the final edit, both strings have the same hash code, which makes sense since B was assigned straight from A. Then, once the strings are slightly altered, the hash codes are slightly different (indicating that the strings do not contain the same value).
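The "slightly different" part is no accident. The Java documentation defines `String.hashCode()` as `s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]` using int arithmetic, so appending one character `c` to a string whose hash is `h` gives `31*h + c`. Two strings that share everything but the last character will therefore have hash codes that differ only by the difference between those two characters. A small Java sketch of that (the class name `HashCodeSketch` and the method `manualHash` are hypothetical, just for illustration):

```java
class HashCodeSketch {

    // Re-implements the formula from the String.hashCode() documentation:
    // h = 31 * h + c for each character, with int overflow wrapping around.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String base = "some shared text";
        String a = base + "A";
        String b = base + "B";

        // The manual formula matches the built-in hash code.
        System.out.println(manualHash(base) == base.hashCode());

        // Same prefix, last characters one apart ('A' = 65, 'B' = 66),
        // so the two hash codes differ by exactly 1.
        System.out.println(b.hashCode() - a.hashCode());
    }
}
```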
But remember, ColdFusion comparison is NOT case sensitive. In ColdFusion "Test" EQ "TEST" EQ "TeSt". In that case, strings that are not technically the same might still have the same ColdFusion equality. Let's see what the hash codes of two different case strings are:
#ToString( "ABCDEF" ).HashCode()#<br />
#ToString( "AbCdEf" ).HashCode()#<br />
This gives us the following output:
Here, these two strings do not have the same hash code and yet:
Equals: #("ABCDEF" EQ "AbCdEf")#
... gives us:
So obviously, there's something more than simple hash code comparisons going on. Also, a hash does not guarantee uniqueness. A hash value only guarantees that two objects of the same value will have the same hash; two different values can end up with the same hash as well. So, after all this, I am still not sure how ColdFusion (on top of Java) does string comparison soooo freaking fast. I will just have to settle for the fact that ColdFusion is freakin-sweet!
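To make both of those points concrete in plain Java (again, a sketch of the underlying String behavior, not of whatever ColdFusion's EQ operator does internally): two different strings can share a hash code, and two strings with different hash codes can still be equal once case is ignored. The class name `HashVersusEquality` and the helper `sameHash` are hypothetical.

```java
class HashVersusEquality {

    // True when both strings report the same Java hash code.
    static boolean sameHash(String x, String y) {
        return x.hashCode() == y.hashCode();
    }

    public static void main(String[] args) {
        // Different values, same hash code: 65*31 + 97 == 66*31 + 66 == 2112.
        System.out.println(sameHash("Aa", "BB"));   // true
        System.out.println("Aa".equals("BB"));      // false

        // Different hash codes, yet equal when case is ignored.
        System.out.println(sameHash("ABCDEF", "AbCdEf"));          // false
        System.out.println("ABCDEF".equalsIgnoreCase("AbCdEf"));   // true
    }
}
```

So at best a hash code can rule equality out quickly; it can never confirm it on its own, and it is no help for a case-insensitive comparison unless it is computed on a case-normalized copy of the string.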
I am sure this kind of stuff is covered on Day One of Java school, but this kind of stuff is fun for me to explore.