After recently learning about the SQL PATINDEX() function from Nathan Mische and Sana, I did some Googling to find out more about it. In my research, I came across another SQL function that I didn't know about: CHARINDEX(). Both functions, PATINDEX() and CHARINDEX(), take a substring and find its one-based position within a given string or field value. The difference is that PATINDEX() looks for patterns (like the SQL LIKE construct) and CHARINDEX() just searches for string literals.
This seemed like a really exciting new function, CHARINDEX(), because the majority of my searches do not involve pattern matching and any time I can get away with literal string matching, I just know it's gonna be faster. After some preliminary tests, it seemed like CHARINDEX() was about 100% faster than both PATINDEX() and the SQL LIKE clause, executed in half the time. This was very exciting, but then I noticed something strange: SQL CHARINDEX() was not returning all of the records that PATINDEX() and LIKE were returning. It got most of them, but missed a few records here and there.
After a whole lot of Googling and reading up on the SQL documentation, I could find no mention of this and no notes on any kind of limitation of SQL's CHARINDEX() function. I was just about to give up when I noticed a pattern emerging; CHARINDEX() was failing to find substrings on text values where the index of the substring was very high, usually above 7,000.
To test this theory, I ran the following ColdFusion script which does a loop and populates an in-memory table and then searches for a substring. For each loop iteration, I am offsetting the substring by increments of 2,000 characters:
- To test to see if the T-SQL Function, CHARINDEX() has
- problems with large strings, we are going to start
- looping over an increasingly large prefix-buffer to see
- if CHARINDEX() can still find the matching substring.
- Bulid a huge padding string that we will put infront
- of the strings we insert into the data table.
- <cfset strX = RepeatString( "x", intPaddingSize ) />
- <!--- Test CHARINDEX(). --->
- <cfquery name="qTest" datasource="#REQUEST.DSN.Source#">
- Build the data table that we are going to populate
- with our increasingly large strings.
- @data TABLE (
- value TEXT
- Populate the data table with various values,
- each of which contains some variation of the
- substring "sexy".
- INSERT INTO @data
- SELECT '#strX#You are quite sexy in that dress'
- UNION ALL
- SELECT '#strX#Your sexy-factory is huge'
- UNION ALL
- SELECT '#strX#Sexyness is quite subjective'
- UNION ALL
- SELECT '#strX#You lookin SEXY, girl'
- UNION ALL
- SELECT '#strX#Those legs are straight-up sexy!'
- UNION ALL
- SELECT '#strX#Jodie Foster is highly !!SeXy!!'
- UNION ALL
- SELECT '#strX#"ColdFusion 8 Is Mad Sexy!"'
- Select the CHARINDEX() of sexy for each of the
- values above. Since we *KNOW* that each phrase
- contains "sexy", then we should get 6 non-zero
- values (CHARINDEX() is one-based).
- CHARINDEX( 'sexy', value )
- ) AS charindex
- <!--- Output query results. --->
- label="CHARINDEX() Test With #intPaddingSize# Padding"
- <br />
As you can see, for each loop iteration, I am adding 2,000 more leading "x" characters to the data values. Also notice that my "sexy" substring is represented in various character cases. I am doing this to demonstrate that CHARINDEX() is NOT case sensitive in my database (although, I think this has to do with the initial DB settings and is not a universal truth).
Running the above code, we get the following ColdFusion CFDump outputs:
| || || |
| || |
| || || |
As you can see, for the first 4 loop iterations, CHARINDEX() is successfully finding the "Sexy" substring. But, once our character padding gets up above 8,000, CHARINDEX() stops locating the proper substring index. Very strange! Both the SQL LIKE clause and the SQL PATINDEX() function continue to work on these high offsets, but for some reason, CHARINDEX() just punks out.
Any thoughts on this? Is this a database setting?
NOTE: The CHARINDEX() still fails to find the substring even if I use the optional third argument - search start index - that would start after the character padding. It looks like the function simply cannot handle large strings. I am not sure of this offhand, but it might be that it can only handle strings that can also be used in the SQL SUBSTRING() function (which craps out after 8,000 characters I think). Interesting.
Most string functions in SQL Server will have an 8,000 byte limit.
If you think about it, this makes sense based on the way SQL Server stores data. A row in a table has an 8060 byte limit, a char or varchar field has an 8,000 character limit, an nChar or nVarChar has a 4,000 character limit ( these are double byte fields so 4k characters = 8k bytes).
Once you get over 8k bytes you are into a TEXT data type which is stored differently from the rest of the data in the table. At this point your best bet for searching is to use full-text indexing and one of the following:
I smell another series of posts coming as you get into these explorations!
It sounds like it craps out on the 8k character limitation for the datatype you used (TEXT). Since you set up the in memory table to use TEXT, you probably couldn't buffer more than that into the column anyway, so maybe it is truncating without throwing an error; I know if you try to do an insert iinto a physical table column that exceeds the datatype length, it does throw an error though. Anyways...
This link indicates that if you are on SS2k5 you can use VARCHAR(MAX) and NVARCHAR(MAX) to get past the 4k and 8k limit (storing up to 2Gb). It mentions that the TEXT and NTEXT datatypes are still around for backward compatibility.
To be totally honest, Text Searching is and has always been my biggest Achilles Heel (I have many of them ;)). I have just never been good at text searching. Never done any full text indexing. I basically know the LIKE clause and that's it... and as a result (between you and me) my search functionality always sucks.
I definitely have to learn more about full text indexing. Spencer (a co-worker) did a presentation on it here at the office, but unfortunately, I had to go to the dentist that day (damn teeth!!)
So, yes, hopefully there will be some posting on this :)
That being said, I think it's odd that the String functions would have character limits. I understand what you are saying about the size of the row of data and how there is a max that way, but I don't see how that would directly affect the algorithm of a string function (unless they had some artificial cap in the actual logic of the function internals). I mean, after all, doesn't the CHARINDEX() not even require a database column? Can't it work off of two explicit strings:
CHARINDEX( 'dog', 'Snoop Doggy Dog' ) AS test
Here, we are not even touching a database table. I don't like this whole character limit thing... it feels too arbitrary.
I just tested it:
... ( PATINDEX( '%sexy%', value ) ) AS patindex,
... ( CHARINDEX( 'sexy', value ) ) AS charindex
to see if it crapped out (truncated without throwing an error), but it does not. The "value" returned contains the entire string and the PATINDEX() calls still work (return the very high index offset).
Here, we are still on SQL Server 2000, but that is a good tip to know about the VARCHAR(MAX) thing. When we upgrade, I will certainly pass that around.
@Ben - Thanks for turning me on to CHARINDEX() and its apparent limitations. In my case I think I could have used CHARINDEX() as the column I was searching was a VARCHAR column and I didn't really need to use wildcards. Now I know...
I only learned about CHARINDEX() when looking up PATINDEX() which you turned me on to... if there is anyone you should be thanking, sir, it is yourself :)
Also, I know you are using 2000, but according to SQL Server 2005 Books Online (http://msdn2.microsoft.com/en-us/library/ms186323.aspx): "CHARINDEX cannot be used with text, ntext, and image data types."
Good to know. I am gonna start some upgrading discussions at this week's staff meeting. We have been on 2000 for a long long time. It looks like the newer DBs are just faster and more powerful. It's time to live in the now. And, according to the some comments above, it would seem that TEXT datatypes are going to be kept only for backwards compatibility; they are replaced with VARCHAR(MAX).
Right, TEXT, NTEXT and IMAGE datatypes are not stored like other datatypes. Basically, the column contains a pointer to where the actual values are stored, which might also have another pointer that chains to another location where more of the value is stored. Since it is variable length and SQL doesn't really know how big the value will/can be (2Gb possible), the value can span lots of physical/logical blocks on the disk (you rarely get up to 2Gb of contiguous space on the drive). Actually, I only reason I was interested in this because I finally found a reason for the brain cells I wasted learning about physical storage 30 years ago ;-)
Anyway, in order to do anything like searches on TEXT et al, you have to do a bunch of other stuff to get pointers and all the data (pita). That's likely a big reason why MS has the other search indexing capabilities. I pretty sure I have Verity set up to do my searches on those datatypes, though.
Fwiw, you might notice in SQL Analyzer that a TEXT datatype shows it's value something like:
This column has data in...
It shows a little bit of the value followed by the ellipses.
@Ben - I don't know how important this is to your company, but one thing you may want to bring up in your upgrade discussions is that mainstream support for SQL Server 2000 ends on 4/8/2008.