Ask Ben: Selecting A Random Row From A Weighted, Filtered Record Set

Posted January 27, 2009 at 9:56 AM

Tags: ColdFusion, SQL, Ask Ben

This is the scenario to help explain my question so I hope it makes sense: On a form is a zip code search box where you enter a zip code to see the list of services or shops in your area (e.g. 90210). The query returns 10 businesses in that area. Now, the tricky part is that some of the business should appear at the top (or near the top) more often than not but all 10 results should always show up since certain clients paid more to be a part of the listing service (just like the sponsored links on Google/Yahoo, etc). I know Google Adwords does this but I don't know how they do it. Can anyone help me figure this out? I have found to be more challenging than I thought it would be so I thought I'd bug you. :) I have MS SQL 2000 and MS Access to play around with. I'm pretty sure this will just be getting a query written right but maybe there is more to it than that?

Selecting from a weighted set of values is not a complex thing in the abstract. In fact, if we are dealing with a static set of data, it's actually pretty easy and there a number of user defined functions that will help you with this problem. However, we're talking about selecting from a weighted datatable. This becomes much more complicated for several reasons:

The SQL is executed in the database, not the ColdFusion server, so we no longer have access to lots of great functions and procedural logic that we would have in a straight-up ColdFusion solution.

The result set can / must be filtered based on other business logic (ex. zip code) before the weight of the records can even have a contextual meaning.

This definitely throws a wrench in the works. I've been thinking about this problem ever since you asked it and I couldn't come up with a good solution until I was riding up in the elevator literally 15 minutes ago. I had a sort of eureka moment: Pivot Tables (thank you Rick Osborne, you magnificent bastard!).

A pivot table is a database table that has nothing but an ID column starting at 1 (one) and incrementing once per row for a finite number of rows. So, for example, you might have a table named pivot100 and that 100 rows with ID values 1 through 100. These pivot tables can be used to help enumerate values that otherwise would require special T-SQL.

We can use a pivot table to take our weighted records and, in essence, flesh them out into a larger, non-weighted record set. By that, I mean that we can convert a weight (ex. 50) into the equivalent number of non-weighted records: 50 records of no particular weight. This is hard to conceptualize, so let's look at a quick example.

Before I get into the example, however, I need to build up a database to be used in our examples. I am going to store my SQL in a ColdFusion variable to be simply executed at the top of each query:

 Launch code in new window » Download code as text file »

  • <!---
  • Define our boiler-plate data for the girl table. We are
  • defining this so that we can run later queries without
  • too much distraction. In reality, this table would be a
  • static table in our database.
  • --->
  • <cfsavecontent variable="strSQL">
  •  
  • <!---
  • Declare out temp table. Remember, this is going to be
  • used in lieu of a real database table.
  • --->
  • DECLARE @girl TABLE (
  • id INT IDENTITY( 1, 1 ),
  • name VARCHAR( 50 ),
  • weight INT
  • );
  •  
  • <!---
  • Populate our temp girl table with different names and
  • weights (not physical weight, the business weight -
  • girls of all different physical weights are beautiful).
  • --->
  • INSERT INTO @girl
  • (
  • name,
  • weight
  • )(
  • SELECT 'Sarah', 100 UNION ALL
  • SELECT 'Libby', 30 UNION ALL
  • SELECT 'Lisa', 30 UNION ALL
  • SELECT 'Molly', 250 UNION ALL
  • SELECT 'Kit', 50
  • );
  •  
  • </cfsavecontent>

Here, we are building up a @girl table with girl names and different weights (business weights, not physical weights). NOTE: You won't have to deal with temp table; I am doing this only because I don't have physical tables to work with.

Ok, now that we have that, let's see what happens when we join our @girl table to a pivot table. When we join the two tables together, we are going to match rows until the ID column of the pivot table is larger than the weight of the given @girl:

 Launch code in new window » Download code as text file »

  • <!--- Join the girl to the pivot table to flatten records. --->
  • <cfquery name="qGirl" datasource="#REQUEST.DSN#">
  • <!--- Build tables (to mimic database). --->
  • #PreserveSingleQuotes( strSQL )#
  •  
  • SELECT
  • g.id,
  • g.name,
  • g.weight,
  •  
  • <!--- Get pivot ID for debugging. --->
  • ( p.id ) AS pivot_id
  • FROM
  • @girl g
  • INNER JOIN
  • pivot1000 p
  • ON
  • g.weight >= p.id
  • </cfquery>
  •  
  • <!--- Output records. --->
  • <cfdump
  • var="#qGirl#"
  • label="Girl (x) Pivot"
  • />

When we output the result of this cross product, we get something that looks like this (I have shortened the output for ease of reading):

 
 
 
 
 
 
Girl Joined To Pivot Table. 
 
 
 

Notice that when we join the @girl table to the pivot1000 table, we have flattened the records. Now, rather than having a record with a weight of 100 (Sarah), we have 100 Sarah records (whose individual weights are no longer of consequence).

Do you see where this is going? Once we have this flattened record set, we simply need to select a random row from it. Assuming that the row selection is random, the weights of the original records will still apply since the heavier rows will have more records in our flattened table and therefore will be more likely to be picked by random selection.

When it comes to selecting a random record from a database table, the easiest method is to select the top row of a record set that has been sorted using MS SQL's NEWID(). The NEWID() method will return a new UUID (universally unique ID) for each invocation, giving each record in the table a temporary (not returned in actual record set) random UUID.

As you can see in the output above, the cross product with our pivot1000 table results in a rather large intermediary table (the record count is equal to the sum of all weights in the target table). Therefore, we really want to limit the amount of data that ends up in our intermediary table. As such, I would recommend getting only a random ID using the pivot table and NEWID() method and then using this randomly selected ID to further select the full record from our target table. This way, we don't need to use the full column data of our target table until we have already limited the recordset down to one row.

 Launch code in new window » Download code as text file »

  • <!--- Select random, weighted girl. --->
  • <cfquery name="qGirl" datasource="#REQUEST.DSN#">
  • <!--- Build tables (to mimic database). --->
  • #PreserveSingleQuotes( strSQL )#
  •  
  • <!--- Pick random, weighted record. --->
  • SELECT
  • g.id,
  • g.name,
  • g.weight
  • FROM
  • @girl g
  • INNER JOIN
  • (
  •  
  • <!---
  • In this inner query, we need to select a random,
  • weighted ID. We are doing this in the inner query
  • rather than in the outter query so that our
  • intermediary table doesn't need to contain so
  • much information (just the ID).
  • --->
  • SELECT TOP 1
  • g.id
  • FROM
  • @girl g
  • INNER JOIN
  • pivot1000 p
  • ON
  • (
  • <!--- Use the weights. --->
  • g.weight >= p.id
  •  
  • <!---
  • Use any additional filtering that is
  • required by the business logic of the
  • query criteria.
  • --->
  • AND
  • g.name != 'Lisa'
  • )
  • ORDER BY
  • <!--- Select random row. --->
  • NEWID() ASC
  •  
  • ) AS temp_id
  • ON
  • g.id = temp_id.id
  • </cfquery>
  •  
  • <!--- Output random row. --->
  • <cfdump
  • var="#qGirl#"
  • label="Random Girl"
  • />

By selecting the random ID as a sub-query, we don't need to move around full column data until we are pulling down the one girl record. In the sub-query, I am filtering the target table (excluding the girl named Lisa) merely to demonstrate where additional business logic filtering would go.

When we run the above code we get this output (which would be different each time):

 
 
 
 
 
 
Random Girl Selected From Filtered, Weighted Record Set. 
 
 
 

OK, you say - I see that one girl. But, does this really select random girls by weight?

Let's put it to the test: let's run this a bunch of times and see what we get back:

 Launch code in new window » Download code as text file »

  • <!--- Run the test many times. --->
  • <cfloop
  • index="intTestIndex"
  • from="1"
  • to="30"
  • step="1">
  •  
  •  
  • <!--- Select random, weighted girl. --->
  • <cfquery name="qGirl" datasource="#REQUEST.DSN#">
  • <!--- Build tables (to mimic database). --->
  • #PreserveSingleQuotes( strSQL )#
  •  
  • <!--- Pick random, weighted record. --->
  • SELECT
  • g.id,
  • g.name,
  • g.weight
  • FROM
  • @girl g
  • INNER JOIN
  • (
  •  
  • <!--- Select random ID. --->
  • SELECT TOP 1
  • g.id
  • FROM
  • @girl g
  • INNER JOIN
  • pivot1000 p
  • ON
  • <!--- Use the weights. --->
  • g.weight >= p.id
  • ORDER BY
  • <!--- Select random row. --->
  • NEWID() ASC
  •  
  • ) AS temp_id
  • ON
  • g.id = temp_id.id
  • </cfquery>
  •  
  •  
  • <!--- Output randomly selected girl. --->
  • #intTestIndex#) #qGirl.name#<br />
  •  
  • </cfloop>

When we run this (simplified query) 30 times, we get the following output:

1) Molly
2) Molly
3) Libby
4) Sarah
5) Molly
6) Libby
7) Libby
8) Molly
9) Sarah
10) Sarah
11) Kit
12) Molly
13) Kit
14) Lisa
15) Molly
16) Sarah
17) Molly
18) Molly
19) Lisa
20) Molly
21) Molly
22) Lisa
23) Sarah
24) Lisa
25) Lisa
26) Molly
27) Molly
28) Molly
29) Sarah
30) Molly

In this set, we get the following tally:

Sarah: 6
Libby: 3
Lisa: 5
Molly: 14
Kit: 2

If you look back at the original weights of the girls, you will see that this is more or less accurate. Molly, who had the largest weight is selected by far the most times. Sarah, who has the next highest weight and just under half that of Molly is selected just less than half the number of times Molly was selected. As we get down into smaller weights, the selection is less accurate; however, over a larger number of selections (N), I am sure you'll find that the statistics become much more accurate.

So, what was at first a seemingly complicated problem is actually something that we were able to boil down into a single query. The syntax of this will change for each database engine, but I believe that this should be possible in all professional database management systems. I hope that this has given you some good inspiration.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Permalink  |  Other Searches  |  Print Page




Learning ColdFusion 9 - ColdFusion 9 tutorials, samples, examples, demos

Reader Comments

Jan 27, 2009 at 10:44 AM // reply »
32 Comments

An interesting approach. I just did some searches to see what others would do, and found this post. Basically take random number (0-1) multiply it by the weight and sort by that.

http://www.improve.dk/blog/2007/11/25/weighted-random-selections-in-sql-server


Jan 27, 2009 at 10:48 AM // reply »
9 Comments

Great info. I could have used this in the past!

Why would you want to select the women with the most weight first? :P


Jan 27, 2009 at 10:52 AM // reply »
6,516 Comments

@Josh,

Oooh, clever. I have seen many examples of people trying to use RAND() but as it only fires once per query it is usually ineffective. I've never actually seen an example (that I can recall) where someone used a NEWID() to seed RAND(). Very cool!


Jan 27, 2009 at 11:41 AM // reply »
55 Comments

@Ben,

I think your solution for attacking this problem is pretty slick. However, when I re-read the submitted question, a few questions of my own came up and thought I'd just throw a my thoughts into the mix.

If I understand the question correctly, all businesses "pay" to be listed. Some simply pay more than others ... so I'm assuming there's some sort of "sponsorship/subscription levels" in place. Just for simplicity, let's go with the ever-popular "Bronze, Silver and Gold." Where Bronze.Price < Silver.Price < Gold.Price. There could be many more, but I'm attempting to keep this simple.

I have to make an assumption that the "levels" are the "weight" by which I want to order my listings by. For example, Gold sponsors would be at the top of the list, then Silver, and lastly Bronze.

So, if I were building the database, I would most likely have a Sponsors/Businesses table ... so let's call it tblSponsors. In addition, I would probably have a table for the various sponsor levels ... let's call it tblLevels. Now, tblLevels could have a field called "LevelWeight." Let's say that Gold.LevelWeigth = 1, Silver.LevelWeight = 2 and Bronze.LevelWeight = 3.

Now, a simple query should bring up the desired results. For example,

SELECT SponsorName, SponsorLink, LevelWeight
FROM tblSponsors
INNER JOIN tblLevels ON tblSponsors.LevelID = tblLevels.LevelID
WHERE {whatever the selection criteria may be}
ORDER BY LevelWeight

This may be over-simplistic, but sometimes, that's how I like to approach things ... even though for every complicated question, there is a simple, elegant and "wrong" answer.


Jan 27, 2009 at 11:44 AM // reply »
55 Comments

Oh, and by the way ... yes, I understand my solution isn't taking into account "Randomness" ... but I didn't see where that was a requirement from the person that submitted the original question.


Jan 27, 2009 at 11:46 AM // reply »
6,516 Comments

@Steve,

It all depends what you are going for. You could even integrate the Sponsor levels table into the random, weighted query as well. Just depends on what you need.


Tom
Jan 27, 2009 at 12:06 PM // reply »
12 Comments

@Josh

I, too, had tried the rand() function but came up with effectively an ineffective solution. The newid() is a great touch. I was wondering if randrange(1,10) would work but I haven't tried it yet. Thanks for the link, you approach is short and simple.

@Ben,
Your details were excellent and I actually now have an understanding of pivot tables whereas before, I had none. What a very cleaver approach to solving the issue I had. You mentioned with a larger data set the results would be more accurate. With the scenario I presented, I'm only dealing with about 10 records for a given zip code so would the sort still be random enough (enough may be relative though)? I will try both approaches and see which one satisfies the requirements best. Thanks for your work, Ben, and I'm glad I challened you. :)

@Steve,
The sponsorship is the right idea, but to be fair to all, someone who pays the least should still have a "chance" at being the first on the list of results. All the companies will be listed for the zip code, it just depends on placement. I can't do by name or people will start naming their companies with a bunch of AAAAAA just to be listed first. :)


Jan 27, 2009 at 12:09 PM // reply »
6,516 Comments

@DLackey,

When I referred to (N) and the relationship to accuracy, I was not referring to the number of records from which you were selecting - I was referring to the number of times you actually make a random selection. For example, 100 random selections over an hour will be less "statistically accurate" than 2,400 random selections made of the course of an entire day.

The number of records to choose from should be irrelevant.

I'm glad that you are liking pivot tables more. They can be a super awesome tool, especially when working with dates.


Tom
Jan 27, 2009 at 12:17 PM // reply »
12 Comments

@Ben,

Regarding the (N) and the relationship to accuracy, thanks for clarifying.


Jan 27, 2009 at 12:19 PM // reply »
6,516 Comments

@DLackey,

Any time my man.


Jim
Jan 27, 2009 at 1:19 PM // reply »
16 Comments

Good post Ben. It's a use for Pivot tables that I haven't thought of.

Just to add to Ben's discussion, if you're running SQL 2005 or higher, you can use the built-in pivot function to by-pass some of the temp table logic.

You can check out the function and some examples here:

http://technet.microsoft.com/en-us/library/ms177410(SQL.90).aspx

-Jimmy


Jan 28, 2009 at 12:13 PM // reply »
31 Comments

Nice technical solutions.

What I've always found interesting for these things is more the business logic.

When a site is new, undergoing growth, and has small number of advertisers, the number of views each advertiser get is also growing.

Eventually user growth slows or plateaus (there are only so many people interested in requesting these location specific queries)

At that point if the number of advertisers increases, the number of views of their advertisements will decrease no matter what their ranking is.

This has hurt other small sites with this advertising model because the advertisers see decreasing returns on their investment as their number of views decrease.

Massive sites like Google can statistically bury these changes, and have also compensated for this by getting us to put their client ads in our sites, providing additional advertising slots beyond their own site to simulate user growth


Jan 28, 2009 at 12:19 PM // reply »
6,516 Comments

@Steve,

That's an interesting thought. I guess that's why a lot of sites have time-based viewing rather than weighted viewing. By that I mean, you pay X dollars to have your site visible ALWAYS for a given period of time. That way, you are not competing with other advertisers - you are simply paying for real estate.


Jan 28, 2009 at 1:03 PM // reply »
55 Comments

@DLackey,

To be honest, I'm having trouble with the "randomness" philosophy as applied here. If I paid for a "Gold" sponsorship level, then I would be pretty upset if someone else only paid for "Bronze" and rose above me. You used Google AdWords as an example in your original question ... well isn't that based on bidding ... the best spot for the highest bidder?


Tom
Jan 28, 2009 at 1:50 PM // reply »
12 Comments

It looks like this post is about to go off in a tangent so I'll provide a little bit more insight to keep the focus on "how to display results with some weighted randomness." When I originally posted the question and since I was referencing Google's Adwords, I thought the question could be made clearer if I kept the reference to money instead of making it over complicated by adding in the other factors (e.g. membership duration, participation in local charity events, and attendance in the monthly meetings) requiring the weighted randomness. The site is for a small networking group who all cater to the same service industry but limited coverage area (zip codes). Everyone in the networking group actually paid the same membership dues (to cover the cost of hosting of the website). The reason behind wanting the weighted randomness is because some members A) have been in the organization longer and B) participate more in the community by attending activities such as charity events and meetings while other members do the bare minimum to satisfy the membership requirements.

Hope that helps. :)


Jan 28, 2009 at 2:00 PM // reply »
6,516 Comments

@DLackey,

I've worked on add systems before that have weighted randomness based on how much they pay. I think that's legitimate. I think if you start to think in terms of classifications like Gold, Bronze, you forget that we're really just talking about monetary amounts.


Jan 28, 2009 at 4:28 PM // reply »
55 Comments

@DLackey,

Ahh, yes. I see now. So you might possibly be giving people "points" for each time they participate, etc. Okay.

Then actually, the link that Joshua provided is probably the most simple solution to your problem. Assuming you _are_ keeping some type of "point system" in place and updating the "Sponsor Points":

SELECT SponsorName, SponsorLink, Points, RAND(CAST(NEWID() AS VARBINARY)) * Points AS Weight
FROM tblSponsors
ORDER BY Weight DESC


Post Comment  |  Ask Ben

Recent Blog Comments
Nov 20, 2009 at 11:32 PM
Five Months Without Hungarian Notation And I'm Loving It
I've used headless camel case for years for not only ColdFusion variables, but also SQL tables and fields... pretty much everything involving code. I also subscribe to the "don't abbreviate and clea ... read »
Nov 20, 2009 at 11:00 PM
Five Months Without Hungarian Notation And I'm Loving It
@Marcel, Yeah, I always err on the side of longer but more readable variable names. As for the camel casing of CF methods and the headless camel casing of custom items, I get around this by always ... read »
Nov 20, 2009 at 10:56 PM
Five Months Without Hungarian Notation And I'm Loving It
I use the following and love it: my.namespace.MyComponents.functionMethodsOrUDF() CONSTANT_VALUES_OR_PROPERTIES One thing I always try is to CamelCaseBuiltInColdFusionFunctions() so others can tell ... read »
Nov 20, 2009 at 5:38 PM
Learning ColdFusion 8: CFImage Part I - Reading And Writing Images
Hi Ben, Great article. I've been looking around to see if ColdFusion image engine can programatically create the following "wrap around" effect: http://www.creativepro.com/article/photoshop-s-she ... read »
Nov 20, 2009 at 5:35 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Dave: I talked to Gert he suggested: <cfhttp method="get" url="http://{some cf website}" result="stuff" addtoken="yes" /> Note the addition of cfhttp attribute addtoken. That should persist y ... read »
Nov 20, 2009 at 5:23 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
@Todd, Ahh, gotcha, yeah that makes sense. ... read »
Nov 20, 2009 at 5:17 PM
Maintaining ColdFusion Sessions Across SMS Text Message Requests Without Cookies
Ben, sorry if I didn't make this clear. You can make it work like that if you want, just put <cfset session.foo = 1> (and <cfset application.foo = 1>) in your OnRequestStart() and it reve ... read »