Ask Ben: Selecting A Random Row From A Weighted, Filtered Record Set
This is the scenario to help explain my question so I hope it makes sense: On a form is a zip code search box where you enter a zip code to see the list of services or shops in your area (e.g. 90210). The query returns 10 businesses in that area. Now, the tricky part is that some of the business should appear at the top (or near the top) more often than not but all 10 results should always show up since certain clients paid more to be a part of the listing service (just like the sponsored links on Google/Yahoo, etc). I know Google Adwords does this but I don't know how they do it. Can anyone help me figure this out? I have found to be more challenging than I thought it would be so I thought I'd bug you. :) I have MS SQL 2000 and MS Access to play around with. I'm pretty sure this will just be getting a query written right but maybe there is more to it than that?
Selecting from a weighted set of values is not a complex thing in the abstract. In fact, if we are dealing with a static set of data, it's actually pretty easy and there a number of user defined functions that will help you with this problem. However, we're talking about selecting from a weighted datatable. This becomes much more complicated for several reasons:
The SQL is executed in the database, not the ColdFusion server, so we no longer have access to lots of great functions and procedural logic that we would have in a straightup ColdFusion solution.
The result set can / must be filtered based on other business logic (ex. zip code) before the weight of the records can even have a contextual meaning.
This definitely throws a wrench in the works. I've been thinking about this problem ever since you asked it and I couldn't come up with a good solution until I was riding up in the elevator literally 15 minutes ago. I had a sort of eureka moment: Pivot Tables (thank you Rick Osborne, you magnificent bastard!).
A pivot table is a database table that has nothing but an ID column starting at 1 (one) and incrementing once per row for a finite number of rows. So, for example, you might have a table named pivot100 and that 100 rows with ID values 1 through 100. These pivot tables can be used to help enumerate values that otherwise would require special TSQL.
We can use a pivot table to take our weighted records and, in essence, flesh them out into a larger, nonweighted record set. By that, I mean that we can convert a weight (ex. 50) into the equivalent number of nonweighted records: 50 records of no particular weight. This is hard to conceptualize, so let's look at a quick example.
Before I get into the example, however, I need to build up a database to be used in our examples. I am going to store my SQL in a ColdFusion variable to be simply executed at the top of each query:
 <!
 Define our boilerplate data for the girl table. We are
 defining this so that we can run later queries without
 too much distraction. In reality, this table would be a
 static table in our database.
 >
 <cfsavecontent variable="strSQL">

 <!
 Declare out temp table. Remember, this is going to be
 used in lieu of a real database table.
 >
 DECLARE @girl TABLE (
 id INT IDENTITY( 1, 1 ),
 name VARCHAR( 50 ),
 weight INT
 );

 <!
 Populate our temp girl table with different names and
 weights (not physical weight, the business weight 
 girls of all different physical weights are beautiful).
 >
 INSERT INTO @girl
 (
 name,
 weight
 )(
 SELECT 'Sarah', 100 UNION ALL
 SELECT 'Libby', 30 UNION ALL
 SELECT 'Lisa', 30 UNION ALL
 SELECT 'Molly', 250 UNION ALL
 SELECT 'Kit', 50
 );

 </cfsavecontent>
Here, we are building up a @girl table with girl names and different weights (business weights, not physical weights). NOTE: You won't have to deal with temp table; I am doing this only because I don't have physical tables to work with.
Ok, now that we have that, let's see what happens when we join our @girl table to a pivot table. When we join the two tables together, we are going to match rows until the ID column of the pivot table is larger than the weight of the given @girl:
 <! Join the girl to the pivot table to flatten records. >
 <cfquery name="qGirl" datasource="#REQUEST.DSN#">
 <! Build tables (to mimic database). >
 #PreserveSingleQuotes( strSQL )#

 SELECT
 g.id,
 g.name,
 g.weight,

 <! Get pivot ID for debugging. >
 ( p.id ) AS pivot_id
 FROM
 @girl g
 INNER JOIN
 pivot1000 p
 ON
 g.weight >= p.id
 </cfquery>

 <! Output records. >
 <cfdump
 var="#qGirl#"
 label="Girl (x) Pivot"
 />
When we output the result of this cross product, we get something that looks like this (I have shortened the output for ease of reading):
    
  
  
Notice that when we join the @girl table to the pivot1000 table, we have flattened the records. Now, rather than having a record with a weight of 100 (Sarah), we have 100 Sarah records (whose individual weights are no longer of consequence).
Do you see where this is going? Once we have this flattened record set, we simply need to select a random row from it. Assuming that the row selection is random, the weights of the original records will still apply since the heavier rows will have more records in our flattened table and therefore will be more likely to be picked by random selection.
When it comes to selecting a random record from a database table, the easiest method is to select the top row of a record set that has been sorted using MS SQL's NEWID(). The NEWID() method will return a new UUID (universally unique ID) for each invocation, giving each record in the table a temporary (not returned in actual record set) random UUID.
As you can see in the output above, the cross product with our pivot1000 table results in a rather large intermediary table (the record count is equal to the sum of all weights in the target table). Therefore, we really want to limit the amount of data that ends up in our intermediary table. As such, I would recommend getting only a random ID using the pivot table and NEWID() method and then using this randomly selected ID to further select the full record from our target table. This way, we don't need to use the full column data of our target table until we have already limited the recordset down to one row.
 <! Select random, weighted girl. >
 <cfquery name="qGirl" datasource="#REQUEST.DSN#">
 <! Build tables (to mimic database). >
 #PreserveSingleQuotes( strSQL )#

 <! Pick random, weighted record. >
 SELECT
 g.id,
 g.name,
 g.weight
 FROM
 @girl g
 INNER JOIN
 (

 <!
 In this inner query, we need to select a random,
 weighted ID. We are doing this in the inner query
 rather than in the outter query so that our
 intermediary table doesn't need to contain so
 much information (just the ID).
 >
 SELECT TOP 1
 g.id
 FROM
 @girl g
 INNER JOIN
 pivot1000 p
 ON
 (
 <! Use the weights. >
 g.weight >= p.id

 <!
 Use any additional filtering that is
 required by the business logic of the
 query criteria.
 >
 AND
 g.name != 'Lisa'
 )
 ORDER BY
 <! Select random row. >
 NEWID() ASC

 ) AS temp_id
 ON
 g.id = temp_id.id
 </cfquery>

 <! Output random row. >
 <cfdump
 var="#qGirl#"
 label="Random Girl"
 />
By selecting the random ID as a subquery, we don't need to move around full column data until we are pulling down the one girl record. In the subquery, I am filtering the target table (excluding the girl named Lisa) merely to demonstrate where additional business logic filtering would go.
When we run the above code we get this output (which would be different each time):
    
  
  
OK, you say  I see that one girl. But, does this really select random girls by weight?
Let's put it to the test: let's run this a bunch of times and see what we get back:
 <! Run the test many times. >
 <cfloop
 index="intTestIndex"
 from="1"
 to="30"
 step="1">


 <! Select random, weighted girl. >
 <cfquery name="qGirl" datasource="#REQUEST.DSN#">
 <! Build tables (to mimic database). >
 #PreserveSingleQuotes( strSQL )#

 <! Pick random, weighted record. >
 SELECT
 g.id,
 g.name,
 g.weight
 FROM
 @girl g
 INNER JOIN
 (

 <! Select random ID. >
 SELECT TOP 1
 g.id
 FROM
 @girl g
 INNER JOIN
 pivot1000 p
 ON
 <! Use the weights. >
 g.weight >= p.id
 ORDER BY
 <! Select random row. >
 NEWID() ASC

 ) AS temp_id
 ON
 g.id = temp_id.id
 </cfquery>


 <! Output randomly selected girl. >
 #intTestIndex#) #qGirl.name#<br />

 </cfloop>
When we run this (simplified query) 30 times, we get the following output:
1) Molly
2) Molly
3) Libby
4) Sarah
5) Molly
6) Libby
7) Libby
8) Molly
9) Sarah
10) Sarah
11) Kit
12) Molly
13) Kit
14) Lisa
15) Molly
16) Sarah
17) Molly
18) Molly
19) Lisa
20) Molly
21) Molly
22) Lisa
23) Sarah
24) Lisa
25) Lisa
26) Molly
27) Molly
28) Molly
29) Sarah
30) Molly
In this set, we get the following tally:
Sarah: 6
Libby: 3
Lisa: 5
Molly: 14
Kit: 2
If you look back at the original weights of the girls, you will see that this is more or less accurate. Molly, who had the largest weight is selected by far the most times. Sarah, who has the next highest weight and just under half that of Molly is selected just less than half the number of times Molly was selected. As we get down into smaller weights, the selection is less accurate; however, over a larger number of selections (N), I am sure you'll find that the statistics become much more accurate.
So, what was at first a seemingly complicated problem is actually something that we were able to boil down into a single query. The syntax of this will change for each database engine, but I believe that this should be possible in all professional database management systems. I hope that this has given you some good inspiration.
Looking For A New Job?
 1 Senior Javascript Developer + 1 Javascript intern (paid) at Smartling
 IT Manager  ColdFusion/SQL web application developer at ARELLO
 DevOps Engineer at Linode
 Software Engineer at Linode
 Web Applications Developer (Remote / Telecommute) at SiteVision, Inc.
Reader Comments
An interesting approach. I just did some searches to see what others would do, and found this post. Basically take random number (01) multiply it by the weight and sort by that.
http://www.improve.dk/blog/2007/11/25/weightedrandomselectionsinsqlserver
Great info. I could have used this in the past!
Why would you want to select the women with the most weight first? :P
@Josh,
Oooh, clever. I have seen many examples of people trying to use RAND() but as it only fires once per query it is usually ineffective. I've never actually seen an example (that I can recall) where someone used a NEWID() to seed RAND(). Very cool!
@Ben,
I think your solution for attacking this problem is pretty slick. However, when I reread the submitted question, a few questions of my own came up and thought I'd just throw a my thoughts into the mix.
If I understand the question correctly, all businesses "pay" to be listed. Some simply pay more than others ... so I'm assuming there's some sort of "sponsorship/subscription levels" in place. Just for simplicity, let's go with the everpopular "Bronze, Silver and Gold." Where Bronze.Price < Silver.Price < Gold.Price. There could be many more, but I'm attempting to keep this simple.
I have to make an assumption that the "levels" are the "weight" by which I want to order my listings by. For example, Gold sponsors would be at the top of the list, then Silver, and lastly Bronze.
So, if I were building the database, I would most likely have a Sponsors/Businesses table ... so let's call it tblSponsors. In addition, I would probably have a table for the various sponsor levels ... let's call it tblLevels. Now, tblLevels could have a field called "LevelWeight." Let's say that Gold.LevelWeigth = 1, Silver.LevelWeight = 2 and Bronze.LevelWeight = 3.
Now, a simple query should bring up the desired results. For example,
SELECT SponsorName, SponsorLink, LevelWeight
FROM tblSponsors
INNER JOIN tblLevels ON tblSponsors.LevelID = tblLevels.LevelID
WHERE {whatever the selection criteria may be}
ORDER BY LevelWeight
This may be oversimplistic, but sometimes, that's how I like to approach things ... even though for every complicated question, there is a simple, elegant and "wrong" answer.
Oh, and by the way ... yes, I understand my solution isn't taking into account "Randomness" ... but I didn't see where that was a requirement from the person that submitted the original question.
@Steve,
It all depends what you are going for. You could even integrate the Sponsor levels table into the random, weighted query as well. Just depends on what you need.
@Josh
I, too, had tried the rand() function but came up with effectively an ineffective solution. The newid() is a great touch. I was wondering if randrange(1,10) would work but I haven't tried it yet. Thanks for the link, you approach is short and simple.
@Ben,
Your details were excellent and I actually now have an understanding of pivot tables whereas before, I had none. What a very cleaver approach to solving the issue I had. You mentioned with a larger data set the results would be more accurate. With the scenario I presented, I'm only dealing with about 10 records for a given zip code so would the sort still be random enough (enough may be relative though)? I will try both approaches and see which one satisfies the requirements best. Thanks for your work, Ben, and I'm glad I challened you. :)
@Steve,
The sponsorship is the right idea, but to be fair to all, someone who pays the least should still have a "chance" at being the first on the list of results. All the companies will be listed for the zip code, it just depends on placement. I can't do by name or people will start naming their companies with a bunch of AAAAAA just to be listed first. :)
@DLackey,
When I referred to (N) and the relationship to accuracy, I was not referring to the number of records from which you were selecting  I was referring to the number of times you actually make a random selection. For example, 100 random selections over an hour will be less "statistically accurate" than 2,400 random selections made of the course of an entire day.
The number of records to choose from should be irrelevant.
I'm glad that you are liking pivot tables more. They can be a super awesome tool, especially when working with dates.
@Ben,
Regarding the (N) and the relationship to accuracy, thanks for clarifying.
@DLackey,
Any time my man.
Good post Ben. It's a use for Pivot tables that I haven't thought of.
Just to add to Ben's discussion, if you're running SQL 2005 or higher, you can use the builtin pivot function to bypass some of the temp table logic.
You can check out the function and some examples here:
http://technet.microsoft.com/enus/library/ms177410(SQL.90).aspx
Jimmy
Nice technical solutions.
What I've always found interesting for these things is more the business logic.
When a site is new, undergoing growth, and has small number of advertisers, the number of views each advertiser get is also growing.
Eventually user growth slows or plateaus (there are only so many people interested in requesting these location specific queries)
At that point if the number of advertisers increases, the number of views of their advertisements will decrease no matter what their ranking is.
This has hurt other small sites with this advertising model because the advertisers see decreasing returns on their investment as their number of views decrease.
Massive sites like Google can statistically bury these changes, and have also compensated for this by getting us to put their client ads in our sites, providing additional advertising slots beyond their own site to simulate user growth
@Steve,
That's an interesting thought. I guess that's why a lot of sites have timebased viewing rather than weighted viewing. By that I mean, you pay X dollars to have your site visible ALWAYS for a given period of time. That way, you are not competing with other advertisers  you are simply paying for real estate.
@DLackey,
To be honest, I'm having trouble with the "randomness" philosophy as applied here. If I paid for a "Gold" sponsorship level, then I would be pretty upset if someone else only paid for "Bronze" and rose above me. You used Google AdWords as an example in your original question ... well isn't that based on bidding ... the best spot for the highest bidder?
It looks like this post is about to go off in a tangent so I'll provide a little bit more insight to keep the focus on "how to display results with some weighted randomness." When I originally posted the question and since I was referencing Google's Adwords, I thought the question could be made clearer if I kept the reference to money instead of making it over complicated by adding in the other factors (e.g. membership duration, participation in local charity events, and attendance in the monthly meetings) requiring the weighted randomness. The site is for a small networking group who all cater to the same service industry but limited coverage area (zip codes). Everyone in the networking group actually paid the same membership dues (to cover the cost of hosting of the website). The reason behind wanting the weighted randomness is because some members A) have been in the organization longer and B) participate more in the community by attending activities such as charity events and meetings while other members do the bare minimum to satisfy the membership requirements.
Hope that helps. :)
@DLackey,
I've worked on add systems before that have weighted randomness based on how much they pay. I think that's legitimate. I think if you start to think in terms of classifications like Gold, Bronze, you forget that we're really just talking about monetary amounts.
@DLackey,
Ahh, yes. I see now. So you might possibly be giving people "points" for each time they participate, etc. Okay.
Then actually, the link that Joshua provided is probably the most simple solution to your problem. Assuming you _are_ keeping some type of "point system" in place and updating the "Sponsor Points":
SELECT SponsorName, SponsorLink, Points, RAND(CAST(NEWID() AS VARBINARY)) * Points AS Weight
FROM tblSponsors
ORDER BY Weight DESC
Guys  just wanted to say  even after all of these years  that (a) this site rocks, and (b) the solutions provided here are insightful and wonderful ... they're exactly what I was looking for. Aside form that, they're *sexy* to get my head around.
@Alyssa,
Ha ha, glad this still has some value a few years later :)
@Ben, dude ... you've got such an incredible thing going here. Love the site, love the way you have it set up, love the autoemails I get, love the kinky points. Don't know what to do with them, just want to horde lots of them.
 PART 1 
Ben, I have very important information that proves your formula is actually the most accurate formula available right now. Some people in this blog have mentioned Mark Rasmussens formula he posts using NewID() and his formula is much simpler and faster. BUT, it is totally wrong. He and I have been going back and forth for a few weeks now and I finally posted what he would not post, actual results. He instead focused on the the NewID() being a good use of randomizing but he would never finish out his formula with his own Weight or Points in his example. So, I did if for him.
Bad news for him and good for you. I even direct people in my response back to this article. Here is what I found out. Very long but very good reading with actual numbers:
( I do have 1 request. Your example was written for Cold Fusion and I am no SQL Server guy like Mark is below and I do everything through old VB Script using standard SQL lanquage. There are tons of people just like me looking for the base SQL Code and could you convert this and remove the Cold Fusion parts and strip it down the basic SQL anyone can run on any web programming lanquage they use to connect to SQL Server????????)
 IMPORTANT 
Mark, I have been busy so I could not respond to your last comment. However, I finally had to time to run your script all the way through and I have BAD news. It does NOT work just like I suggested. I now have the SQL results to back up my claim and the numbers clearly show the formula is completely wrong.
Now, your NewID() calculations ARE correct and there are plenty of other sites that have the same identical breakdowns of the NewID() distribution but that is the only part of the formula that works.
So everyone else understands lets first go over the EXPECTED numbers of what a "random weighted sample" should be.
Weighted Numbers by their definition is the (weight assigned for 1 item) / (total of all weights for all items). That is a weighted number. That's a division calculation. That is the weight of EACH item compared to the sum of all the items.
So, using Marks own numbers we have the following data:
Peter (weight = 17)
Mark (weight = 25)
Kirsten (weight = 33)
Jakob (weight = 12)
Mads (weight = 4)
Anders (weight = 0)
So, if you add up all the weights or points you get a total weight of 91.
So now we can get our ACTUAL EXPECTED WEIGHTING for each person. If you were to do this mathematically for any business critical application these are the numbers YOU and YOUR BOSS and YOUR CUSTOMERS would expect. Anyone can calculate these numbers so a customer paying you would expect to see these numbers.
Peter: 17/91 = 18.7%
Mark: 25/91 = 27.5%
Kirsten: 33/91 = 36.3%
Jacob: 12/91 = 13.2%
Mads: 4/91 = 4.4%
Anders: 0/01 = 0%
THESE ARE THE ACTUAL NUMBERS AND ANY FORMULA SHOULD BE +/ 1% of these numbers in large batch runs. Add up all those %'s and you get 100% as a check.
Now, the bad news for Mark's Formula  It doesn't work. The numbers, just as I had mentioned above in previous posts will produce wrong numbers  REGARDLESS whether we use the NewID() function. In fact the NewID() is actually part of the problem as I will explain below.
 PART 3 
Here are the actual numbers for both a batch run of 1,000 and 10,000. Large enough to prove the formula is wrong.
FOR THE BATCH RUN OF 1,000 HERE ARE THE NUMBERS:
Peter: 103 which equals 10.3%  should be 18.7%
Mark: 296 which equals 29.6%  should be 27.5%
Kirsten: 564 which equals 56.4%  should be 36.3%
Jakob: 37 which equals 3.7%  should be 13.2%
Mads: 0 which equals 0%  should be 4.4%
Anders: 0 all equal 0% which is correct, never pulled it
 OVERVIEW: if you total the numbers they equal to 1000
BOTTOM LINE: results not even close to expected values
FOR THE BATCH RUN OF 10,000 HERE ARE THE NUMBERS:
Peter: 1106 which equals 11.06%  should be 18.7%
Mark: 3154 which equals 31.54%  should be 27.5%
Kirsten: 5451 which equals 54.51%  should be 36.3%
Jakob: 282 which equals 2.82%  should be 13.2%
Mads: 6 which equals 0.6%  should be 4.4%
Anders: 0 all equal 0% which is correct, never pulled it
 OVERVIEW: if you total the numbers they equal to 10,000
BOTTOM LINE: numbers for 1k are almost identical to 10k which means the formula is running a nearly identical result pattern but both are WAY TO FAR OFF from the expected results. That proves the formula is wrong.
Why I believe the formula is wrong. The NewID() is not a stepping sequential increment. The numbers are all over the place. To properly, in my opinion, do a seeding you would have to do a 1 to x increment and then ONLY seed by those numbers. If you have 1,000 then you should be seeding ONLY with the numbers 1 to 1000 randomly multiplied to each rows WEIGHT or POINTS then take the highest scoring 1 and that is your winner. (See the link below as this other person does something similar to this and his formula is the most accurate you can get) In doing it this way you would not get wide ranging way out of wack calculations causing the % error that I mentioned in my other posts above. This is the flaw with using NewID() is that the % error using it is huge. JUST AS I SAID IN EARLIER POSTS the WEIGHTS or POINTS in the middle are somewhat closer to being accurate but lower numbers and higher numbers are horribly way off. The ONLY time the NewID() would be accurate is if you seed using it on a very, very, very large sample at the upper limits of how many NewID() numbers can be assigned. Then it would be as close to 1 to x as possible. This is important to understand. if the query you are doing is seeding 10,000 rows but NewID() can do 100,000,000,000,000 or what ever number that is you can see you are multiplying some rows by numbers that are so huge and others that are so small causing very large % errors. so instead of being seeded by 1 to 10,000 it is being seeded in the end by 1 to 100,000,000,000,000.......not very accurate.
There is probably a fix to this. However, I found another person who did it a different way, still using New(ID) but he took it a step further and his results are the most accurate you can get. VERY ACCURATE. What's crazy is that people on his blog point back to your calculation as a better, simpler way, only to find out now it is not accurate. The link to that other formula and explanation is:
http://www.bennadel.com/blog/1472AskBenSelectingARandomRowFromAWeightedFilteredRecordSet.htm
In a nutshell what he is doing is taking weight and using a pivot table to enter in the person x times for every WEIGHT OR POINT. Basically Peter would be entered in 17 times, Mark 25 times, etc. In the end you get a pivot table that enters everyone in the total amount of times of their WEIGHT or POINT. THEN he sorts by the New(ID). Far more accurate. There is NO multiplication and there is NO RANDOM being used. Now you are simply picking out a number from the hat in a raffle. The more points you have the more numbers you have in the hat and your chances increases. THE MOST ACCURATE WAY TO DO THIS.....
 PART 4 
Now, to make this formula even worse, lets take a more real world example. Lets say this was for an ad banner system. Customers like Kirsten and Jakob want to pay the same amount for just 1 banner but break up that same amount of exposure into multiple banners. Kirsten wants to show 3 banners instead of everyone elses 1 banner. To do this you would simply divide up her standard WEIGHT or POINTS by 3 for the 3 banners. Thus making here be entered into the table 3 times:
Kirsten1: (33 Points / 3) = 11 points for the 1st banner
Kirsten2: (33 Points / 3) = 11 points for the 2nd banner
Kirsten3: (33 Points / 3) = 11 points for the 3rd banner
(the numbers could be divided any way but equal back to her original 33 points. Could be 15, 10, 8 for example)
Jakob wants 2 banners and gets entered in 2 times:
Jakob1: (12 Points / 2) = 6 points for the 1st banner
Jakob2: (12 Points / 2) = 6 points for the 2nd banner
Now, IF YOUR FORMULA WAS CORRECT this would NOT matter that they broke up their weights into multiple weights that equal the main original weight. If the formula worked correctly then each of the individual banners would add together and equal what 1 banner would equal. Your formula does NOT do this.
In fact, as we proved, smaller numbers are way too far away and inaccurate from their expected value. So, using this formula if Kirsten and Jakob were to convert from 1 banner to more than 1 banner this would actually punish them because the WEIGHT or POINTS would drop and the formula punishes them because it treats lower WEIGHTS or POINTS worse off and the actual query from the database shows they will not get enough views.
I dont want to show the results, it looks really bad on this formula and makes things even worse.
 LAST PART 
the above was disproving Mark Rusmussens formula that some people in this blog article refer back to as possibly a better method than the original author Ben. However, when someone doesn't run a full test then other people take the formula as proven. It sadly was NOT.
No telling how many people have read his post and have implemented his formula only to be getting wrong results and they dont even realize it.
Imagine using Mark's formula if a bank or a casino dealing with large amounts of money. At some point someone would lose a ton of money.
The reason for all of these and the detail is to show that the ORIGINAL formula from BEN who is the author is actually GENIUS and the proper way to do it.
If you were holding a raffle or lottery someone buys more tickets to increase their odds. *** BENS formula is the ONLY one published on the web that actually does this same thing. Each person's weight is effectively them buying themselves more lottery tickets increasing their odds but only as far as how many others have done the same thing.
I believe there is a simpler way to do this formula without the Pivot table by simply doing a count for each persons based on their weight ( so 1 to x where x is their weight) then summing up all of these weights. YOu can then still ORDER BY the NewID() only as long as you never multiply anything by NewID() just like Ben does in his formula. Using NewID there is OK since all we are looking for is a unique number for every row and sorting by it then taking the top 1 for the winner. Genius on his part. I would like to see if there is to simplify this down even further into a shorter formula......
I think it's pretty bad form to crosspost a comment as we'll now have to continue the discussion on two places to keep the thread complete. You should've just referenced your comment instead, imho.
Anyways, my comment:
You're wrong, but right at the same time.
There is a major flaw in my sample that renders it useless. However, neither RAND() nor the use of NEWID() has anything to do with the error. Instead, it's my most naive algorithm that's incorrect, a fact I didn't realize untill testing my sample through rigorously due to your comments.
For a correct weighted random implementation, see Dems' answer at Stack Overflow: http://stackoverflow.com/questions/58457/randomweightedchoiceintsql/454454#454454
@WalkOffHomerun, @Mark,
That was a lot of information to take in :) It took me a little bit to follow the conversation (as I was not always aware which voice I was reading in).
@Mark,
Thanks for Stack Overflow link  though I don't think I've ever executed a WHILE loop inside a SQL query before. I've been doing a lot of MySQL the last few years and I'm not even sure if such constructs are available in the MySQL engine (outside of stored procedures).
In any sense, however, it's some juicy food for thought!
Ben, your original method is simply genius because it really gets as close as you can get to a true random weighted selection. There are however alot of people who might not be Cold Fusion developers or might use other databases or other scripting languages and i have been trying to port over your example using Cold Fusion language using MS SQL Server to simple SQL language anyone can use. this formula could be used for any scripting language or development software so a conversion from you to simple sql language would be appreciated. many people like myself are not hard core database guys so we scour the web looking for simple examples we can use on our own. I personally still do most of my stuff with good old VB Script and I am trying to learn more about SQL joins and pivot tables and how they work. so i am still trying to convert your method to work for my application with SQL Server and VB Script......
@WalkOffHomerun,
Well, I certainly appreciate that :) And, Pivot tables are really cool! When someone showed me that concept, it kind of blew my mind.
vow...nice man... it worked
Post A Comment