Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.
Meanwhile on Twitter
Loading latest tweet...
Ben Nadel at cf.Objective() 2010 (Minneapolis, MN) with:

Create A Running Average Without Storing Individual Values

By Ben Nadel on
Tags: ColdFusion

I was thinking about how to be answer a new ColdFusion-based "Ask Ben" question about a rating system when I thought about creating numeric averages. All my life, when creating an average, I followed the simple formula of dividing the sum of a collection by the number of its entries:

Average = Sum / N

This, of course, requires you to have both the sum and the count of a collection of values. But what if we wanted to keep track of a set average as the set of values changed over time (such as the ratings in a rating system). Unless I was storing user-specific data with each individual rating, I wouldn't want to have to store each rating individually. So this got me thinking about weighted averages. I wondered, would it be possible to augment an average simply by averaging a new value into the average in a weighted manner. Meaning, given an average value based on N numbers, could I simply average in a new number by giving the current average a weight of N and the new number a weight of 1?

I am sure that anyone with a decent background in math is reading this and saying, "D'uh," but to me, this question was not immediately obvious. As such, I ran some tests to see if this worked:

  • <!---
  • Create an array in which to hold our original set or
  • random numbers (for which we will be finding an average).
  • --->
  • <cfset randomNumbers = [] />
  •  
  •  
  • <!---
  • Now, let's create some random numbers to store in the array.
  • We're going to keep the number relatively small since a small
  • set will be influenced more per each numbers.
  • --->
  • <cfloop
  • index="index"
  • from="1"
  • to="10"
  • step="1">
  •  
  • <!--- Add a new, random number to the collection. --->
  • <cfset arrayAppend(
  • randomNumbers,
  • randRange( 1, 10 )
  • ) />
  •  
  • </cfloop>
  •  
  •  
  • <!---
  • At this point, we have our number collection. Let's figure
  • out the average of this collection, before we add anything
  • new. (NOTE: getting count for use later).
  • --->
  •  
  • <!--- Get the number of random numbers we created. --->
  • <cfset randomCount = arrayLen( randomNumbers ) />
  •  
  • <!--- Get the current sum of the collection. --->
  • <cfset baseAverage = arrayAvg( randomNumbers ) />
  •  
  •  
  • <!---
  • Now, let's create a new random number that we want to use
  • to update our base average.
  • --->
  • <cfset newNumber = randRange( 1, 10 ) />
  •  
  •  
  • <!---
  • At this point, we have two options:
  •  
  • 1. We can add the new number to the existing collection and
  • then take a new average.
  •  
  • 2. We can take the existing average of the collection and
  • then combine it with the new number in a *weighted* fashion
  • such that we only need the average and the count.
  • --->
  •  
  •  
  • <!--- Method 1: Add number to the collection. --->
  • <cfset arrayAppend( randomNumbers, newNumber ) />
  •  
  • Method 1 Average: #arrayAvg( randomNumbers )#<br />
  •  
  •  
  • <!--- Method 2: Create weighted average based on count. --->
  • <cfset newAverage = (
  • ((baseAverage * randomCount) + newNumber) /
  • (randomCount + 1)
  • ) />
  •  
  • Method 2 Average: #newAverage#<br />

As you can see in the code, I am creating a collection of 10 random numbers and then an 11th random number. To figure out how the 11th number changes the average of the first 10, I try two different methods:

  1. Simply adding the 11th number to the existing collection and dividing by 11.
  2. Multiplying the average of the first 10 by 10 and then adding the 11th (creating a weighted sum) and then dividing by 11.

I ran this code a few times to make sure nothing happened by coincidence:

Method 1 Average: 6.27272727273
Method 2 Average: 6.27272727273

Method 1 Average: 6
Method 2 Average: 6

Method 1 Average: 5.27272727273
Method 2 Average: 5.27272727273

Method 1 Average: 3.81818181818
Method 2 Average: 3.81818181818

Method 1 Average: 6.45454545455
Method 2 Average: 6.45454545455

As you can see, after several runs, both average-augmentation methods come up with the same value each time. This is really awesome (and again, I'm sure obvious to the more Mathletic among you)! What this means is that if we need to grow an average over time, we don't actually need to store the individual values - we only need to store the set size and the set average at any given time. This should make my new "Ask Ben" answer a bit more straightforward!



Reader Comments

You can easily prove this by applying some calculus that even I can understand: ;)

Given
a) M = Sum / N;

If you add the value X to the collection, then
b) Sum = Sum + X;
c) N = N + 1

now fill in b) and c) into a) :

M = (Sum + X) / (N + 1)

But it is of course always good to check these things with some concrete implementation, to see if you really understand what is going on

Reply to this Comment

Whether or not it's obvious, I think it's good to post things like this. There are a surprising number of little math tricks that are handy to know, and you can save someone who doesn't know one a lot of time by sharing it.

This one I knew ... math is my strong point, plus with a gaming/sports background, I got into all kinds of this stuff growing up (batting average/winning percentage, ERA, etc.), and then working for a market research company, we got into more little tricks like flipping a scale: subtract a response from the sum of the top and bottom values to get the opposite value. (answers 1 to 5, a 2 becomes a 4: (5+1)-2)

Reply to this Comment

Mmmm... looks like I missed a part in my above comment.

if
a) M = Sum / N; then
d) Sum = M * N; and
e) N = Sum / M

As I mentioned when you add a number X to the collection, then
b) Sum = Sum + X; and
c) N = N + 1

By inserting d) into b) you get
f) Sum = (M * N) + X

If you insert f) and c) into a) you get
g) M = (M * N) + X / (N + 1)

Reply to this Comment

This is how I always want to implement average rating kind of features (with just two fields) but I usually have the requirement for ensuring a user can only add a rating once, hence resorting to the usual "store each value with a userID in an intermediate table" approach :P

Reply to this Comment

@Phillip,

I like to think of myself as average, but only in a minority of situations :)

@Jim,

Ah, good to know this has a name.

@Martijn,

I'm just going to nod and say, Yes.

@Dave,

Most agreed. Plus, when you are solving problems, I think often times we don't always concentrate on any given aspect of it in which we might find efficiencies. All to say, it's good to think about things that might be obvious simply because they might be overlooked in different contexts as good solutions.

Reply to this Comment

@Justin,

Ahhh, good point! I lost sight of that as I was thinking about weighted averages, but you are absolutely, 100% correct.

Reply to this Comment

@Dave Totally agree that tricks like this one by Ben and the one you just mentioned are incredibly handy sometimes. Do you know of any reference page where they collect things like these?

Reply to this Comment

@Martijn, I don't know about a page like that ... but then I'm pretty good at setting them up myself, so I don't usually look. (If it's for work, I might look for a CFC, but at home I typically want to do it myself.)

It seems like there should be a language-specific collection somewhere, though.

Reply to this Comment

Just thought I'd add a bit to the discussion :)

The only caveat with this approach is that it
becomes less accurate over time because you're storing the results with limited decimal precision. It's espcially subject to variation if you use floats. Don't get me wrong - this approach is completely fine for calculations that don't require pin point precision, but maybe not for things that require accurate fractional values beyond a couple of decimal places, like financial or scientific numer crunching.

With floats, the margin of error increases fairly quickly, since every single time you add a number into the average, the lack of precision in the float data type gets factored into your calcuation as well. With decimals, the level at which your precision starts to suffer is dictated by the number of decimal places you're using. For example, if you're only storing 2 decimal places, the precision starts to get bad after a measly 100 values.

Again though, whether or not these margins of error are OK depends entirely on what kind of software you're writing.

Reply to this Comment

Nice dodge! I dig the idea to force out the last bit of performance out of a chunk of code, even though it's such a minor thing.
Heard of this kinda approach in connection with "running sums".

@Roland Collins Yet, this approach still "looks" quite numerical stable (see testruns). I think the usage of floating point arithmetic in general is the caveat rather than this very approach.

Reply to this Comment

@Martin - not just floating point though. Depending on what langauge you're working in, decimals can cause just as many headaches if they're not precise enough.

But again, for most applications, this approach works great. I'm just pointing out that there are several classes of problems for which the margin of error in this calculation is unacceptable.

Reply to this Comment

If I wanted to do this and only carry two numbers, I'd keep track of the sum and N. Then you are pretty much accurate all the time.

average = (sum + new_number) / (N + 1)

But all this was in a former life as an analyst at DIA.

Reply to this Comment

@Ben, I think you're going about this the wrong way. You're trying to use complicated techniques when there is a simple and beautiful technique readily available (a la Gary Funk's comment).

Instead of tracking the current average and the current count, and from it reverse-engineering the sum, you should keep track of the current sum and the current count, and straightforwardly calculate the average.

In the question of how to calculate new averages from old averages, obviously the difference in complexity is minimal. But reverse-engineering the old input from the old result in order to compute a new result can get very complicated very quickly, when the computations get even just a little harder. So why not make it a practice always to keep the flow of computation always going from input to output?

Reply to this Comment

@Martijn van der Woud

Exactly... this is probably the first questions they asked us when we very first started programming. And that's the solution we came up with. Quick and simple.

Reply to this Comment

@Roland,

Good point regarding the precision over time. I hadn't thought of that as float precision has not been something that has bitten me before.

@Gary,

Oh man, that makes a lot of sense. That never even occurred to me, but you are absolutely correct - that is much better than my approach.

@Justice,

Yeah, Gary's ideas is really good. But as far as reverse engineering, I wouldn't say that I am doing that - there would be no way to reverse engineer the values; I'm simply creating a new, weighted average.

Reply to this Comment

::: WARNING - TAKE CAUTION IN READING THIS REPLY IT MIGHT MAKE YOUR HEAD HURT - OR NOT :::

Something to consider. Assume you have a website where you did in fact want to see how each of your users voted on a certain poll or question but still want to be able to calculate the averages. It is really simple when you see it in action, and once you start playing around with the idea the sky is the limit...

Okay Assume this is your database table ratings or whatever

POLLS, RATINGS or WHATEVER

ID | RATING | VOTERS
::::::::::::::::::::::::::::::::::::::::
1 | 4,3,2,5,1,3,4,3 | 1,8,2,4,5,3,7,9
----------------------------------------

Now of course you would query your database to get the voters and there ratings.

Once you have retrieved the data you can store the data into variables of your choosing.

For Example

<cfset myRatings = "#queryName.ratings#">
<cfset myVoters = "#queryName.voters#">

Okay Now to Continue...

From this we would be okay so how do we count how many votes there was?

Well ColdFusion has a beautiful function called listLen it will tell you how long your list ( number of voters) is which you will use later on to calculate your average.

<cfset myRatings_len = listLen(myRatings) >

After you have your list length you can do a list loop in which you add each value together. Once you have each value totaled you can then do your math by having your total divided by the length of the list.

Now every time you have a new voter you can just grab the list and then do a listPrepend which would add the voters to the front of the list allowing you to see who voted, and in which order. You could even have another field that has the time of the vote, or ip to help prevent double voting.

If someone want to see a working example just let me know and I will have the source code ready and what knot.

* To learn more about list and other fun stuff go to

http://www.quackit.com/coldfusion/tutorial/coldfusion_lists.cfm

Reply to this Comment

@Jody: From a ColdFusion perspective, that's a reasonable approach to storing both values and users, but it's probably more useful for smaller datasets. With larger datasets, you may run into problems storing the lists, depending on the database you're using.

From a database perspective, if you want to keep the individual values, it is probably better practice to store the records separately rather than combining them into a single record. By doing so, you're able to use the native database functions for mean (average) and counts, as well as for more advanced metrics if you have a need for them. It is also considerably easier to modify the dataset or to store additional information if necessary when each vote is its own row in the table.

Reply to this Comment

Also, just as an FYI, you don't need to wrap your variables in quotes and pound signs. This will work jsut fine:

cfset myRatings = queryName.ratings
cfset myVoters = queryName.voters

In fact, when you do wrap them, you add probably add overhead in this case due to string conversions.

Reply to this Comment

It's okay to disagree its human nature. But, I don't think so. I do believe it is better to stay consistent. Can you please give me one example where storing a variable in that matter ruined the application, or slowed it down? I bet you can't... I can't think of one time it gave me an error.

"However, surrounding all attribute values in quotation marks is more consistent with HTML coding style." <--- quote from adobe site...

So my logic for doing so is correct.

but it does say its not nesscary to use the pound signs... ( in which I already know) it said it wasn't nesscary not that you shouldn't do it.

Reply to this Comment

@Jody.

I am sure that a bad technique for storing data will not give you an error if you also write corresponding code for storing and retrieving the data. But it does make many things far more difficult. For example, you can no longer ask the database to perform averages or standard deviations on the data (you will want this if you are tracking votes). It is far more difficult to add a column to the votes. In fact, what you are doing is inventing your own naive storage mechanism based on the database system, which means now you need to implement new techniques or conventions for all of the basic operations on data. The alternative, of course, is simply to create and use a table of votes, with one row per vote.

It is not better to stay consistent for the purpose of staying consistent, when that means you start doing the wrong things or you start doing things that make no sense. In this case, of course, that is string-interpolating expressions when there is absolutely no need to. Doing so wrong in general. It happens to work sometimes because all simple data types - booleans, numerics, strings, dates and times - are stored in ColdFusion as strings anyway (if they were not stored as strings, string-interpolated numeric expressions would get you a string representation of a numeric when what you wanted would have been a numeric). But it will not work with complex data types - arrays, structs, queries, components, java objects, com objects, net objects, etc.

Reply to this Comment

You reduce the purpose of databases to absurdity by storing lists in a single field. As already stressed by Dave it's better (for the performance) to "roll out" the lists on to a convenient scheme. Database management systems are superfast. Apart from totaling and maximization, it's possible to use regular expressions in the queries. It's always faster than iterating over the data "by hand".

Reply to this Comment

How is it a bad technique? What if I don't want to use the database functionality? What if I think the database functionality blows? What if I want to reduce the amount of rows of data I have? Are you saying that is a bad method? I have written some pretty robust systems ( in php ) for a few companies using this same technique. Not only did I increase performance but the maintenance went down tremendously.

Reply to this Comment

If you don't want to use the database functionality, then don't use a database ;)

I can't comment on what you've done in the past, and if your solution met the needs of the problem you were trying to solve, then that's great. I wouldn't ever presume that the work you did wasn't solid. As long as you're meeting the needs of your client or company, then more power to you.

We're just trying to point out that certain aspects of the way you solved the problem are not necessarily the best approach under usual circumstances. Without having seen the system that you were working on, it's really hard to provide any valid feedback. All we're saying is that, in general, the approach you have outlined doesn't really adhere to best practices. Again - this is not a condemnation of your work. We're merely trying to point out that, in general, based on what little we know of what you've worked on, the approach you took would not be the preferred course.

Reply to this Comment

@Roland

I generally only like to use databases to store and retrieve information. I do not like to use the database to do my math for me. Why? I have more precision if I do it own my own. Also, to call my approach "Naive" is a little bit rude. I can clearly tell that you are a very negative person who always has a complaint even from your first post on this particular blog. Criticism is okay but to ridicule an approach especially when the approach works is just mean -- yeah I said it MEAN! -- ( ha ha ). You are going to have to remember something Roland and that theres is always more than one way to do a job. I know many programmers that would tell you that using your own script, instead of using the database to calculate totals is far more effective and precise. But hey what do they know they only work for a few of the top 500 companies in America. So bottom line don't say anything if the approach works, everyone is different some people are confined to there little box, while others are outside the box looking for more effective ways to optimize and store, calculate and manipulate data. You can think of it like a car drive, surely if you take the highway you will get there easier, but if you take the back roads you generally would end up getting there faster. -- If you don't know what that means you should take Philosophy.

Well enough said, I have work to do.

Jody Fitzptrick

Reply to this Comment

@Martijn,

Agreed. I think we all love lists and we all love databases... so, let's just move on.

Reply to this Comment

Sorry Ben - last post, I promise.

Jody, at no point did I call your solution naive, nor did I riducule you. In fact, I clearly stated that if you're getting the job done, then that's great.

I'm sorry if you took offense to anything. My only intent was to suggest that there are other ways of tackling the problem. Again, my apologies if I offended you.

Reply to this Comment

It's all good, I took no offense... It's the internet, If I took everything to heart that I read on the internet I would be depressed for ever.

Reply to this Comment

nice but a the need to continually increment the sample count will/maybe cause counter overflows etc not good in spaceships etc ... I started with your method however I arrived at this one ....

loop: sum for n samples--do this for n samples
store sum in sum1
average = sum1/n
sum1 = sum1/2 --initialise sum1 to half sum1
goto loop --keep going forever

the disadavantage is that this method only updates the average every n samples.

Reply to this Comment

correction .. sum for n/2 samples where n is the no of samples to be averaged.

loop: sum for n/2 samples --do this for n/2 samples
store sum in sum1
average = sum1/n
sum1 = sum1/2 --initialise sum1 to half sum1
goto loop --keep going forever

Reply to this Comment

If you are afraid to overflow Sum, here's an equivalent formula:

M_0 = 0
M_{i+1} = M_i + (x_{i+1} - M_i)/(i+1)

In pseudo-code:

n = M = 0
...
n = n + 1
M = M + (x-M)/n

Here M is the (cumulative moving) average, x is the new value in the
sequence, n is the count of values.

The downside of this method is that when averaging a huge sequence of
values you may loose precision over time because each time you get an
imprecise value for (the floating-point) M.

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.