The other day, I was thinking back to the year 2000 and the website, Hot-or-Not. If you are not familiar with this website, the concept was simple: men and women uploaded their photos and then you rated their attractiveness on a scale from 1-to-10 (10 being the most attractive). As trivial as this may sound, rating on a scale from 1-to-10 actually stressed me out. Thirteen years later, the same thing holds true. I don't like scales from 1-to-10; and, I think the reason for this is that selecting an appropriate rating poses a serious user experience (UX) problem.
On A Scale From 1 To 10
In a complete vacuum, if you looked at a scale from 1-to-10 where 1 is the least attractive and 10 is the most attractive, what would a 5 indicate? Logically speaking, a 5 is in the middle - it is neither attractive nor is it unattractive. So, if you rated someone a 5, logically speaking, they should neither feel (too) insulted nor (too) complimented.
The problem with this Spock-like analysis is that we don't live in a vacuum. In fact, most of us have at least 12 years of deeply influential experience under our belts: namely, school and scholastic testing. In school, tests were typically graded on a scale from 1 to 100, which can be easily mapped to a scale of 1 to 10 in the mind. And, what did school teach us about this scale?
- 1 - Fail
- 2 - Fail
- 3 - Fail
- 4 - Fail
- 5 - Fail
- 6 - Basically Fail
- 7 - Passing
- 8 - Good
- 9 - Great
- 10 - Excellent
Now, in this context, we have an extremely different picture. 5 is no longer indifferent; 5 is decidedly bad - very bad. And, so is 6 for that matter. Maybe when you get to 7 you can start to feel good; but, really, you should be aiming for an 8 or above.
My point here is not that test-taking is bad; my point is that rating things on a scale of arbitrary size is a shockingly complex user experience. Not only does it require a rich and consistent mental model, but it has no way of taking personal experience into account.
A better user experience (UX) would be to get a user to select an action that he or she would theoretically take. Different users will have different motivations; but at least with a selected action, we add a meaningful abstraction layer between analysis and outcome.
The Likert Scale - How Strongly Do You Agree?
The Likert scale is the most widely used approach to scaling responses in user surveys. The Likert scale starts to move in the right direction in that it is geared more towards Action rather than arbitrary scale. The problem with Likert, however, is that it pairs a set of nuanced choices with a non-nuanced statement.
A typical Likert scale question will pose a statement and then ask you how strongly you agree or disagree with said statement. So, for example, after staying at a Hotel, you may be presented with the following survey question:
I would recommend this hotel to my friends and family:
- Strongly Agree
- Strongly Disagree
This approach is almost good because it poses the question in terms of an action: recommending the hotel. The downfall here is that the user's relationship to said action is inappropriately nuanced. Actions are not nuanced; they are black and white. As Yoda once said, "Do. Or do not. There is no try."
Decision making is definitely nuanced. Decision making is very complex. Decision making calls on personal history, culture, knowledge, self-esteem, analysis, context, pros, cons, etc. But, once you make a decision, the outcome is simple - yes or no; agree or disagree.
Trying to merge these two different gestures into a single question forces the user to start performing overly-complex mental gymnastics. For example, I recently stayed at the Radisson Blu hotel at this year's cf.Objective conference out in Minnesota.
Would I recommend this hotel: Yes.
How strongly do I agree with this decision? Well, now I have to think about it, not just in my context, but in the context of the people to whom I would give the recommendation. The food was great. Everybody loves good food, right? But, they don't stock Monster energy drinks at the bar, only RedBull. Monster is really important to me because it's my caffeine. So, every night, I had to remember to trek into the Mall of America to buy Monster for the next morning. That kind of sucked. A lot. But, also I know that not everyone drinks Monster... so, should I let that influence my decision? Maybe I don't strongly-agree... maybe I only agree. But, now I'm feeling a lot of anxiety. What if I recommend this hotel and the people don't like it; am I going to be judged? What are they going to think about me?
At this point, you either try really hard to provide an answer based on deep thinking and exhaustive analysis; or, you do what I typically do and just say, "Screw it, Strongly agree."
In either case, the answer is probably not desirable; it's either overly simplified or overly complex.
Leverage Actions And The Wisdom Of Crowds
To provide the best user experience (UX) for rating things, you should present the user with a small set of clear-cut actions. While this is easy for the user, it is hard for the User-Experience Engineer. Anyone can throw up an arbitrary scale and get feedback; but, it takes deep analysis to figure out how to map a context onto a set of mutually-exclusive actions.
Take movie-ratings, as an example. Typically, movies are rated on a four-star scale. And while this is a relatively small scale, it suffers from all the same short-comings that the Hot-or-Not 10-point scale presents. Specifically, it doesn't abstract the decision making behind a set of outcomes.
I live this problem all the time when my girlfriend and I are trying to select a movie to rent. She sees a two-star rating and thinks, "Eww, it only got two stars, no way am I watching that!" I, on the other hand, see the same two-star rating and think, "Two stars - it's got to be at least decent, let's give it a shot."
The problem is that the two of us have two different mental models on which to draw. This allows us to consume the same information and make two different decisions. Neither of us is wrong.
If, however, movies were rated based on actions, rather than stars, our two mental models may be more in alignment. So, as an example, instead of stars, what if movies were rated based on three mutually-exclusive actions:
- See it in the theater.
- Wait for the rental.
- Don't ever watch it.
Then, rather than aggregating the wisdom-of-crowds as a star rating, you could present the distribution of choices:
- 54% of users said, "See it in the theater."
- 37% of users said, "Wait for the rental."
- 9% of users said, "Don't ever watch it."
If this result was presented as a star-rating, maybe it would get two-stars, maybe two-and-a-half (it's hard to tell). But, looking at the results in this format, here's what I can conclude:
91% of all users said, "see it" in one format or another, even if that format meant, wait for the rental.
This is powerful information; this is meaningful information; and, it's the kind of information that can be gathered most easily when users are asked to choose actions rather than attempt to explore, analyze, and expose their underlying decision-making process.
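As a rough sketch, here is one way the distribution above could be tallied in Python. The function name `action_distribution`, the `ACTIONS` list, and the sample votes are illustrative assumptions, made up only to reproduce the percentages in the example; this is not how any real rating site works.

```python
from collections import Counter

# Hypothetical action labels, copied from the three choices in the example.
ACTIONS = [
    "See it in the theater.",
    "Wait for the rental.",
    "Don't ever watch it.",
]


def action_distribution(votes):
    """Return the share of votes for each action as a whole-number percentage."""
    counts = Counter(votes)
    total = len(votes)
    if total == 0:
        # No votes yet: report 0% for every action instead of dividing by zero.
        return {action: 0 for action in ACTIONS}
    return {action: round(100 * counts[action] / total) for action in ACTIONS}


# Made-up sample of 100 votes that matches the percentages in the example.
votes = [ACTIONS[0]] * 54 + [ACTIONS[1]] * 37 + [ACTIONS[2]] * 9

dist = action_distribution(votes)

# Anyone who picked either of the first two actions is saying "see it".
would_watch = dist[ACTIONS[0]] + dist[ACTIONS[1]]

for action, percent in dist.items():
    print(f'{percent}% of users said, "{action}"')
print(f"{would_watch}% said to see it in one format or another")
```

Because each vote is a single, mutually-exclusive action, the shares can be added together to answer coarser questions (such as "see it at all?") without guessing what a given number of stars was supposed to mean.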
I always think of this cartoon any time I have to give a rating:
Ha ha, that is awesome! And so spot-on for what I'm thinking - thanks for the link. I love XKCD, but I don't think I've seen that one.
I need to catch up on my XKCD and my Oatmeal :D
@Ben, Yeah, it's a good topic... what does a rating really mean? I sometimes get why companies want to do the 1-10 system, but in the end it's how that data is used that matters. I like your example of the movie interest because as an end user you can interpret more than a single view of the data.
I work for a major online retailer that uses 5 stars (including fractional) and I constantly want to make them just go Rating: Great (4+) or Poor (less than 4).
Great insights, Ben! I recently stayed at a hostel in Madrid and ended up rating them 4 out of 5 stars for staff and 3 out of 5 stars for safety. They wrote to me and asked why I rated them 80% and 60%. I see what you're getting at with the grading scale. While the number conversions were accurate, I would never have *failed* the security of the hostel, it just wasn't great, and the staff was pretty good, just not excellent. Mental models didn't meet!
It is perplexing. Rating systems are fraught with many issues. Most of these are issues where people who have a stake in the outcome of the ratings will game the system. Look at Amazon ratings. These are mostly useless if you just look at the star ratings. Yelp is similar. There are a lot of planted ratings and there are people who copped an attitude and gave bad ratings. How do we use these?
Mostly, I look at the written reviews and try to see if the text lets me eliminate some reviews that I know were PR plants. Other times, you can recognize someone who is real since you remember another review they did. Yelp reviews can be good when the place responds appropriately to a bad review. Some people are just entitled assholes.
I see where you are going with your thoughts. Survey systems have done a lot of work in this area so you might want to look up some of that information.
Completely agree with your points, Ben. I worked for a major US government agency where our team government lead had his bonus and performance rating based on the average feedback score given on the agency's public website on a scale of 1-10. If I recall it hovered somewhere in the 7.3-7.8 range and his 'target' was something like 7.5. In weeks and months where the score was on the low end we constantly received heat for it and I don't think any of us understood how the guy's performance could be tied to a customer feedback score on a website.
That is terrible... 7.5 is a 75% satisfaction rating which is practically unheard of in online retail let alone your situation:
Well said @Ben,
Basically the rating of any entity depends on what you want to measure.
About a month back I tried to implement something like this, I wanted to measure the relevance of a comment in a discussion.
I added these ratings,
3. Well said
7. Very poor
And asked a few people to rate a complete Twitter discussion on a spreadsheet. Interestingly, none of them used 2, 4, or 7. Now I know how they are thinking.
Just my 2 cents.
Not only do ratings systems attempt to hammer the unquantifiable into a simple statistical model, but we've turned them into the biggest selection bias error since "Dewey Defeats Truman".
The question isn't "How do we make ranking systems better", it's "Why haven't we scrapped this clearly unworkable system for something better?"
In a blatant self-plug:
Sorry to keep jumping in, but this thread got me thinking and I threw together this demo:
Needs graphical assists like replacing the "1 star" text with stars (or merging the two bars into 4 big stars you select/slide between). I could also see showing only the label for the currently selected rating just below that rating's star.
I think the textual meaning is really useful. It helps guide users into understanding what the rating means as they are choosing.
Also, I'm not a fan of neutral ratings. People like or dislike something, even if only slightly.
@Jon - I always saw 'neutral' as 'I don't really care' or, in some cases, 'this doesn't apply to me but you didn't give me an n/a option' rather than an actual opinion.
@Joe, You're right that an N/A response may be appropriate in some cases, depending on the survey and question. If I'm trying to use a one-question rating to gauge a product though, I'd think that you wouldn't be interested in someone that doesn't really have an opinion.
I also think of it this way... if the user is bothering to respond to solicited feedback, they don't need another out; they can choose not to participate without needing a form option to allow them to participate by not really giving an answer.
N/As are pretty much required for multi-part surveys though, and if you give one then I think that N/A is better than a neutral answer; if you're being asked to rate something good or bad and you don't care, it's really N/A, not neutral.
A neutral selection is needed in most cases. It helps you since it increases your population of responses. Ratings start having increased validity as n increases. In an extreme example, you have ten positives and a thousand neutrals. If all you had was ten positives, you would assume a different answer than if you have 1010 responses with only 10 positive.
I also think that not applicable (N/A) is different than neutral. N/A denotes that someone didn't have an experience with the product vs. neutral which would mean that my experience was neither good nor bad.
In general though, these are the sorts of things that professional survey designers think about when constructing their instruments. A lot of money is put into designing surveys well so they get answers that have a good amount of confidence.
Folks are correct in asking all these sorts of questions up front so they can get the answers they want from the data. People have lost a lot of money from badly designed polls.
Amazon ratings are a really interesting experience for me (as an Amazon user). I think it really demonstrates the point of view from which I am experiencing purchasing books - with fear. I'm not afraid of books; I'm afraid of making a bad decision. Since I'm a really slow reader (I read "out loud", but in my head), the time-cost of a book is really high. And, I'm bad about stopping a book before it's done - I tend to just muscle through it.
So, when I get to Amazon and I see that a book has 13 five-star ratings and 1 one-star rating, I immediately bypass the 13 glowing recommendations to see why this one person thought the book was a disaster.
What did this *one* person see that 13 other people didn't see? Are they crazy? Or are they the only one who's sane?
I put way too much importance on the negative minority and too little importance on the positive majority. It's probably a really unhealthy way to buy books :(
Word up! That said, how was Madrid?
That is terrifying! Especially when you consider that people who are *happy* with an experience are probably not as likely to say *anything* at all.
Sounds like a really cool experiment! I love (and am forever frustrated by) trying to figure out why people react to different interfaces in different ways.
Excellent read - I left a comment on your blog.
Cool JSFiddle - I like where you're going. It helps make sure that a star rating is in alignment with its subsequent interpretation.
@Jon, @Joe, @Roger,
I definitely go back and forth about Neutral responses. I tend to not like them because I like that people have to commit, if only slightly, in one direction or the other. That said, I do think a "this doesn't apply to me," or a "I don't care about this topic," kind of answer can definitely be a good thing.
I would recommend Ben's post to others:
. Strongly Agree
. Ben should be president
. Ben should have a nobel peace prize
. See it in the theater
. Pray every day that ends in 'y' that his blog is never shut down
. Chuck Norris
Ha ha, I just saw this comment - it totally made my day, thanks!