
September 03, 2003

paper: why and when five users aren't enough

Why and When Five Test Users aren’t Enough
Alan Woolrych and Gilbert Cockton
Proceedings of IHM-HCI 2001

This paper argues that Nielsen’s assertion that five users are enough to uncover 85% of usability problems does not always hold up. In the end, we walk away with the admonition that five users may or may not be enough. Richer statistical models are needed, as well as good frequency and severity data. What does this mean for evaluators? Certainly this shouldn’t dissuade the use of usability evaluations, but it does imply that one should avoid false confidence and keep an eye on user/evaluator variability.

The paper starts by attacking the formula

ProblemsFound(i) = N ( 1 – ( 1 – lambda ) ^ i ),

in particular, the straightforward use of the single parameter value lambda = 0.31. Generalizing the formula shows we should actually expect, for n participants, that

ProblemsFound(n) = sum(j=1…N) ( 1 – ( 1 – lambda_j) ^ n ),

where lambda_j is the probability that a single user discovers usability problem j. Nielsen and Landauer’s formula assumes this probability is equal for all such problems, with lambda computed as the average of the empirically observed per-problem probabilities.
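To see why that averaging matters, here is a quick numerical sketch of both models (my own illustration, not from the paper; the per-problem rates in `lambdas` are invented, but chosen so they also average 0.31):

```python
# Sketch of the two models above (illustrative values, not data from the paper).

def problems_found_simple(n_users, n_problems, lam):
    """Nielsen & Landauer's model: one shared discovery rate lam."""
    return n_problems * (1 - (1 - lam) ** n_users)

def problems_found_general(n_users, lambdas):
    """Generalized model: problem j has its own discovery rate lambda_j."""
    return sum(1 - (1 - lam_j) ** n_users for lam_j in lambdas)

# With lambda = 0.31, five users are expected to find ~84% of problems:
print(problems_found_simple(5, 100, 0.31))   # ~84.4 of 100

# Hypothetical per-problem rates with the same mean (0.31), but where a
# minority of problems are easy to discover and most are hard:
lambdas = [0.8] * 30 + [0.1] * 70            # mean = (24 + 7) / 100 = 0.31
print(problems_found_general(5, lambdas))    # ~58.7 of 100
```

Even though both scenarios share the same average lambda, the heterogeneous one leaves five users finding roughly 59 of 100 problems instead of 84, because the long tail of hard-to-discover problems dominates.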

However, other studies, such as that by Spool and Schroeder, have found an average lambda as low as 0.081, showing that a study with ecologically valid tasks (in this case an unconstrained online shopping task with a high problem count N) can still miss many usability issues with only five users. Thus Nielsen’s claim that five is enough holds only under certain assumptions about problem discoverability.
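Plugging that average into the same simple model makes the gap concrete (my arithmetic, not figures from the paper):

```python
import math

lam = 0.081   # Spool and Schroeder's observed average discovery rate

# Five users now find only about a third of the problems:
print(1 - (1 - lam) ** 5)                                  # ~0.345

# Users needed to reach the 85% target at this rate:
print(math.ceil(math.log(1 - 0.85) / math.log(1 - lam)))   # 23
```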

Other issues abound as well. For instance, Nielsen’s model doesn’t take into account variance between users, which can strongly affect the number of users needed. Severity ratings complicate matters further: the authors found large shifts in severity ratings across different selections of five users. The choice of evaluation tasks also matters (changing tasks revealed previously undiscovered usability issues), as does the process of extracting usability problems from session data, which determines the estimated value of the true problem count N. A toy simulation of the selection effect appears below.
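Here is that toy simulation, on entirely made-up data, of how the set of discovered problems can shift depending on which five users happen to be selected from the same pool:

```python
import random

random.seed(0)

# Hypothetical study: 20 users, 30 problems. Each problem has its own
# discovery probability; none of these numbers come from the paper.
N_USERS, N_PROBLEMS = 20, 30
lambdas = [random.uniform(0.02, 0.6) for _ in range(N_PROBLEMS)]
found = [[random.random() < lam for lam in lambdas] for _ in range(N_USERS)]

def problems_seen(panel):
    """Indices of problems found by at least one user in the panel."""
    return {j for j in range(N_PROBLEMS) if any(found[u][j] for u in panel)}

# Two different selections of five users from the same pool:
a = problems_seen(random.sample(range(N_USERS), 5))
b = problems_seen(random.sample(range(N_USERS), 5))
print(len(a), len(b), len(a ^ b))   # problems found by each, and their disagreement
```

Two equally legitimate five-user panels surface noticeably different problem sets, which is exactly the kind of variability that undermines a fixed "five is enough" rule.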

Posted by jheer at September 3, 2003 07:16 PM
Comments

I wrote about an article in the recent Interactions magazine that addresses the foolishness of these "how many is enough?" articles.

Posted by: peterme at September 4, 2003 11:42 AM

Whoops. Forgot the link.

Posted by: peterme at September 4, 2003 04:25 PM

oh. you can't do hrefs here. anyway:

http://www.peterme.com/archives/000122.html

Posted by: peterme at September 4, 2003 04:26 PM

Thanks for the pointer! Interestingly, in Nielsen's book chapter on Heuristic Evaluation, he goes to some pains to argue for the cost-benefit ratio of the testing, NOT just for finding the most usability problems (which your summary of the Wixon article indicates is a focus of the larger debate). That analysis, however, uses the projected "optimal" number of users as part of the model, as well as cost numbers which may or may not be accurate. I have the sinking suspicion that a real statistician could rip the significance of such models to shreds. In my limited experience, these are not the sorts of things that reliably lend themselves to simplistic formal models.

Posted by: heerforce at September 4, 2003 10:40 PM
Trackback Pings
reading rainbow
Excerpt: oh my lord, heerison forcifer has been busy at grad skool (http://jheer.org/blog/archives/000063.html)! And it all looks fascinating. Here I go, scavenging his reading list again....
Weblog: Metamanda's Weblog
Tracked: September 6, 2003 01:01 AM

