__Quant QUIS™: Information About Quantitative Analysis__

This note outlines several approaches that have been taken in the analysis of QUIS™ data.

__Forming Data Sets__: Once the QUIS™ data have been collected, the first decision is whether to analyze the data as a whole or in groups. Obviously, if you are comparing two different pieces of software, you will want to group the data by, say, Software Package A and Software Package B. On the other hand, if you are interested in differences between types of users of the same package, you would group by type of user, say, User Group 1 and User Group 2. When forming data sets, we will be interested in statistics within each group as well as hypothesis tests between groups. Let's first consider the statistics within groups.

__Profiles and Diagnostic Tests__: One of the most useful analyses, particularly for iterative testing and design, is the profile. The profile reveals the strengths and weaknesses of the software program or workstation by showing the deviations of the means above and below a criterion. A profile is generated by calculating the means and standard deviations for each item in the QUIS. The means are then graphed on a scale from 1 to 9 as shown in the figure below:

The midpoint of the rating scale (5) can be used as a criterion. If an item's mean is above 5, the item is perceived as better than an arbitrary, mediocre value. However, that is generally not good enough. We may also use the overall mean of the group as a criterion. Such a mean is shown in the figure.
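A minimal sketch of computing such a profile in Python, assuming the ratings are arranged as a respondents-by-items array (the array and its dimensions here are hypothetical, not from the QUIS itself):

```python
import numpy as np

# Hypothetical data: 12 respondents x 5 QUIS items, each rated 1-9
rng = np.random.default_rng(0)
ratings = rng.integers(1, 10, size=(12, 5)).astype(float)

item_means = ratings.mean(axis=0)       # mean rating per item
item_sds = ratings.std(axis=0, ddof=1)  # sample standard deviation per item
overall_mean = ratings.mean()           # group mean, usable as a criterion

# Deviations above/below the scale midpoint (5) form the profile
profile = item_means - 5.0
```

Plotting `item_means` against the item labels, with horizontal lines at 5 and at `overall_mean`, reproduces the kind of profile chart the figure describes.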

It is useful to plot a confidence interval around each mean in order to determine its reliability. The confidence interval also indicates whether the mean of an item is significantly above or below some criterion. For example, if a 95% confidence interval includes 5 within its boundaries, then the mean is not significantly different from 5 at the .05 level of significance.
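The confidence-interval check can be sketched as follows for a single item's ratings (the data are hypothetical; the criterion of 5 is the scale midpoint discussed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
item = rng.integers(1, 10, size=12).astype(float)  # one item's ratings (hypothetical)

mean = item.mean()
se = stats.sem(item)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(item) - 1, loc=mean, scale=se)

# If 5 lies inside [lo, hi], the item's mean does not differ
# from the midpoint criterion at the .05 level.
differs_from_midpoint = not (lo <= 5.0 <= hi)
```

This interval test is equivalent to a one-sample t-test of the item mean against 5 at the .05 level.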

The profile can be used to identify the areas in the application that are particularly good or particularly bad. Start with the item having the lowest mean. Identify flaws in the software that may have led to this low mean. Then go to the next lowest item and repeat. Do this until you are satisfied that you have identified the major problems. Then start with the item having the highest mean. Ask yourself why this aspect was rated so highly and how it can be used to further enhance the software. Then go to the next highest item and repeat. Again, do this until you are satisfied that you have identified all of the strong points of the software.

A more sensitive and statistically powerful technique for identifying the strengths and weaknesses is to use a within-subject approach. For each respondent, compute a mean of all of his or her ratings. Then, for each of the respondent's ratings, take the deviation between that rating and the respondent's mean, d_{ik} = X_{ik} - M_{i.}, where X_{ik} is the rating for item *k* by respondent *i* and M_{i.} is the mean for respondent *i* across all items. A simple t-test on these deviations, to see if they are significantly different from zero for a particular item, will indicate whether the item is perceived as high or low relative to each respondent's average rating. In general, it is nice to have a sample size of at least 20 for statistical purposes. However, realizing that many usability studies limit their samples to about 10, it is suggested that one avoid statistical tests and generalizations by presenting only means and focusing on the highest and lowest ratings.
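The within-subject deviations and their t-tests can be sketched as follows (the ratings array is hypothetical; the deviations implement d_{ik} = X_{ik} - M_{i.} from above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
ratings = rng.integers(1, 10, size=(20, 5)).astype(float)  # 20 respondents x 5 items

respondent_means = ratings.mean(axis=1, keepdims=True)  # M_i. for each respondent
deviations = ratings - respondent_means                 # d_ik = X_ik - M_i.

# One-sample t-test per item: are its deviations significantly nonzero?
t_stats, p_values = stats.ttest_1samp(deviations, 0.0, axis=0)
```

Items with a significant positive mean deviation are rated high relative to each respondent's own average, and vice versa for negative deviations.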

__Comparing Groups__: When your data are composed of groups, you can make comparisons between groups at the overall level and at the level of individual items. However, remember that the more statistical tests you run, the greater your probability of a Type I error (a spurious result). To guard against that, you should consider using only the .01 or .001 level of significance.

For overall comparisons you may find the mean of the Overall Ratings (items 3.1 to 3.6) for each respondent in each group. Then compare the group means using a t-test. Means may also be computed for sections of the QUIS™ or for all of the items on the QUIS™ and compared between groups. Or, of course, you may make comparisons at the individual item level. But again, beware of Type I error.
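A minimal sketch of such a between-group comparison, assuming each value is one respondent's mean Overall Rating (the groups and their values are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(6.0, 1.5, size=15)  # per-respondent mean ratings, Group A
group_b = rng.normal(5.2, 1.5, size=15)  # per-respondent mean ratings, Group B

# Independent two-sample t-test between the groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Guard against Type I error from repeated testing by using
# a stricter criterion, e.g. .01 rather than .05
significant = p_value < 0.01
```

The same pattern applies at the section or item level; only the values fed into the test change, and the stricter alpha becomes more important as the number of tests grows.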

__More Sophisticated Analyses__: The number of analyses that one can play with is nearly endless, if you have enough data points. One should be cautious of over-analyzing the data; you will be bound to find something fascinating but unreliable. Nevertheless, some interesting additional analyses can be used to investigate the correlational structure of the items. These analyses can reveal the underlying importance or relevance of items to the users and to overall satisfaction. They include: factor analysis, item analysis, and hierarchical regression analysis.