Analyzing Likert Scale Items as a Group

Aug. 17, 2022, 2:25 p.m.


Ally Bullifent, a user experience researcher, recently contacted us to clarify how to analyze Likert scale data. Specifically, she wanted to know how to group a set of Likert scale questions to facilitate analysis.

Despite my own reservations about Likert scales (I rarely use them), they remain very popular among researchers of all disciplines. Our blog post on how to analyze Likert scale items, Three Ways to Analyze Likert Scales - Avoiding the Likert Crush, is one of our most popular posts, with over 16,000 views as of August 2022. In that post, we described how organizing a set of Likert scale questions into a coherent group, and analyzing them as a group, is one of the best ways to avoid corrupting your analysis. Ally asked for additional clarification on exactly how to do this.

Ally gave us the example of trying to measure the difference between two groups with respect to their attitudes towards dogs. She didn't give the details of her experiment, but, as an example, let's assume a user experience (UX) researcher is trying to compare two iterations of a web page. The web page has been designed to encourage potential pet owners to adopt a dog. The two iterations will be compared directly by randomizing the users to one of the web pages, and comparing the attitude towards dogs by getting each user to complete a short survey on a Likert scale. Comparing two different web interfaces is a common study method in UX design, and is colloquially called A/B testing.

Setting up the study design for this simple A/B test is not especially difficult or time-consuming, but getting it right from the outset is vital for a successful study. There are four critical steps: designing the scale, writing the survey, scoring the results, and analyzing the data.

1. Design the Scale

When creating a group of Likert questions that will be analyzed together, it is best to choose a single, well-researched scale for all the questions. The classic Likert scale gives five options, and looks like this:

  • Strongly disagree
  • Disagree
  • Neither agree nor disagree
  • Agree
  • Strongly agree

We strongly recommend using the standard scale, and not inventing your own. The standard scale has been in use for decades, and repeated use gives it great construct validity. If you choose your own scale, the burden will rest upon you to prove its validity.

Can a UX Design Change Attitudes Towards Dogs?

2. Write the Survey

Once you have decided on a scale, you are ready to start writing the questions. The trick here is that, in addition to using the same scale, the questions should satisfy two criteria: 1) they should all be testing the same idea, and 2) they should all be pointing in the same direction.

Ally gave us the example of looking at people's attitude towards dogs. A group of questions that would work well together might look like this:

Please Indicate Your Agreement with Each Statement Using the Scale Above:

  • I like dogs
  • I would consider owning a dog as a pet
  • Dogs make excellent companions
  • Dogs are truly 'A human's best friend'
  • I feel comfortable around dogs

In reviewing the statements above, we see that they form a natural group. They are all about dogs. And, although each statement is measuring a slightly different attitude, they clearly are all similar.

Also, they all point in the same direction. That is, a person who really likes dogs would probably answer Agree or Strongly Agree to all the statements. Conversely, a person who really hates dogs would be expected to answer Strongly Disagree or Disagree to all of them. Double-checking that all the questions point in the same direction is critical. For instance, the statement "Dogs are smelly" would probably not be a good fit. People who hate dogs are more likely to agree with this statement, so it is clearly incongruent with the rest of the questions. As we will see in the next step, questions that point in the wrong direction will destroy our ability to group them.

3. Score the Results

Scoring the results is not difficult, but it is also not completely intuitive. Again, getting this right is critical, or the analysis will fail. It involves two steps: providing a point-type score for the statements, and then adding up the scores.

First, we will assign an arbitrary score to each possible response on the scale. There is no single mathematically sound way to do this, but the simplest, and most common, way is to score them like this:

  • Strongly disagree (1 point)
  • Disagree (2 points)
  • Neither agree nor disagree (3 points)
  • Agree (4 points)
  • Strongly agree (5 points)

From a statistical standpoint, scoring like this is dangerous. Mathematically, it creates the expectation that one Agree (4 points) is equal to four Strongly Disagrees (also 4 points). This is clearly not true, and it is here that statisticians and mathematicians will often object to this arbitrary assignment of points. Luckily, as we will soon see, a statistical result called the Central Limit Theorem allows us to group items even when this occurs. A warning here: this point assignment should only be used for groups of statements. One should be very cautious (we never recommend it) about analyzing single statements with this type of points system.

The second part of the scoring is summing the scores. The important concept here is that the sum is taken across what we statisticians call the "experimental unit." We can do this simply by scoring the survey just as you would score a multiple-choice test: find the total score for each participant. In our example above, a participant's score would be a maximum of 25 if they answered Strongly Agree to every question, and a minimum of 5 if they answered Strongly Disagree to every question. If you are setting up a data table (like a spreadsheet), you would do so just as if it were a multiple-choice test. Each row is a single participant. The first column is the name or ID, the second column is the experimental group (A or B in our example), and the third column is the total score (from 5 to 25).
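
To make the scoring concrete, here is a minimal sketch in Python (using pandas) of how the point assignment and per-participant summing might look. The column names, participant IDs, and responses are hypothetical, invented only for illustration.

    import pandas as pd

    # Hypothetical raw survey export: one row per participant, one column
    # per Likert statement, responses recorded as the text labels.
    raw = pd.DataFrame({
        "participant_id": ["P01", "P02", "P03"],
        "group":          ["A",   "B",   "A"],
        "q1_like_dogs":       ["Agree", "Disagree", "Strongly agree"],
        "q2_consider_owning": ["Strongly agree", "Neither agree nor disagree", "Agree"],
        "q3_companions":      ["Agree", "Disagree", "Strongly agree"],
        "q4_best_friend":     ["Agree", "Strongly disagree", "Agree"],
        "q5_comfortable":     ["Strongly agree", "Disagree", "Agree"],
    })

    # The standard 1-5 point assignment described above.
    points = {
        "Strongly disagree": 1,
        "Disagree": 2,
        "Neither agree nor disagree": 3,
        "Agree": 4,
        "Strongly agree": 5,
    }

    item_cols = [c for c in raw.columns if c.startswith("q")]
    scored = raw[item_cols].apply(lambda col: col.map(points))  # labels to points
    raw["total_score"] = scored.sum(axis=1)                     # 5 to 25 per participant

    print(raw[["participant_id", "group", "total_score"]])

The resulting table, with one row per participant and a single total score column, is exactly the layout described above and is ready for the analysis step.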

4. Analyze the Data

As is usual with most statistical analyses, once the data table is set up, the analysis is often straightforward.

In our dog example, we are interested in the difference between group A and group B in their total scores. One possibility is to use so-called non-parametric tests. These are tests used when the sample size is small and the underlying population is non-normal. For example, we could use the median of the scores to indicate the central tendency, and the interquartile range to measure the spread. To compare the two groups, we could use the two-sample Wilcoxon test, also called the Wilcoxon Rank Sum Test, which can show us whether the difference in scores between the two groups is statistically significant.
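
As a rough sketch, this comparison could be run in Python with SciPy's implementation of the Wilcoxon rank-sum test. The scores below are made up purely for illustration.

    from scipy import stats

    # Hypothetical total scores (5 to 25) for participants who saw page A or page B
    group_a = [18, 22, 15, 20, 24, 19, 17, 21]
    group_b = [14, 16, 13, 19, 15, 12, 18, 16]

    stat, p_value = stats.ranksums(group_a, group_b)  # Wilcoxon rank-sum test
    print(f"Wilcoxon rank-sum statistic = {stat:.2f}, p = {p_value:.3f}")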

However, even better, since the summed scores can be considered approximately Normal, we are free to use the mean and standard deviation as descriptive statistics. In addition, we are free to use the t-test, ANOVA, F-test, or linear regression where appropriate. In our dog example, we can easily use a two-sample t-test to compare the groups, with the total score for each participant being the raw data.
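
Under that assumption, a two-sample t-test on the same hypothetical scores is just as simple:

    import numpy as np
    from scipy import stats

    group_a = [18, 22, 15, 20, 24, 19, 17, 21]  # hypothetical total scores, page A
    group_b = [14, 16, 13, 19, 15, 12, 18, 16]  # hypothetical total scores, page B

    # Descriptive statistics: mean and standard deviation per group
    print("Group A: mean", np.mean(group_a), "SD", round(np.std(group_a, ddof=1), 2))
    print("Group B: mean", np.mean(group_b), "SD", round(np.std(group_b, ddof=1), 2))

    t_stat, p_value = stats.ttest_ind(group_a, group_b)  # two-sample t-test
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")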

How Does This Even Work?

How does it work? How can it be true that each of the Likert scale statements is clearly not Normally distributed, but the final total score is? This is the Central Limit Theorem. The theorem states that when independent random variables are summed, the distribution of their sum tends towards the Normal distribution, even when the original variables themselves are not Normally distributed. For review, the Normal distribution is the classic bell-curve shaped distribution. And, for statisticians, the Normal distribution is a type of panacea. Variables that follow the Normal distribution can be analyzed using so-called parametric methods. This means that the variables can be summarized with simple, intuitive descriptive statistics such as the mean and standard deviation. In addition, fantastic parametric tests and models (the t-test, F-test, ANOVA, and linear regression, for example) are now all within our reach.

Multiple-choice tests are a great example of the Central Limit Theorem in action. Most multiple-choice tests are scored with each question being either right or wrong: usually, you get one point for every question that is correct. Clearly, the score for a single question is not Normally distributed. There are only two possible scores, 0 or 1. And, likely, if the test is measuring what the participant should know, the chance of scoring 1 may be very different from the chance of scoring 0. When scoring a multiple-choice test, the scores of all the questions are summed. For example, in a test of 50 questions, the lowest possible score would be 0 and the highest 50. As predicted by the Central Limit Theorem, the distribution of the final scores is likely to be approximately Normal. Thus, the total scores of the students can be analyzed using standard Normal-based analyses such as the mean, standard deviation, t-tests, and the like.
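
You can see this effect for yourself with a short simulation. The sketch below uses an arbitrary, skewed response distribution (chosen only for illustration) and sums five non-normal Likert item scores per simulated participant; a histogram of the totals comes out roughly bell-shaped.

    import numpy as np

    rng = np.random.default_rng(2022)

    # 10,000 simulated participants, each answering 5 Likert items.
    # The single-item distribution is discrete and skewed -- clearly not Normal.
    probs = [0.05, 0.10, 0.15, 0.30, 0.40]  # P(1 point), P(2), ..., P(5)
    responses = rng.choice([1, 2, 3, 4, 5], size=(10_000, 5), p=probs)

    totals = responses.sum(axis=1)  # each participant's total score, 5 to 25
    print("Mean of totals:", totals.mean())
    print("SD of totals:", totals.std(ddof=1))
    # Plotting a histogram of `totals` shows a roughly bell-shaped curve,
    # even though each individual item is skewed and takes only five values.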

The Central Limit Theorem and Likert Scale Analysis

To Likert or Not to Likert

What's the final verdict? If given the choice at the outset of a study, we always recommend forgoing the Likert scale and using a linear response scale instead. However, there are times when the Likert scale cannot be avoided. In that case, we recommend using the Likert scale as it was originally described by Likert himself: in sets of statements forming a cohesive group.

Unfortunately, Likert scale analysis is always controversial. If you are writing a paper using a Likert scale, be prepared to defend your analysis techniques to the reviewers. Doubtless, at least one of the reviewers will not have seen this type of analysis and will object to your methods. The best way to minimize the controversy is to use the technique described here: use a single, unified Likert scale; group the statements so that they all address a single question and point in the same direction; score appropriately so that each observation is the sum of the participant's ratings; and analyze the summed scores as Normally distributed.

If you are looking for a quick summary of our best practices for Likert scale analysis, get the Likert Analysis Infographic here. And, please share this blog post on social media if you found it helpful.

By: Jeffrey Franc
