Errors in the MMLU: The Deep Learning Benchmark is Wrong Surprisingly Often

Daniel Erenrich
Aug 23, 2023

The Massive Multitask Language Understanding (MMLU) dataset is one of the most commonly used benchmarks for assessing large language models (LLMs). HuggingFace’s popular “Open LLM Leaderboard” uses the MMLU as one of the four benchmarks it averages to rank the top LLMs.

[Screenshot: the HuggingFace Open LLM Leaderboard, which uses the MMLU to rank models]

The MMLU includes multiple-choice questions on 57 subjects ranging from “abstract algebra” to “world religions”. The questions vary, but they generally look like this:

As of 2017, how many of the world’s 1-year-old children today have been vaccinated against some disease? (A) 80% (B) 60% (C) 40% (D) 20%
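If you want to poke at the questions yourself, they are easy to pull from the HuggingFace Hub. A minimal sketch, assuming the cais/mmlu mirror (other mirrors use slightly different field names):

```python
# Minimal sketch: load the MMLU test set and print one question.
# Assumes the "cais/mmlu" mirror on the HuggingFace Hub; other mirrors
# store the fields under slightly different names.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
print(len(mmlu))  # ~14k test questions across 57 subjects

row = mmlu[0]
print(row["subject"])
print(row["question"])
for letter, choice in zip("ABCD", row["choices"]):
    print(f"({letter}) {choice}")
print("answer:", "ABCD"[row["answer"]])  # the answer is stored as an index
```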

There’s surprisingly little detail on how the MMLU was built or where it came from. The original paper says:

The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online

But which websites? Are those websites reliable? Are the questions good? Are the answers even right? Spoiler: There are bad questions in the corpus. For example:

The most widespread and important retrovirus is HIV-1; which of the following is true? (A) Infecting only gay people (B) Infecting only males (C) Infecting every country in the world (D) Infecting only females

The MMLU says the answer is A, but the answer is obviously meant to be C. This question appears to have been copied from Oxford University’s website, but the answer listed there is correct. Was the answer on that website corrected after the fact? Or did the students make an error when copying the question?

Analyzing The MMLU

Originally, my goal in looking into the MMLU was to get a sense of how smart modern LLMs are. I wanted to know whether I’m still “smarter than an LLM”. The original MMLU paper asserted that

Unspecialized humans from Amazon Mechanical Turk obtain 34.5% accuracy on this test. Meanwhile, expert-level performance can be far higher…

For a multiple-choice test with only four choices per question, 35% is pretty bad: barely better than the 25% you would get by guessing at random. Is that because the Mechanical Turk users weren’t putting in much effort, or are the questions really just that hard? The authors suggested that “expert-level” performance would be nearly 90%, but who is an expert in “abstract algebra”, “world religions”, and “econometrics”?

My first idea was to look at the questions where the AI was “most wrong”. Fortunately, HuggingFace recently released the dataset that underpins their leaderboard, so we can do exactly this analysis. Hat tip to Corey Morris, whose interesting analysis of the leaderboard data alerted me to its existence.

I decided to focus my analysis on the model “Platypus2-70B-instruct”, which at the time of this writing is the top-performing model on the Open LLM Leaderboard. Platypus 2 gets ~70.5% of the MMLU questions correct (for context, GPT-4 gets 86.4%).
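Concretely, the analysis amounts to pulling the per-model “details” dataset behind the leaderboard, turning the per-choice log-probabilities into a confidence via softmax, and sorting the wrong answers by confidence. Here is a minimal sketch; the repo name, config name, split, and row fields are my assumptions about the schema, so check the actual dataset on the Hub before running it:

```python
# Sketch: rank MMLU questions by how confidently the model got them wrong.
# NOTE: the repo name, config name, split, and row fields below are
# assumptions about the leaderboard's "details" schema; verify them
# against the actual dataset on the Hub.
import numpy as np
from datasets import load_dataset

details = load_dataset(
    "open-llm-leaderboard/details_garage-bAInd__Platypus2-70B-instruct",
    "harness_hendrycksTest_high_school_psychology_5",  # one MMLU task, 5-shot
    split="latest",
)

confidently_wrong = []
for row in details:
    logprobs = np.array(row["predictions"], dtype=float)  # assumed: one log-prob per choice
    probs = np.exp(logprobs - logprobs.max())
    probs /= probs.sum()  # softmax over the four choices
    pred, gold = int(probs.argmax()), int(row["gold"])  # "gold" is an assumed field
    if pred != gold:
        confidently_wrong.append(
            (float(probs[pred]), "ABCD"[pred], "ABCD"[gold], row["example"])
        )

# The most confident mistakes float to the top.
confidently_wrong.sort(reverse=True)
for conf, pred, gold, question in confidently_wrong[:5]:
    print(f"{conf:.2f} answered {pred}, marked {gold}: {question[:100]!r}")
```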

Questionable Questions

When you look at the questions where Platypus 2 was most confidently wrong, you start to see a pattern. These aren’t just cases where the model was tricked; many of them are simply errors in the MMLU!

Here is a sample of some of the questions in the MMLU that seem problematic; where it matters, I note which answer the MMLU says is correct.

Which of these sentences is written in the subjunctive?

A. I am not your man
B. I wish I were your man
C. Wherefore art your man?
D. Your man is where?

This seems to be a straightforward error: it is B that is written in the subjunctive (the MMLU appears to mark D, which is an interrogative, as correct). Platypus 2 answers B with high confidence but is marked wrong.

Some questions are just unanswerable.

As a result of an accident, Abdul lost sight in his right eye. To judge the distance of vehicles when he is driving, Abdul is able to rely on cues of

A. I only
B. II only
C. III only
D. I and II only

You didn’t read that wrong. The question never explains what I, II, or III are. It appears to have been improperly copied from crackap.com. Somehow Platypus 2 still gets the “right” answer with high confidence. Is this a sign that it has merely memorized the answers? I checked the second-best-ranked model, upstage/Llama-2-70b-instruct-v2, and it also somehow got the answer right (the third-best model on the leaderboard gets this question right too, so I don’t know what is happening).
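Checking a single suspicious question across models is easy with the same details datasets. Another sketch, where the repo slugs, config, and field names are again assumptions (and I’m guessing which MMLU task the Abdul question lives in):

```python
# Sketch: do several top leaderboard models all "get" the same broken question?
# Repo slugs, config, split, and fields are assumptions about the schema,
# and high_school_psychology is a guess at where this question lives.
from datasets import load_dataset

models = [
    "garage-bAInd__Platypus2-70B-instruct",
    "upstage__Llama-2-70b-instruct-v2",
]
for model in models:
    details = load_dataset(
        f"open-llm-leaderboard/details_{model}",
        "harness_hendrycksTest_high_school_psychology_5",
        split="latest",
    )
    for row in details:
        if "Abdul lost sight" in row["example"]:  # assumed field name
            print(model, row["metrics"])  # assumed: per-question scores live here
```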

The more philosophical questions seem especially important to interrogate, given the emphasis on aligning AI with human values. For example:

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020?

Scenario 1 | I knew my girlfriend had a fur allergy so I surprised her with a Persian cat.

Scenario 2 | I didn’t go to school because I wanted to watch television.

Scenario 1 is definitely immoral, but what about Scenario 2? Skipping school to watch TV seems unwise, but is it immoral? I don’t think so, but the MMLU says both are immoral. Platypus 2 and ChatGPT 3.5 both say Scenario 1 is immoral and Scenario 2 is not.

For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020?

Scenario 1 | My farm is struggling to make money, so I butchered 10 cows to sell their meat.

Scenario 2 | I sold my prescribed painkiller.

This question also seems problematic. I could make the argument that slaughtering cows is immoral and selling painkillers is not. The MMLU says Scenario 1 is moral and Scenario 2 is immoral. Platypus 2 says the exact opposite. Is it really wrong?

Some questions are flawed because their answers have become outdated. For example:

According to the USDA which food group should you eat the most servings of per day?

A. vegetables
B. dairy
C. meats
D. breads

The USDA used to say you should eat mostly bread, but that hasn’t been true for years. Questions whose answers change over time really need to include dates.

Some questions are probably scored correctly, but I think evaluating these models on this basis would be fairly controversial amongst the public.

Heterosexual fantasies about sexual activity never involve someone <blank>, and gay and lesbian fantasies never involve persons of <blank>

A. Both heterosexual and homosexual fantasies may involve persons of the same or other gender
B. of the other gender; of the same gender
C. of the same gender; of the same gender
D. of the same gender; of the other gender

Platypus 2 answered B with low confidence. Similarly, the answer to the following question is clear, but I think many people would take exception to it.

Which of the following factors does NOT play a role in gender relationships in a culture or society?

A. Biology
B. Advancements in technological development
C. Religion
D. Custom

Platypus 2 confidently reported B as the answer. The question is taken from crackap.com, which reports the answer as A.

Tricky Questions

But that’s not to say that every time Platypus 2 was confidently wrong, it was due to a problem in the MMLU; sometimes the model really does just get confused. For example:

Write the prime factorization of 48. Use exponents where possible.
A. 4 • 8
B. 6 • 8
C. 2³ • 3²
D. 2⁴ • 3

Platypus 2 is confident the answer is C, but C gives 2³ • 3² = 8 • 9 = 72; the correct answer is D, since 48 = 16 • 3 = 2⁴ • 3. BingChat gets this one right.

Hare claims that once two people agree on the meaning of the term “ought”:
A. they will share all the same moral opinions.
B. they will disagree morally only if they disagree about the facts of the case.
C. they will disagree morally only if they have different inclinations.
D. none of the above.

A lot of the questions are like the one above, where you need outside context to know who the question means by “Hare” (presumably the moral philosopher R. M. Hare). A lot of people have had that name.

Conclusions

So how big a deal is this? There are 14,079 questions in the MMLU, and I’ve identified a few dozen problematic ones. There are almost certainly more errors, but they are hard to check because some of the questions cover specialized subjects. I mean, can you quickly tell whether this question is right? (I believe it is.)

A subset H of a group (G, *) is a group if

A. a, b in H => a * b in H
B. a in H => a^-1 in H
C. a, b in H => a * b^-1 in H
D. H contains the identity element
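For what it’s worth, C is the standard “one-step subgroup test”. A quick sketch of why it works, assuming H is nonempty (which the question leaves implicit):

```latex
% One-step subgroup test: if H is nonempty and
% a, b \in H \implies a * b^{-1} \in H, then H is a subgroup of (G, *).
\begin{align*}
e      &= a * a^{-1} \in H        && \text{take } b = a \text{: identity} \\
a^{-1} &= e * a^{-1} \in H        && \text{take } a = e,\ b = a \text{: inverses} \\
a * b  &= a * (b^{-1})^{-1} \in H && \text{since } b^{-1} \in H \text{: closure}
\end{align*}
```

Option A alone isn’t enough: the positive integers are closed under addition inside (Z, +) but contain no inverses, so closure by itself does not make a subgroup.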

I think the conclusions we should take away here are:

  1. More energy should be spent making sure these benchmarks are valid.
  2. Small differences in MMLU scores should not be taken very seriously.
  3. Getting 100% on the MMLU is impossible without cheating, because some of the marked answers are simply wrong.

And this also makes me wonder whether other benchmark datasets have similar errors.
