Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

A suspicious result

Give a model only the answer choices, with no question stem, and it still picks the correct answer far above chance on several MCQA datasets. Either the benchmarks leak signal, or the model infers something from the wording of the choices alone.
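To make the setup concrete, here is a minimal sketch of a choices-only evaluation compared against random guessing. The names `MCQItem`, `choices_only_prompt`, `full_prompt`, `accuracy`, and `chance_accuracy` are hypothetical, the prompt wording is an assumption rather than the paper's exact template, and `predict` stands in for whatever model call returns a single answer letter.

```python
from dataclasses import dataclass
from typing import Callable, List

LETTERS = "ABCDEFGH"  # supports up to eight choices per item


@dataclass
class MCQItem:
    question: str
    choices: List[str]
    answer_index: int  # position of the gold answer in `choices`


def choices_only_prompt(item: MCQItem) -> str:
    """Format only the lettered choices; the question stem is deliberately omitted."""
    lines = [f"{LETTERS[i]}. {c}" for i, c in enumerate(item.choices)]
    return "\n".join(lines) + "\nAnswer:"


def full_prompt(item: MCQItem) -> str:
    """Standard prompt with the question stem, for comparison."""
    return f"Question: {item.question}\n" + choices_only_prompt(item)


def accuracy(
    items: List[MCQItem],
    predict: Callable[[str], str],          # assumed: prompt in, answer letter out
    build_prompt: Callable[[MCQItem], str],
) -> float:
    """Fraction of items where the predicted letter matches the gold letter."""
    correct = sum(
        predict(build_prompt(it)) == LETTERS[it.answer_index] for it in items
    )
    return correct / len(items)


def chance_accuracy(items: List[MCQItem]) -> float:
    """Expected accuracy of uniform random guessing over each item's choices."""
    return sum(1 / len(it.choices) for it in items) / len(items)
```

Calling `accuracy(items, predict, choices_only_prompt)` and comparing the result with `chance_accuracy(items)` gives the gap the section describes; running the same items through `full_prompt` shows how much the question stem adds on top.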

Three explanations

The authors test three explanations: memorization, choice priors, and question abduction (inferring the missing question from the choices). No single factor explains everything. Group dynamics among the choices matter: the model appears to use how the options relate to one another, not only each option in isolation. Sometimes the model reconstructs a question close to the original.
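Question abduction can be pictured as a two-step procedure: first ask the model to write the question the choices most plausibly answer, then ask it to answer that inferred question. The sketch below is an illustration under stated assumptions, not the authors' protocol: `generate` is a hypothetical text-in, text-out model call, and the prompt wording and function names are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

LETTERS = "ABCDEFGH"  # supports up to eight choices


def format_choices(choices: List[str]) -> str:
    """Render choices as lettered lines, e.g. 'A. cat'."""
    return "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))


def abduction_prompt(choices: List[str]) -> str:
    """Step 1 prompt: ask for the missing question given only the choices."""
    return (
        "Here are the answer choices to a multiple-choice question.\n"
        f"{format_choices(choices)}\n"
        "Write the question these choices most plausibly answer.\n"
        "Question:"
    )


def answer_prompt(question: str, choices: List[str]) -> str:
    """Step 2 prompt: answer the inferred question over the same choices."""
    return f"Question: {question}\n{format_choices(choices)}\nAnswer:"


def abduce_then_answer(
    choices: List[str],
    generate: Callable[[str], str],  # assumed: prompt in, completion out
) -> Tuple[str, Optional[str]]:
    """Return the inferred question and the chosen letter (None if unparseable)."""
    inferred = generate(abduction_prompt(choices)).strip()
    reply = generate(answer_prompt(inferred, choices)).strip()
    letter = reply[:1].upper()
    valid = LETTERS[: len(choices)]
    # Guard against an empty reply: "" is a substring of every string.
    return inferred, (letter if letter and letter in valid else None)
```

Comparing the inferred question with the original stem, and the chosen letter with the gold answer, gives a rough way to check whether a correct choices-only answer went through a plausible reconstructed question.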

For evaluators

MCQA is convenient but fragile. The paper argues for stronger baselines, harder datasets, and skepticism whenever accuracy jumps and no one can explain why. It is a useful warning as LLMs saturate standard tests.