Simulating LLM Evaluation Datasets Using Psychometric Models
TL;DR: Psychometric models such as IRT can be used to simulate and calibrate LLM benchmarks. They estimate model ability, as well as item difficulty, discrimination, and guessing. This supports graded scoring and multi-sample evaluation, helping build stable, interpretable, and uncertainty-aware datasets that guide better benchmark design.
Background
In psychometric modeling, we often need to turn scores into probabilities. We feed in a score (e.g., ability minus difficulty), and out comes a sensible probability of success. Several common smooth squashing functions do that job and show up in IRT.
All rise smoothly, center at one half when the input is 0, and approach 0 or 1 in the tails. IRT practitioners often use the logistic because it has a simple closed form and nice derivative properties. It's also called the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
When $z$ is large and positive, this approaches 1. When $z$ is large and negative, it approaches 0. At $z = 0$, we get exactly 0.5. The curve is smooth, symmetric, and has a nice S-shape that captures how probabilities transition from impossible to certain.
If we plot these functions side by side, they overlap so closely we'd need to zoom in to see the difference. The sigmoid climbs smoothly from near-zero to near-one as $z$ ranges from about $-6$ to $6$. The steepest part of the climb happens right at $z = 0$, where small changes in the input produce the biggest changes in probability. This is exactly what we want when modeling test responses: items near a person's ability level are the ones where their probability of success is most sensitive to small shifts in skill.
Use case in IRT modeling: ability and difficulty live on an unbounded scale (any real number), but probabilities must stay between 0 and 1. The sigmoid family bridges that gap. When we write something like $P(\text{correct}) = \sigma(\theta - b)$, we're saying: compute the distance between ability and difficulty, then smoothly convert that distance into a probability. Positive distance (ability exceeds difficulty) pushes the probability above 0.5. Negative distance pulls it below. The sigmoid handles the translation automatically, ensuring our model always outputs valid probabilities no matter what parameter values we estimate.
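Here is a minimal sketch of that squashing step in Python; the `sigmoid` helper and the ability/difficulty values are illustrative, not taken from any particular test:

```python
import math

def sigmoid(z):
    """Logistic squashing: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Distance between ability and difficulty -> probability of success.
theta, b = 0.8, -0.5           # illustrative ability and difficulty
print(sigmoid(theta - b))      # ~0.79: ability exceeds difficulty, so P > 0.5
print(sigmoid(0.0))            # exactly 0.5 when ability matches difficulty
```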
How to Model a Dataset
Let's assume a student, Kuzco, is sitting for a standardized test. The test has 30 questions, each targeting a different concept. Some questions are straightforward algebra, while others involve tricky geometry proofs. Kuzco has a certain level of mathematical ability (we don't know it yet; that's what we're trying to measure), and each question has its own level of difficulty (which may be known or unknown). What we observe is simple: Kuzco answers each question correctly or incorrectly. So we have observable binary outcomes (0/1), and we want to infer hidden quantities: how strong Kuzco really is, and how hard each question really is. Once we know those, we can predict how Kuzco would perform on new questions, compare Kuzco fairly to other students, and figure out which questions are actually measuring what we think they're measuring.



Now let's formalize this step by step by building IRT models.
Modeling Difficulty and Ability
The first question we may ask is: does Kuzco's ability exceed a question's difficulty? If Kuzco's ability $\theta$ is higher than question $i$'s difficulty $b_i$, we expect Kuzco to probably get it right. If $\theta$ is lower than $b_i$, we expect Kuzco to probably get it wrong.
The Rasch model (1PL) captures this with elegant simplicity. The probability that Kuzco answers question $i$ correctly is:

$$P(X_i = 1 \mid \theta) = \sigma(\theta - b_i) = \frac{1}{1 + e^{-(\theta - b_i)}}$$
Let's say Kuzco has ability $\theta = 0.8$ (on a standardized scale where 0 is average). Question 5 is moderately easy with $b_5 = -0.5$. The difference is $\theta - b_5 = 1.3$, so $P = \sigma(1.3) \approx 0.79$. Kuzco has a good chance of getting this one right.
Question 12, however, is quite hard with $b_{12} = 1.5$. Now the difference is $-0.7$, giving $P = \sigma(-0.7) \approx 0.33$. More likely than not, Kuzco will miss this question.
Notice what's beautiful here: we're not just counting raw scores. We're building a model that explains the pattern of which questions get answered correctly. Two students with the same raw score might have very different abilities if they got different subsets of questions right. Similarly, two questions with the same pass rate might differ in difficulty if they're passed by different groups of students.
The mathematical form has an intuitive interpretation. The quantity $\theta - b_i$ is the "distance" between ability and difficulty. When that distance is 0, ability exactly matches difficulty, and we get $P = 0.5$, a coin flip. Every unit increase in $\theta - b_i$ shifts the probability upward along the sigmoid curve. The model assumes all questions discriminate equally well (we'll relax this shortly), and the only thing that differs between questions is where they sit on the difficulty scale.
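A small Python sketch of the 1PL computation, reusing the illustrative numbers above (`p_1pl` is a helper we define here, not a library function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_1pl(theta, b):
    """Rasch (1PL): success probability depends only on ability minus difficulty."""
    return sigmoid(theta - b)

theta = 0.8                    # Kuzco's ability
print(p_1pl(theta, b=-0.5))    # question 5 (easy):  ~0.79
print(p_1pl(theta, b=1.5))     # question 12 (hard): ~0.33
```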
Modeling Discrimination
Not all questions are created equal. Some questions are gentle slopes where a small increase in ability barely changes our odds of success. Others are sharp cliffs: students just below the threshold almost always fail, while students just above almost always pass.
This is where the discrimination parameter $a_i$ comes in. It controls how steeply the probability curve rises as ability increases. A high discrimination (say, $a_i = 2$) means the question is good at separating students who know the material from those who don't. A low discrimination (say, $a_i = 0.5$) means the question is less informative, perhaps because it's ambiguously worded or tests multiple unrelated skills at once.
The 2-parameter logistic model (2PL) incorporates discrimination:

$$P(X_i = 1 \mid \theta) = \sigma\big(a_i(\theta - b_i)\big)$$
Back to Kuzco. Question 8 has difficulty $b_8 = 0.5$ and discrimination $a_8 = 2.0$. With Kuzco's ability $\theta = 0.8$, we compute $a_8(\theta - b_8) = 2.0 \times 0.3 = 0.6$, giving $P = \sigma(0.6) \approx 0.65$. The high discrimination amplifies the small gap between Kuzco's ability and the question's difficulty.
Now consider question 15 with the same difficulty $b_{15} = 0.5$ but lower discrimination $a_{15} = 0.5$. The gap is still $0.3$, but now $a_{15}(\theta - b_{15}) = 0.15$, yielding $P = \sigma(0.15) \approx 0.54$. The curve is flatter, so Kuzco's slight advantage barely moves the needle.
Discrimination tells us how informative a question is. High-discrimination items give us precise information about ability near their difficulty threshold. Low-discrimination items are noisier: they don't reliably distinguish between students at different ability levels. When we design a test, we want a mix: some high-discrimination items to pinpoint ability, and perhaps some lower-discrimination items to cover a wider range. But items with very low discrimination (say, $a_i < 0.3$) might be candidates for revision or removal, since they're not contributing much signal.
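A quick sketch of the same comparison in code, with the illustrative parameters for questions 8 and 15:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_2pl(theta, a, b):
    """2PL: discrimination a controls how steeply probability responds to ability."""
    return sigmoid(a * (theta - b))

theta = 0.8
print(p_2pl(theta, a=2.0, b=0.5))   # question 8, high discrimination: ~0.65
print(p_2pl(theta, a=0.5, b=0.5))   # question 15, low discrimination: ~0.54
```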
Modeling Guessing
On a multiple-choice test with four options, even someone who has no idea can guess correctly 25 percent of the time. The models we've built so far ignore this. They assume that as ability drops far below difficulty, the probability of success approaches zero. But that's not realistic when lucky guesses are possible.
The 3-parameter logistic model (3PL) adds a guessing parameter $c_i$. This is the probability of getting the question right even with effectively zero ability:

$$P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\sigma\big(a_i(\theta - b_i)\big)$$
With probability $c_i$, we guess correctly regardless of ability. With probability $1 - c_i$, the ability actually matters, and we use the 2PL curve to determine the odds.
Let's say question 3 is a four-option multiple choice, so we might set $c_3 = 0.25$. It has difficulty $b_3 = 0$ and discrimination $a_3 = 1.5$. For a very weak student with $\theta = -3$, we compute $a_3(\theta - b_3) = -4.5$. Without guessing, $\sigma(-4.5) \approx 0.01$, nearly zero. But with the guessing floor: $P = 0.25 + 0.75 \times 0.01 \approx 0.26$. The student still has that baseline 25 percent chance from random guessing.
For Kuzco with $\theta = 0.8$, we get $a_3(\theta - b_3) = 1.2$, so $\sigma(1.2) \approx 0.77$. Then $P = 0.25 + 0.75 \times 0.77 \approx 0.83$. The guessing parameter lifts the entire curve, but it matters most at the low end where ability-based success is unlikely.
The guessing parameter is most relevant for multiple-choice formats. For open-ended or constructed-response questions, we'd typically set $c_i = 0$ since there's no way to accidentally produce the correct answer. In practice, $c_i$ is often fixed based on the number of options (like 0.2 for five-choice, 0.25 for four-choice) rather than estimated from data, because estimating it freely can lead to unstable fits. Still, the 3PL gives us a more realistic model when guessing is genuinely part of the testing environment.
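In code, the guessing floor is one extra term; the values below mirror the four-option example above and are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_3pl(theta, a, b, c):
    """3PL: guaranteed floor c, with the remaining 1 - c governed by the 2PL curve."""
    return c + (1.0 - c) * sigmoid(a * (theta - b))

# Question 3: four options, so c = 0.25.
print(p_3pl(theta=-3.0, a=1.5, b=0.0, c=0.25))  # very weak student: ~0.26
print(p_3pl(theta=0.8,  a=1.5, b=0.0, c=0.25))  # Kuzco:             ~0.83
```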
General Framework: 4PL IRT
We've added parameters for difficulty, discrimination, and guessing. There's one more twist: what if even a very strong student doesn't have a 100 percent chance of success? Maybe the question is slightly ambiguous, or there's a careless error that trips up even experts. The 4-parameter logistic model introduces an upper asymptote $d_i$, setting a ceiling below 1:

$$P(X_i = 1 \mid \theta) = c_i + (d_i - c_i)\,\sigma\big(a_i(\theta - b_i)\big)$$
Now the curve ranges from $c_i$ (the guessing floor) to $d_i$ (the maximum attainable probability). When $c_i = 0$ and $d_i = 1$, we recover the 2PL. When $d_i = 1$ and $c_i > 0$, we get the 3PL. The full 4PL lets both ends of the curve depart from the ideal 0-to-1 range.
Let's say question 18 is tricky even for top students. We might have $b_{18} = 0$, $a_{18} = 2.0$, $c_{18} = 0.2$ (five-option multiple choice), and $d_{18} = 0.9$ (even experts miss it 10 percent of the time). For Kuzco at $\theta = 0.8$, we compute $a_{18}(\theta - b_{18}) = 1.6$, so $\sigma(1.6) \approx 0.83$. Then $P = 0.2 + (0.9 - 0.2) \times 0.83 \approx 0.78$. Kuzco's odds are decent but capped below what the 2PL would predict.
The general form unifies all the standard models. Setting different parameters to their default values gives us:
- 1PL (Rasch): $a_i = 1$, $c_i = 0$, $d_i = 1$ for all items
- 2PL: $c_i = 0$, $d_i = 1$ for all items, but $a_i$ varies
- 3PL: $d_i = 1$ for all items, but $a_i$ and $c_i$ vary
- 4PL: All four parameters can vary
The choice depends on our data and goals. More parameters give more flexibility but require more data to estimate reliably. In practice, the 2PL is a workhorse for many applications, the 3PL is standard when guessing matters, and the 4PL is reserved for cases where we have strong theoretical reasons to expect an upper asymptote.
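One convenient way to express this unification in code is a single 4PL function whose defaults recover the simpler models; the helper below is a sketch under that convention, using the same illustrative parameters as before:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_4pl(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """General 4PL; the default arguments give 1PL/2PL/3PL as special cases."""
    return c + (d - c) * sigmoid(a * (theta - b))

theta = 0.8
print(p_4pl(theta, b=1.5))                         # 1PL: a=1, c=0, d=1
print(p_4pl(theta, a=2.0, b=0.5))                  # 2PL: c=0, d=1
print(p_4pl(theta, a=1.5, b=0.0, c=0.25))          # 3PL: d=1
print(p_4pl(theta, a=2.0, b=0.0, c=0.2, d=0.9))    # 4PL, question 18: ~0.78
```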
From Students to Models
It happens all too often: we run two models on the same dataset, the ranking flips with a different random seed, and our confidence fades with each rerun. Psychometrics offers a useful perspective here: both Kuzco and his LLM counterpart, Llama Kuzco, can be modeled the same way, with their observed answers treated as expressions of an underlying ability. In Item Response Theory, we treat models like test takers with a latent strength $\theta$, and questions like instruments calibrated along a difficulty scale. The result isn't just another score; it's a map linking ability to success, item by item.
Suppose we have $M$ models, let's call them $j = 1, \dots, M$, and a dataset of $N$ questions. For binary grading, we record $X_{ij} \in \{0, 1\}$: model $j$ solves question $i$ or not. If we use multiple attempts per item (say, $K$ non-deterministic samples), we can keep them all or compress to counts. What matters is that beneath these outcomes lives a smooth curve, the probability of success as a function of ability and item parameters (Lord, 1980). That curve is our first character in the story.
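Concretely, the raw material is just a binary array. Here is a sketch of how it might be shaped, assuming we keep $K$ samples per (model, item) pair; the random placeholder outcomes stand in for real evaluation results:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, K = 4, 30, 8                # models, items, samples per item (illustrative)
# One binary outcome per (model, item, sample); random values stand in for real evals.
samples = rng.integers(0, 2, size=(M, N, K))

# Keep all samples, or compress to success counts out of K attempts per item.
successes = samples.sum(axis=2)   # shape (M, N), values in 0..K
X_binary = (successes > K / 2).astype(int)   # one possible binarization (majority vote)
```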
Now let's imagine lining up our models on a single axis, low on the left, high on the right. Each question stakes a flag on that line: "I become solvable around here." Easy questions plant their flags on the left, hard ones on the right. In the simplest model (1PL/Rasch), all items discriminate equally; they're just positioned at different difficulties. A more flexible model, the 2PL, lets items vary in how sharply they distinguish strong from weak models (the discrimination parameter $a_i$). If there's multiple-choice guessing, the curve never drops to zero; that's the $c_i$ term in the 3PL. Fit the curves, and we get more than a leaderboard: we get a map of where each model lives, which questions truly distinguish them, and how the dataset behaves across the whole spectrum of abilities.
In practice, we start with the binary families. The 1-parameter Rasch model pins every item to a difficulty $b_i$ and gives each model an ability $\theta_j$ (Rasch, 1980), with success probability:

$$P(X_{ij} = 1) = \sigma(\theta_j - b_i)$$
The 2-parameter model sharpens the picture with a discrimination slope $a_i$ (Birnbaum, 1968; Lord, 1980):

$$P(X_{ij} = 1) = \sigma\big(a_i(\theta_j - b_i)\big)$$
When options allow lucky breaks, the 3-parameter version adds a guessing floor $c_i$ (Lord, 1980):

$$P(X_{ij} = 1) = c_i + (1 - c_i)\,\sigma\big(a_i(\theta_j - b_i)\big)$$
We fit these by maximizing a (possibly binomial) likelihood or placing priors and doing Bayesian inference (Lord, 1980; van der Linden & Hambleton, 1997). Because $\theta$ is only defined up to an affine transform, we fix the scale (for example, mean zero and unit variance). The reward for this calibration is clarity: $b_i$ ranks difficulty across items, $a_i$ tells us which questions meaningfully separate close models, and $\theta_j$ situates each model on a common ruler so comparisons stop wobbling with decoding noise.
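A minimal sketch of such a fit, assuming a complete binary matrix `X` of shape models x items and using joint maximum likelihood with SciPy; `fit_2pl` is illustrative, not a production estimator (real pipelines typically add priors, marginal or Bayesian estimation, and uncertainty reporting):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_2pl(X, max_iter=500):
    """Joint ML fit of a 2PL model to a binary matrix X (models x items). Sketch only."""
    M, N = X.shape

    def unpack(params):
        theta = params[:M]
        a = np.exp(params[M:M + N])   # log-parameterized so discrimination stays positive
        b = params[M + N:]
        return theta, a, b

    def neg_log_lik(params):
        theta, a, b = unpack(params)
        p = sigmoid(a[None, :] * (theta[:, None] - b[None, :]))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

    res = minimize(neg_log_lik, np.zeros(M + 2 * N),
                   method="L-BFGS-B", options={"maxiter": max_iter})
    theta, a, b = unpack(res.x)

    # Fix the affine indeterminacy: standardize theta and rescale a, b to compensate,
    # which leaves a * (theta - b), and hence the likelihood, unchanged.
    mu, sd = theta.mean(), theta.std() + 1e-9
    return (theta - mu) / sd, a * sd, (b - mu) / sd
```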
Rubric-based grading extends the same logic. In Samejima's Graded Response Model, each item has ordered thresholds $b_{i1} \le b_{i2} \le \dots \le b_{i,K-1}$ and a common slope $a_i$ (Samejima, 1969). The cumulative chance of earning at least category $k$ is:

$$P(Y_{ij} \ge k) = \sigma\big(a_i(\theta_j - b_{ik})\big)$$
and the category probability comes from adjacent differences, $P(Y_{ij} = k) = P(Y_{ij} \ge k) - P(Y_{ij} \ge k + 1)$. The expected score $E[Y_{ij}] = \sum_k k \, P(Y_{ij} = k)$, normalized by the maximum category value, plays the role of "probability correct."
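A short sketch of the GRM score for a single rubric item, with hypothetical thresholds and slope:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grm_expected_score(theta, a, thresholds):
    """Expected category (0..K-1) for one GRM item at ability theta."""
    # Cumulative probabilities P(Y >= k) for k = 1..K-1, padded with P(Y >= 0) = 1
    # and P(Y >= K) = 0; adjacent differences give the category probabilities.
    cum = sigmoid(a * (theta - np.asarray(thresholds, dtype=float)))
    cum = np.concatenate(([1.0], cum, [0.0]))
    probs = cum[:-1] - cum[1:]
    return probs @ np.arange(len(probs))

# Hypothetical rubric item with categories 0..3 and ordered thresholds.
score = grm_expected_score(theta=0.8, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print(score / 3)   # normalized expected score, the analogue of "probability correct"
```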
Information curves add the second voice to our narrative: reliability. For the 2PL, the item Fisher information at ability $\theta$ is (van der Linden & Hambleton, 1997):

$$I_i(\theta) = a_i^2\, p_i(\theta)\big(1 - p_i(\theta)\big)$$
with $p_i(\theta) = \sigma\big(a_i(\theta - b_i)\big)$. Summing across items gives the test information function, showing where our dataset truly measures ability well, and where it goes quiet. This is how we design stable "Easy/Medium/Hard" splits without guessing.
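A sketch of the test information function for a hypothetical 2PL item bank, summing the item informations over an ability grid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def test_information(theta_grid, a, b):
    """Sum of 2PL item informations a_i^2 * p_i * (1 - p_i) across items."""
    theta_grid = np.asarray(theta_grid, dtype=float)[:, None]   # shape (T, 1)
    p = sigmoid(a[None, :] * (theta_grid - b[None, :]))         # shape (T, n_items)
    return np.sum(a[None, :] ** 2 * p * (1 - p), axis=1)

# Hypothetical item bank: where does this dataset measure ability well?
a = np.array([1.0, 1.5, 2.0, 0.8])
b = np.array([-1.0, 0.0, 0.5, 1.5])
grid = np.linspace(-3, 3, 7)
print(dict(zip(grid.round(1), test_information(grid, a, b).round(2))))
```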
An end-to-end pass looks like this: assemble responses from models over items, choose a family (1PL/2PL/3PL for binary, GRM for rubrics), fit with identifiability constraints, then audit model-item fit and report calibrated parameters with uncertainties (Chen et al., 2021; van der Linden & Hambleton, 1997). The upshot is a dataset that doesn't just score models, it locates them and tells us what to add next.
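To make the end-to-end pass concrete, here is a sketch that simulates a synthetic benchmark from known 2PL parameters; the abilities recovered by a fit (for example, the `fit_2pl` sketch above) can then be compared against `theta_true` as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ground-truth parameters for a synthetic benchmark: 6 models, 40 items (illustrative).
M, N = 6, 40
theta_true = rng.normal(0.0, 1.0, size=M)
a_true = rng.lognormal(mean=0.0, sigma=0.3, size=N)
b_true = rng.normal(0.0, 1.0, size=N)

# Simulate binary outcomes from the 2PL curves.
p = sigmoid(a_true[None, :] * (theta_true[:, None] - b_true[None, :]))
X = rng.binomial(1, p)
print(X.shape, X.mean())   # response matrix and overall solve rate
```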
Simulating IRT Curves
Below we can adjust item parameters (difficulty $b$; discrimination $a$ for 2PL/3PL/GRM; guessing $c$ for 3PL), switch among 1PL/2PL/3PL and GRM, and paste a list of model abilities to see expected performance. Watch how steeper discrimination slices the pack more sharply, how $c$ lifts the floor, and how GRM turns categories into a smooth expected score.
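For readers without the interactive widget, a static Python stand-in can reproduce the same exercise: pick item parameters, choose a family, and map a pasted list of abilities to expected performance (the `expected_performance` helper is defined here purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_performance(abilities, a=1.0, b=0.0, c=0.0, model="3PL"):
    """Expected success probability for each ability under 1PL/2PL/3PL."""
    abilities = np.asarray(abilities, dtype=float)
    if model == "1PL":
        return sigmoid(abilities - b)
    if model == "2PL":
        return sigmoid(a * (abilities - b))
    if model == "3PL":
        return c + (1 - c) * sigmoid(a * (abilities - b))
    raise ValueError(f"unknown model: {model}")

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # e.g., fitted abilities for five models
print(expected_performance(abilities, a=2.0, b=0.5, c=0.25, model="3PL").round(2))
```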