Simulating LLM Evaluation Datasets Using Psychometric Models
TL;DR: Psychometric models such as IRT can be used to simulate and calibrate LLM benchmarks. They estimate model ability, as well as item difficulty, discrimination, and guessing. This supports graded scoring and multi-sample evaluation, helping build stable, interpretable, and uncertainty-aware datasets that guide better benchmark design.
Background
In psychometric modeling, we often need to turn scores into probabilities. We feed in a score (e.g., ability minus difficulty), and out comes a sensible probability of success. Several common smooth squashing functions do that job and show up in IRT.
All rise smoothly, center at one half when the input is 0, and approach 0 or 1 in the tails. IRT practitioners often use the logistic because it has a simple closed form and nice derivative properties. It's also called the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
When $z$ is large and positive, this approaches 1. When $z$ is large and negative, it approaches 0. At $z = 0$, we get exactly 0.5. The curve is smooth, symmetric, and has a nice S-shape that captures how probabilities transition from impossible to certain.
If we plot these functions side by side, they overlap so closely we'd need to zoom in to see the difference. The sigmoid climbs smoothly from near-zero to near-one as $z$ ranges from about $-6$ to $6$. The steepest part of the climb happens right at $z = 0$, where small changes in the input produce the biggest changes in probability. This is exactly what we want when modeling test responses: items near a person's ability level are the ones where their probability of success is most sensitive to small shifts in skill.
Use case in IRT modeling: ability and difficulty live on an unbounded scale (any real number), but probabilities must stay between 0 and 1. The sigmoid family bridges that gap. When we write something like $P(\text{correct}) = \sigma(\theta - b)$, we're saying: compute the distance between ability and difficulty, then smoothly convert that distance into a probability. Positive distance (ability exceeds difficulty) pushes the probability above 0.5. Negative distance pulls it below. The sigmoid handles the translation automatically, ensuring our model always outputs valid probabilities no matter what parameter values we estimate.
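Here is a minimal sketch of that squashing step in Python; the `sigmoid` helper and the ability/difficulty values are illustrative, not taken from any particular test:

```python
import math

def sigmoid(z):
    """Logistic squashing: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Distance between ability and difficulty -> probability of success.
theta, b = 0.8, -0.5           # illustrative ability and difficulty
print(sigmoid(theta - b))      # ~0.79: ability exceeds difficulty, so P > 0.5
print(sigmoid(0.0))            # exactly 0.5 when ability matches difficulty
```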
How to Model a Dataset
Let's assume a student, Kuzco, is sitting for a standardized test. The test has 30 questions, each targeting a different concept. Some questions are straightforward algebra, while others involve tricky geometry proofs. Kuzco has a certain level of mathematical ability (we don't know it yet; that's what we're trying to measure), and each question has its own level of difficulty (which may be known or unknown). What we observe is simple: Kuzco answers each question correctly or incorrectly. So we have observable binary outcomes (0/1), and we want to infer hidden quantities: how strong Kuzco really is, and how hard each question really is. Once we know those, we can predict how Kuzco would perform on new questions, compare Kuzco fairly to other students, and figure out which questions are actually measuring what we think they're measuring.



Now let's formalize this step by step by building IRT models.
Modeling Difficulty and Ability
The first question we may ask is: does Kuzco's ability exceed a question's difficulty? If Kuzco's ability $\theta$ is higher than question $i$'s difficulty $b_i$, we expect Kuzco to probably get it right. If $\theta$ is lower than $b_i$, we expect Kuzco to probably get it wrong.
The Rasch model (1PL) captures this with elegant simplicity. The probability that Kuzco answers question $i$ correctly is:

$$P(X_i = 1 \mid \theta) = \sigma(\theta - b_i) = \frac{1}{1 + e^{-(\theta - b_i)}}$$
Let's say Kuzco has ability $\theta = 0.8$ (on a standardized scale where 0 is average). Question 5 is moderately easy with $b_5 = -0.5$. The difference is $\theta - b_5 = 1.3$, so $P = \sigma(1.3) \approx 0.79$. Kuzco has a good chance of getting this one right.
Question 12, however, is quite hard with $b_{12} = 1.5$. Now the difference is $-0.7$, giving $P = \sigma(-0.7) \approx 0.33$. More likely than not, Kuzco will miss this question.
Notice what's beautiful here: we're not just counting raw scores. We're building a model that explains the pattern of which questions get answered correctly. Two students with the same raw score might have very different abilities if they got different subsets of questions right. Similarly, two questions with the same pass rate might differ in difficulty if they're passed by different groups of students.
The mathematical form has an intuitive interpretation. The quantity $\theta - b_i$ is the "distance" between ability and difficulty. When that distance is 0, ability exactly matches difficulty, and we get $P = 0.5$, a coin flip. Every unit increase in $\theta - b_i$ shifts the probability upward along the sigmoid curve. The model assumes all questions discriminate equally well (we'll relax this shortly), and the only thing that differs between questions is where they sit on the difficulty scale.
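A small Python sketch of the 1PL computation, reusing the illustrative numbers above (`p_1pl` is a helper we define here, not a library function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_1pl(theta, b):
    """Rasch (1PL): success probability depends only on ability minus difficulty."""
    return sigmoid(theta - b)

theta = 0.8                    # Kuzco's ability
print(p_1pl(theta, b=-0.5))    # question 5 (easy):  ~0.79
print(p_1pl(theta, b=1.5))     # question 12 (hard): ~0.33
```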
Modeling Discrimination
Not all questions are created equal. Some questions are gentle slopes where a small increase in ability barely changes our odds of success. Others are sharp cliffs: students just below the threshold almost always fail, while students just above almost always pass.
This is where the discrimination parameter $a_i$ comes in. It controls how steeply the probability curve rises as ability increases. A high discrimination (say, $a_i = 2$) means the question is good at separating students who know the material from those who don't. A low discrimination (say, $a_i = 0.5$) means the question is less informative, perhaps because it's ambiguously worded or tests multiple unrelated skills at once.
The 2-parameter logistic model (2PL) incorporates discrimination:

$$P(X_i = 1 \mid \theta) = \sigma\big(a_i(\theta - b_i)\big)$$
Back to Kuzco. Question 8 has difficulty $b_8 = 0.5$ and discrimination $a_8 = 2.0$. With Kuzco's ability $\theta = 0.8$, we compute $a_8(\theta - b_8) = 2.0 \times 0.3 = 0.6$, giving $P = \sigma(0.6) \approx 0.65$. The high discrimination amplifies the small gap between Kuzco's ability and the question's difficulty.
Now consider question 15 with the same difficulty $b_{15} = 0.5$ but lower discrimination $a_{15} = 0.5$. The gap is still $0.3$, but now $a_{15}(\theta - b_{15}) = 0.15$, yielding $P = \sigma(0.15) \approx 0.54$. The curve is flatter, so Kuzco's slight advantage barely moves the needle.
Discrimination tells us how informative a question is. High-discrimination items give us precise information about ability near their difficulty threshold. Low-discrimination items are noisier: they don't reliably distinguish between students at different ability levels. When we design a test, we want a mix: some high-discrimination items to pinpoint ability, and perhaps some lower-discrimination items to cover a wider range. But items with very low discrimination (say, $a_i < 0.3$) might be candidates for revision or removal, since they're not contributing much signal.
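A quick sketch of the same comparison in code, with the illustrative parameters for questions 8 and 15:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_2pl(theta, a, b):
    """2PL: discrimination a controls how steeply probability responds to ability."""
    return sigmoid(a * (theta - b))

theta = 0.8
print(p_2pl(theta, a=2.0, b=0.5))   # question 8, high discrimination: ~0.65
print(p_2pl(theta, a=0.5, b=0.5))   # question 15, low discrimination: ~0.54
```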
Modeling Guessing
On a multiple-choice test with four options, even someone who has no idea can guess correctly 25 percent of the time. The models we've built so far ignore this. They assume that as ability drops far below difficulty, the probability of success approaches zero. But that's not realistic when lucky guesses are possible.
The 3-parameter logistic model (3PL) adds a guessing parameter $c_i$. This is the probability of getting the question right even with effectively zero ability:

$$P(X_i = 1 \mid \theta) = c_i + (1 - c_i)\,\sigma\big(a_i(\theta - b_i)\big)$$
With probability $c_i$, we guess correctly regardless of ability. With probability $1 - c_i$, the ability actually matters, and we use the 2PL curve to determine the odds.
Let's say question 3 is a four-option multiple choice, so we might set $c_3 = 0.25$. It has difficulty $b_3 = 0$ and discrimination $a_3 = 1.5$. For a very weak student with $\theta = -3$, we compute $a_3(\theta - b_3) = -4.5$. Without guessing, $\sigma(-4.5) \approx 0.01$, nearly zero. But with the guessing floor: $P = 0.25 + 0.75 \times 0.01 \approx 0.26$. The student still has that baseline 25 percent chance from random guessing.
For Kuzco with $\theta = 0.8$, we get $a_3(\theta - b_3) = 1.2$, so $\sigma(1.2) \approx 0.77$. Then $P = 0.25 + 0.75 \times 0.77 \approx 0.83$. The guessing parameter lifts the entire curve, but it matters most at the low end where ability-based success is unlikely.
The guessing parameter is most relevant for multiple-choice formats. For open-ended or constructed-response questions, we'd typically set $c_i = 0$ since there's no way to accidentally produce the correct answer. In practice, $c_i$ is often fixed based on the number of options (like 0.2 for five-choice, 0.25 for four-choice) rather than estimated from data, because estimating it freely can lead to unstable fits. Still, the 3PL gives us a more realistic model when guessing is genuinely part of the testing environment.
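In code, the guessing floor is one extra term; the values below mirror the four-option example above and are purely illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_3pl(theta, a, b, c):
    """3PL: guaranteed floor c, with the remaining 1 - c governed by the 2PL curve."""
    return c + (1.0 - c) * sigmoid(a * (theta - b))

# Question 3: four options, so c = 0.25.
print(p_3pl(theta=-3.0, a=1.5, b=0.0, c=0.25))  # very weak student: ~0.26
print(p_3pl(theta=0.8,  a=1.5, b=0.0, c=0.25))  # Kuzco:             ~0.83
```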
General Framework: 4PL IRT
We've added parameters for difficulty, discrimination, and guessing. There's one more twist: what if even a very strong student doesn't have a 100 percent chance of success? Maybe the question is slightly ambiguous, or there's a careless error that trips up even experts. The 4-parameter logistic model introduces an upper asymptote $d_i$, setting a ceiling below 1:

$$P(X_i = 1 \mid \theta) = c_i + (d_i - c_i)\,\sigma\big(a_i(\theta - b_i)\big)$$
Now the curve ranges from $c_i$ (the guessing floor) to $d_i$ (the maximum attainable probability). When $c_i = 0$ and $d_i = 1$, we recover the 2PL. When $d_i = 1$ and $c_i > 0$, we get the 3PL. The full 4PL lets both ends of the curve depart from the ideal 0-to-1 range.
Let's say question 18 is tricky even for top students. We might have $b_{18} = 0$, $a_{18} = 2.0$, $c_{18} = 0.2$ (five-option multiple choice), and $d_{18} = 0.9$ (even experts miss it 10 percent of the time). For Kuzco at $\theta = 0.8$, we compute $a_{18}(\theta - b_{18}) = 1.6$, so $\sigma(1.6) \approx 0.83$. Then $P = 0.2 + (0.9 - 0.2) \times 0.83 \approx 0.78$. Kuzco's odds are decent but capped below what the 2PL would predict.
The general form unifies all the standard models. Setting different parameters to their default values gives us:
- 1PL (Rasch): $a_i = 1$, $c_i = 0$, $d_i = 1$ for all items
- 2PL: $c_i = 0$, $d_i = 1$ for all items, but $a_i$ varies
- 3PL: $d_i = 1$ for all items, but $a_i$ and $c_i$ vary
- 4PL: All four parameters can vary
The choice depends on our data and goals. More parameters give more flexibility but require more data to estimate reliably. In practice, the 2PL is a workhorse for many applications, the 3PL is standard when guessing matters, and the 4PL is reserved for cases where we have strong theoretical reasons to expect an upper asymptote.
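One convenient way to express this unification in code is a single 4PL function whose defaults recover the simpler models; the helper below is a sketch under that convention, using the same illustrative parameters as before:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_4pl(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """General 4PL; the default arguments give 1PL/2PL/3PL as special cases."""
    return c + (d - c) * sigmoid(a * (theta - b))

theta = 0.8
print(p_4pl(theta, b=1.5))                         # 1PL: a=1, c=0, d=1
print(p_4pl(theta, a=2.0, b=0.5))                  # 2PL: c=0, d=1
print(p_4pl(theta, a=1.5, b=0.0, c=0.25))          # 3PL: d=1
print(p_4pl(theta, a=2.0, b=0.0, c=0.2, d=0.9))    # 4PL, question 18: ~0.78
```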
From Students to Models
It happens all too often: we run two models on the same dataset, the ranking flips with a different random seed, and our confidence fades with each rerun. Psychometrics offers a useful perspective here: both Kuzco and his LLM counterpart, Llama Kuzco, can be modeled the same way, with their observed answers treated as expressions of an underlying ability. In Item Response Theory, we treat models like test takers with a latent strength $\theta$, and questions like instruments calibrated along a difficulty scale. The result isn't just another score; it's a map linking ability to success, item by item.
Suppose we have $M$ models, let's call them $j = 1, \dots, M$, and a dataset of $N$ questions. For binary grading, we record $X_{ij} \in \{0, 1\}$: model $j$ solves question $i$ or not. If we use multiple attempts per item (say, $K$ non-deterministic samples), we can keep them all or compress to counts. What matters is that beneath these outcomes lives a smooth curve, the probability of success as a function of ability and item parameters (Lord, 1980). That curve is our first character in the story.
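Concretely, the raw material is just a binary array. Here is a sketch of how it might be shaped, assuming we keep $K$ samples per (model, item) pair; the random placeholder outcomes stand in for real evaluation results:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, K = 4, 30, 8                # models, items, samples per item (illustrative)
# One binary outcome per (model, item, sample); random values stand in for real evals.
samples = rng.integers(0, 2, size=(M, N, K))

# Keep all samples, or compress to success counts out of K attempts per item.
successes = samples.sum(axis=2)   # shape (M, N), values in 0..K
X_binary = (successes > K / 2).astype(int)   # one possible binarization (majority vote)
```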
Now let's imagine lining up our models on a single axis, low on the left, high on the right. Each question stakes a flag on that line: "I become solvable around here." Easy questions plant their flags on the left, hard ones on the right. In the simplest model (1PL/Rasch), all items discriminate equally; they're just positioned at different difficulties. A more flexible model, the 2PL, lets items vary in how sharply they distinguish strong from weak models (the discrimination parameter $a_i$). If there's multiple-choice guessing, the curve never drops to zero; that's the $c_i$ term in the 3PL. Fit the curves, and we get more than a leaderboard: we get a map of where each model lives, which questions truly distinguish them, and how the dataset behaves across the whole spectrum of abilities.
In practice, we start with the binary families. The 1-parameter Rasch model pins every item to a difficulty $b_i$ and gives each model an ability $\theta_j$ (Rasch, 1980), with success probability:

$$P(X_{ij} = 1) = \sigma(\theta_j - b_i)$$
The 2-parameter model sharpens the picture with a discrimination slope $a_i$ (Birnbaum, 1968; Lord, 1980):

$$P(X_{ij} = 1) = \sigma\big(a_i(\theta_j - b_i)\big)$$
When options allow lucky breaks, the 3-parameter version adds a guessing floor $c_i$ (Lord, 1980):

$$P(X_{ij} = 1) = c_i + (1 - c_i)\,\sigma\big(a_i(\theta_j - b_i)\big)$$
We fit these by maximizing a (possibly binomial) likelihood or placing priors and doing Bayesian inference (Lord, 1980; van der Linden & Hambleton, 1997). Because $\theta$ is only defined up to an affine transform, we fix the scale (for example, mean zero and unit variance). The reward for this calibration is clarity: $b_i$ ranks difficulty across items, $a_i$ tells us which questions meaningfully separate close models, and $\theta_j$ situates each model on a common ruler so comparisons stop wobbling with decoding noise.
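A minimal sketch of such a fit, assuming a complete binary matrix `X` of shape models x items and using joint maximum likelihood with SciPy; `fit_2pl` is illustrative, not a production estimator (real pipelines typically add priors, marginal or Bayesian estimation, and uncertainty reporting):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_2pl(X, max_iter=500):
    """Joint ML fit of a 2PL model to a binary matrix X (models x items). Sketch only."""
    M, N = X.shape

    def unpack(params):
        theta = params[:M]
        a = np.exp(params[M:M + N])   # log-parameterized so discrimination stays positive
        b = params[M + N:]
        return theta, a, b

    def neg_log_lik(params):
        theta, a, b = unpack(params)
        p = sigmoid(a[None, :] * (theta[:, None] - b[None, :]))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

    res = minimize(neg_log_lik, np.zeros(M + 2 * N),
                   method="L-BFGS-B", options={"maxiter": max_iter})
    theta, a, b = unpack(res.x)

    # Fix the affine indeterminacy: standardize theta and rescale a, b to compensate,
    # which leaves a * (theta - b), and hence the likelihood, unchanged.
    mu, sd = theta.mean(), theta.std() + 1e-9
    return (theta - mu) / sd, a * sd, (b - mu) / sd
```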
Rubric-based grading extends the same logic. In Samejima's Graded Response Model, each item has ordered thresholds $b_{i1} \le b_{i2} \le \dots \le b_{i,K-1}$ and a common slope $a_i$ (Samejima, 1969). The cumulative chance of earning at least category $k$ is:

$$P(Y_{ij} \ge k) = \sigma\big(a_i(\theta_j - b_{ik})\big)$$
and the category probability comes from adjacent differences, $P(Y_{ij} = k) = P(Y_{ij} \ge k) - P(Y_{ij} \ge k + 1)$. The expected score $E[Y_{ij}] = \sum_k k \, P(Y_{ij} = k)$, normalized by the maximum category value, plays the role of "probability correct."
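A short sketch of the GRM score for a single rubric item, with hypothetical thresholds and slope:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grm_expected_score(theta, a, thresholds):
    """Expected category (0..K-1) for one GRM item at ability theta."""
    # Cumulative probabilities P(Y >= k) for k = 1..K-1, padded with P(Y >= 0) = 1
    # and P(Y >= K) = 0; adjacent differences give the category probabilities.
    cum = sigmoid(a * (theta - np.asarray(thresholds, dtype=float)))
    cum = np.concatenate(([1.0], cum, [0.0]))
    probs = cum[:-1] - cum[1:]
    return probs @ np.arange(len(probs))

# Hypothetical rubric item with categories 0..3 and ordered thresholds.
score = grm_expected_score(theta=0.8, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print(score / 3)   # normalized expected score, the analogue of "probability correct"
```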
Information curves add the second voice to our narrative: reliability. For the 2PL, the item Fisher information at ability $\theta$ is (van der Linden & Hambleton, 1997):

$$I_i(\theta) = a_i^2\, p_i(\theta)\big(1 - p_i(\theta)\big)$$
with $p_i(\theta) = \sigma\big(a_i(\theta - b_i)\big)$. Summing across items gives the test information function, showing where our dataset truly measures ability well, and where it goes quiet. This is how we design stable "Easy/Medium/Hard" splits without guessing.
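A sketch of the test information function for a hypothetical 2PL item bank, summing the item informations over an ability grid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def test_information(theta_grid, a, b):
    """Sum of 2PL item informations a_i^2 * p_i * (1 - p_i) across items."""
    theta_grid = np.asarray(theta_grid, dtype=float)[:, None]   # shape (T, 1)
    p = sigmoid(a[None, :] * (theta_grid - b[None, :]))         # shape (T, n_items)
    return np.sum(a[None, :] ** 2 * p * (1 - p), axis=1)

# Hypothetical item bank: where does this dataset measure ability well?
a = np.array([1.0, 1.5, 2.0, 0.8])
b = np.array([-1.0, 0.0, 0.5, 1.5])
grid = np.linspace(-3, 3, 7)
print(dict(zip(grid.round(1), test_information(grid, a, b).round(2))))
```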
An end-to-end pass looks like this: assemble responses from models over items, choose a family (1PL/2PL/3PL for binary, GRM for rubrics), fit with identifiability constraints, then audit model-item fit and report calibrated parameters with uncertainties (Chen et al., 2021; van der Linden & Hambleton, 1997). The upshot is a dataset that doesn't just score models, it locates them and tells us what to add next.
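To make the end-to-end pass concrete, here is a sketch that simulates a synthetic benchmark from known 2PL parameters; the abilities recovered by a fit (for example, the `fit_2pl` sketch above) can then be compared against `theta_true` as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ground-truth parameters for a synthetic benchmark: 6 models, 40 items (illustrative).
M, N = 6, 40
theta_true = rng.normal(0.0, 1.0, size=M)
a_true = rng.lognormal(mean=0.0, sigma=0.3, size=N)
b_true = rng.normal(0.0, 1.0, size=N)

# Simulate binary outcomes from the 2PL curves.
p = sigmoid(a_true[None, :] * (theta_true[:, None] - b_true[None, :]))
X = rng.binomial(1, p)
print(X.shape, X.mean())   # response matrix and overall solve rate
```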
Simulating IRT Curves
Below we can adjust item parameters (difficulty $b$; discrimination $a$ for 2PL/3PL/GRM; guessing $c$ for 3PL), switch among 1PL/2PL/3PL and GRM, and paste a list of model abilities to see expected performance. Watch how steeper discrimination slices the pack more sharply, how $c$ lifts the floor, and how GRM turns categories into a smooth expected score.
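For readers without the interactive widget, a static Python stand-in can reproduce the same exercise: pick item parameters, choose a family, and map a pasted list of abilities to expected performance (the `expected_performance` helper is defined here purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_performance(abilities, a=1.0, b=0.0, c=0.0, model="3PL"):
    """Expected success probability for each ability under 1PL/2PL/3PL."""
    abilities = np.asarray(abilities, dtype=float)
    if model == "1PL":
        return sigmoid(abilities - b)
    if model == "2PL":
        return sigmoid(a * (abilities - b))
    if model == "3PL":
        return c + (1 - c) * sigmoid(a * (abilities - b))
    raise ValueError(f"unknown model: {model}")

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]   # e.g., fitted abilities for five models
print(expected_performance(abilities, a=2.0, b=0.5, c=0.25, model="3PL").round(2))
```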