Simulating LLM Evaluation Datasets Using Psychometric Models

TL;DR: Psychometric models such as IRT can be used to simulate and calibrate LLM benchmarks. They estimate model ability, as well as item difficulty, discrimination, and guessing. This supports graded scoring and multi-sample evaluation, helping build stable, interpretable, and uncertainty-aware datasets that guide better benchmark design.

Background

In psychometric modeling, we often need to turn scores into probabilities. We feed in a score (e.g., ability minus difficulty), and out comes a sensible probability of success. Three common smooth squashing functions do that job and show up in IRT:

\text{logistic:}\quad \sigma(x)=\frac{1}{1+\exp(-x)}\in(0,1),
\text{tanh rescaled:}\quad \tau(x)=\frac{1+\tanh(x)}{2}\in(0,1),
\text{probit:}\quad \Phi(x)=\int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}\exp\bigg(-\frac{t^2}{2}\bigg)dt\in(0,1).

All rise smoothly, center at one half when x = 0, and approach 0 or 1 in the tails. IRT practitioners often use the logistic because it has a simple closed form and nice derivative properties. It's also called the sigmoid: \sigma(x) = \frac{1}{1 + e^{-x}}.

When x is large and positive, this approaches 1. When x is large and negative, it approaches 0. At x = 0, we get exactly 0.5. The curve is smooth, symmetric, and has a nice S-shape that captures how probabilities transition from impossible to certain.

If we plot these three functions side by side, they overlap so closely we'd need to zoom in to see the difference. The sigmoid climbs smoothly from near-zero to near-one as x ranges from about -6 to +6. The steepest part of the climb happens right at x = 0, where small changes in the input produce the biggest changes in probability. This is exactly what we want when modeling test responses: items near a person's ability level are the ones where their probability of success is most sensitive to small shifts in skill.
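To make that comparison concrete, here is a minimal sketch (illustrative, not taken from the original figure) that evaluates the three functions on a small grid; all three pass through 0.5 at x = 0 and saturate toward 0 and 1 in the tails.

```python
# A minimal numeric sketch evaluating the three squashing functions above
# on a small grid so their S-shapes can be compared side by side.
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_rescaled(x):
    return (1.0 + math.tanh(x)) / 2.0

def probit(x):
    # Standard normal CDF written via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-6, -3, -1, 0, 1, 3, 6]:
    print(f"x={x:+d}  logistic={logistic(x):.3f}  "
          f"tanh={tanh_rescaled(x):.3f}  probit={probit(x):.3f}")
```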

Use case in IRT modeling: ability and difficulty live on an unbounded scale (any real number), but probabilities must stay between 0 and 1. The sigmoid family bridges that gap. When we write something like P(\text{correct}) = \sigma(\theta - b), we're saying: compute the distance between ability and difficulty, then smoothly convert that distance into a probability. Positive distance (ability exceeds difficulty) pushes the probability above 0.5. Negative distance pulls it below. The sigmoid handles the translation automatically, ensuring our model always outputs valid probabilities no matter what parameter values we estimate.

How to Model a Dataset

Let's assume a student, Kuzco, is sitting for a standardized test. The test has 30 questions, each targeting a different concept. Some questions are straightforward algebra, while others involve tricky geometry proofs. Kuzco has a certain level of mathematical ability (we don't know it yet; that's what we're trying to measure), and each question has its own level of difficulty (which may or may not be known in advance). What we observe is simple: Kuzco answers each question correctly or incorrectly. This means we have observable outcomes (0/1), and we want to infer hidden quantities: how strong Kuzco really is, and how hard each question really is. Once we know those, we can predict how Kuzco would perform on new questions, compare Kuzco fairly to other students, and figure out which questions are actually measuring what we think they're measuring.

Kuzco when facing an easy question: θ ≫ b
Kuzco when facing a medium question: θ ≈ b
Kuzco when facing a hard question: θ ≪ b

Now let's formalize this step by step by building IRT models.

Modeling Difficulty and Ability

The first question we may ask is: does Kuzco's ability exceed a question's difficulty? If Kuzco's ability \theta is higher than question j's difficulty b_j, we expect Kuzco to probably get it right. If \theta is lower than b_j, we expect Kuzco to probably get it wrong.

The Rasch model (1PL) captures this with elegant simplicity. The probability that Kuzco answers question jj correctly is:

P(Y_j = 1 \mid \theta, b_j) = \sigma(\theta - b_j) = \frac{1}{1 + e^{-(\theta - b_j)}}

Let's say Kuzco has ability \theta = 1.2 (on a standardized scale where 0 is average). Question 5 is moderately easy with b_5 = -0.5. The difference is 1.2 - (-0.5) = 1.7, so P(\text{correct}) = \sigma(1.7) \approx 0.85. Kuzco has a good chance of getting this one right.

Question 12, however, is quite hard with b_{12} = 2.0. Now the difference is 1.2 - 2.0 = -0.8, giving P(\text{correct}) = \sigma(-0.8) \approx 0.31. More likely than not, Kuzco will miss this question.
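A tiny sketch to reproduce these two 1PL numbers; the helper name is illustrative rather than from any particular IRT library.

```python
# A minimal sketch of the Rasch (1PL) probabilities computed above.
import math

def p_correct_1pl(theta, b):
    # P(correct) = sigma(theta - b)
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 1.2                               # Kuzco's ability
print(p_correct_1pl(theta, b=-0.5))       # question 5:  ~0.85
print(p_correct_1pl(theta, b=2.0))        # question 12: ~0.31
```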

Notice what's beautiful here: we're not just counting raw scores. We're building a model that explains the pattern of which questions get answered correctly. Two students with the same raw score might have very different abilities if they got different subsets of questions right. Similarly, two questions with the same pass rate might differ in difficulty if they're passed by different groups of students.

The mathematical form \sigma(\theta - b) has an intuitive interpretation. The quantity \theta - b is the "distance" between ability and difficulty. When that distance is 0, ability exactly matches difficulty, and we get \sigma(0) = 0.5, a coin flip. Every unit increase in \theta - b shifts the probability upward along the sigmoid curve. The model assumes all questions discriminate equally well (we'll relax this shortly), and the only thing that differs between questions is where they sit on the difficulty scale.

Modeling Discrimination

Not all questions are created equal. Some questions are gentle slopes where a small increase in ability barely changes our odds of success. Others are sharp cliffs: students just below the threshold almost always fail, while students just above almost always pass.

This is where the discrimination parameter a comes in. It controls how steeply the probability curve rises as ability increases. A high discrimination (a > 1) means the question is good at separating students who know the material from those who don't. A low discrimination (a < 1) means the question is less informative, perhaps because it's ambiguously worded or tests multiple unrelated skills at once.

The 2-parameter logistic model (2PL) incorporates discrimination:

P(Y_j = 1 \mid \theta, a_j, b_j) = \sigma(a_j(\theta - b_j)) = \frac{1}{1 + e^{-a_j(\theta - b_j)}}

Back to Kuzco. Question 8 has difficulty b_8 = 1.0 and discrimination a_8 = 2.0. With Kuzco's ability \theta = 1.2, we compute a_8(\theta - b_8) = 2.0 \times (1.2 - 1.0) = 0.4, giving P(\text{correct}) = \sigma(0.4) \approx 0.60. The high discrimination amplifies the small gap between Kuzco's ability and the question's difficulty.

Now consider question 15 with the same difficulty b_{15} = 1.0 but lower discrimination a_{15} = 0.5. The gap is still 1.2 - 1.0 = 0.2, but now a_{15}(\theta - b_{15}) = 0.5 \times 0.2 = 0.1, yielding P(\text{correct}) = \sigma(0.1) \approx 0.52. The curve is flatter, so Kuzco's slight advantage barely moves the needle.
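The same check for the 2PL examples, again as an illustrative sketch:

```python
# A minimal sketch of the 2PL probabilities computed above.
import math

def p_correct_2pl(theta, a, b):
    # P(correct) = sigma(a * (theta - b))
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

theta = 1.2
print(p_correct_2pl(theta, a=2.0, b=1.0))   # question 8:  ~0.60
print(p_correct_2pl(theta, a=0.5, b=1.0))   # question 15: ~0.52
```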

Discrimination tells us how informative a question is. High-discrimination items give us precise information about ability near their difficulty threshold. Low-discrimination items are noisier: they don't reliably distinguish between students at different ability levels. When we design a test, we want a mix: some high-discrimination items to pinpoint ability, and perhaps some lower-discrimination items to cover a wider range. But items with very low discrimination (say, a < 0.5) might be candidates for revision or removal, since they're not contributing much signal.

Modeling Guessing

On a multiple-choice test with four options, even someone who has no idea can guess correctly 25 percent of the time. The models we've built so far ignore this. They assume that as ability drops far below difficulty, the probability of success approaches zero. But that's not realistic when lucky guesses are possible.

The 3-parameter logistic model (3PL) adds a guessing parameter c. This is the probability of getting the question right even with effectively zero ability:

P(Y_j = 1 \mid \theta, a_j, b_j, c_j) = c_j + (1 - c_j)\, \sigma(a_j(\theta - b_j))

With probability c_j, we guess correctly regardless of ability. With probability 1 - c_j, the ability actually matters, and we use the 2PL curve to determine the odds.

Let's say question 3 is a four-option multiple choice, so we might set c_3 = 0.25. It has difficulty b_3 = 1.5 and discrimination a_3 = 1.5. For a very weak student with \theta = -2.0, we compute a_3(\theta - b_3) = 1.5 \times (-2.0 - 1.5) = -5.25. Without guessing, \sigma(-5.25) \approx 0.005, nearly zero. But with the guessing floor: P(\text{correct}) = 0.25 + 0.75 \times 0.005 \approx 0.254. The student still has that baseline 25 percent chance from random guessing.

For Kuzco with \theta = 1.2, we get a_3(\theta - b_3) = 1.5 \times (1.2 - 1.5) = -0.45, so \sigma(-0.45) \approx 0.39. Then P(\text{correct}) = 0.25 + 0.75 \times 0.39 \approx 0.54. The guessing parameter lifts the entire curve, but it matters most at the low end where ability-based success is unlikely.
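And the 3PL version of the same sketch, showing how the guessing floor dominates at low ability:

```python
# A minimal sketch of the 3PL probabilities computed above.
import math

def p_correct_3pl(theta, a, b, c):
    # P(correct) = c + (1 - c) * sigma(a * (theta - b))
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

a, b, c = 1.5, 1.5, 0.25                     # question 3
print(p_correct_3pl(-2.0, a, b, c))          # very weak student: ~0.254
print(p_correct_3pl(1.2, a, b, c))           # Kuzco:             ~0.54
```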

The guessing parameter is most relevant for multiple-choice formats. For open-ended or constructed-response questions, we'd typically set c = 0 since there's no way to accidentally produce the correct answer. In practice, c is often fixed based on the number of options (like 0.2 for five-choice, 0.25 for four-choice) rather than estimated from data, because estimating it freely can lead to unstable fits. Still, the 3PL gives us a more realistic model when guessing is genuinely part of the testing environment.

General Framework: 4PL IRT

We've added parameters for difficulty, discrimination, and guessing. There's one more twist: what if even a very strong student doesn't have a 100 percent chance of success? Maybe the question is slightly ambiguous, or there's a careless error that trips up even experts. The 4-parameter logistic model (4PL) introduces an upper asymptote d, setting a ceiling below 1:

P(Y_j = 1 \mid \theta, a_j, b_j, c_j, d_j) = c_j + (d_j - c_j)\, \sigma(a_j(\theta - b_j))

Now the curve ranges from c_j (the guessing floor) to d_j (the maximum attainable probability). When c = 0 and d = 1, we recover the 2PL. When c > 0 and d = 1, we get the 3PL. The full 4PL lets both ends of the curve depart from the ideal 0-to-1 range.

Let's say question 18 is tricky even for top students. We might have b_{18} = 0.5, a_{18} = 1.2, c_{18} = 0.2 (five-option multiple choice), and d_{18} = 0.9 (even experts miss it 10 percent of the time). For Kuzco at \theta = 1.2, we compute a_{18}(\theta - b_{18}) = 1.2 \times (1.2 - 0.5) = 0.84, so \sigma(0.84) \approx 0.70. Then P(\text{correct}) = 0.2 + (0.9 - 0.2) \times 0.70 = 0.2 + 0.49 = 0.69. Kuzco's odds are decent but capped below what the 2PL would predict.

The general form c + (d - c)\, \sigma(a(\theta - b)) unifies all the standard models. Setting different parameters to their default values gives us:

  • 1PL (Rasch): a = 1, c = 0, d = 1 for all items
  • 2PL: c = 0, d = 1 for all items, but a varies
  • 3PL: d = 1 for all items, but a and c vary
  • 4PL: all four parameters can vary

The choice depends on our data and goals. More parameters give more flexibility but require more data to estimate reliably. In practice, the 2PL is a workhorse for many applications, the 3PL is standard when guessing matters, and the 4PL is reserved for cases where we have strong theoretical reasons to expect an upper asymptote.
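One way to see the nesting is a single function for the general 4PL form whose defaults recover the simpler models. This is a sketch using the parameter values from the worked examples above, not a reference implementation:

```python
# A sketch of the general form c + (d - c) * sigma(a * (theta - b)).
# Leaving parameters at their defaults recovers the nested models.
import math

def p_correct(theta, b, a=1.0, c=0.0, d=1.0):
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

theta = 1.2
print(p_correct(theta, b=-0.5))                        # 1PL, question 5:  ~0.85
print(p_correct(theta, b=1.0, a=2.0))                  # 2PL, question 8:  ~0.60
print(p_correct(theta, b=1.5, a=1.5, c=0.25))          # 3PL, question 3:  ~0.54
print(p_correct(theta, b=0.5, a=1.2, c=0.2, d=0.9))    # 4PL, question 18: ~0.69
```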

From Students to Models

It happens all too often: we run two models on the same dataset, the ranking flips with a different random seed, and our confidence fades with each rerun. Both Kuzco and an LLM, call it Llama Kuzco, can be modeled in the same way: as different expressions of an underlying ability. Psychometrics offers a useful perspective here. In Item Response Theory, we treat models like test takers with a latent strength \theta, and questions like instruments calibrated along a difficulty scale. The result isn’t just another score; it’s a map linking ability to success, item by item.

Suppose we have L models, call them \{\text{LLM}_1, \ldots, \text{LLM}_L\}, and a dataset of M questions. For binary grading, we record Y_{\ell j} \in \{0, 1\}: model \ell solves question j or not. If we use multiple attempts per item (say, N non-deterministic samples), we can keep them all or compress to counts. What matters is that beneath these outcomes lives a smooth curve, the probability of success as a function of ability and item parameters (Lord, 1980). That curve is our first character in the story.
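To ground the notation, here is a sketch that simulates a binary response matrix Y_{\ell j} under a 2PL generating process; the parameter ranges and seed are made up for illustration.

```python
# A sketch simulating Y[l, j] for L "models" and M items under a 2PL
# generating process. Parameter ranges here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 8, 30, 4                        # models, items, samples per item

theta = rng.normal(0.0, 1.0, size=L)      # latent abilities
b = rng.normal(0.0, 1.0, size=M)          # item difficulties
a = rng.uniform(0.5, 2.0, size=M)         # item discriminations

# Success probabilities P[l, j] = sigma(a_j * (theta_l - b_j)).
P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

Y = rng.binomial(1, P)                    # single attempt per item, shape (L, M)
counts = rng.binomial(N, P)               # or N samples per item, kept as counts
```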

Now let's imagine lining up our models on a single axis, low \theta on the left, high \theta on the right. Each question stakes a flag on that line: "I become solvable around here." Easy questions plant their flags on the left, hard ones on the right. In the simplest model (1PL/Rasch), all items discriminate equally; they're just positioned at different difficulties. A more flexible model, the 2PL, lets items vary in how sharply they distinguish strong from weak models (the discrimination parameter a). If there's multiple-choice guessing, the curve never drops to zero; that's the c term in the 3PL. Fit the curves, and we get more than a leaderboard: we get a map of where each model lives, which questions truly distinguish them, and how the dataset behaves across the whole spectrum of abilities.

In practice, we start with the binary families. The 1-parameter Rasch model pins every item to a difficulty b_j and gives each model an ability \theta_\ell (Rasch, 1980), with success probability:

P(Y_{\ell j} = 1 \mid \theta_\ell, b_j) = \sigma(\theta_\ell - b_j), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}

The 2-parameter model sharpens the picture with a discrimination slope a_j (Birnbaum, 1968; Lord, 1980):

P(Y_{\ell j} = 1 \mid \theta_\ell, a_j, b_j) = \sigma(a_j(\theta_\ell - b_j))

When options allow lucky breaks, the 3-parameter version adds a guessing floor c_j (Lord, 1980):

P(Y_{\ell j} = 1 \mid \theta_\ell, a_j, b_j, c_j) = c_j + (1 - c_j)\, \sigma(a_j(\theta_\ell - b_j))

We fit these by maximizing a (possibly binomial) likelihood or placing priors and doing Bayesian inference (Lord, 1980; van der Linden & Hambleton, 1997). Because \theta is only defined up to an affine transform, we fix the scale (for example, mean zero and unit variance). The reward for this calibration is clarity: b_j ranks difficulty across items, a_j tells us which questions meaningfully separate close models, and \theta_\ell situates each model on a common ruler so comparisons stop wobbling with decoding noise.
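As a deliberately simple illustration of the fitting step, the sketch below does a joint maximum-likelihood Rasch fit with scipy and then centers the abilities to fix the scale; real pipelines often prefer marginal or Bayesian estimation, so treat this only as a sketch.

```python
# A sketch of joint maximum-likelihood Rasch fitting with a mean-zero
# ability constraint applied after optimization. Illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fit_rasch(Y):
    """Y: (L, M) binary matrix of model-by-item outcomes."""
    L, M = Y.shape

    def neg_log_lik(params):
        theta, b = params[:L], params[L:]
        P = expit(theta[:, None] - b[None, :])
        eps = 1e-9  # guard against log(0)
        return -np.sum(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps))

    res = minimize(neg_log_lik, np.zeros(L + M), method="L-BFGS-B")
    theta_hat, b_hat = res.x[:L], res.x[L:]
    shift = theta_hat.mean()              # identifiability: mean-zero abilities
    return theta_hat - shift, b_hat - shift
```

Calling fit_rasch on the simulated matrix from earlier would return abilities and difficulties on a common, mean-zero scale.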

Rubric-based grading extends the same logic. In Samejima's Graded Response Model (GRM), each item has ordered thresholds b_{j1} < \cdots < b_{jC} and a common slope a_j (Samejima, 1969). The cumulative chance of earning at least category k is:

P(Y_{\ell j} \ge k \mid \theta_\ell, a_j, b_{jk}) = \sigma(a_j(\theta_\ell - b_{jk})), \quad k = 1, \ldots, C

and the category probability comes from adjacent differences. The expected score \mathbb{E}[Y_{\ell j} \mid \theta_\ell], normalized by the maximum category value, plays the role of "probability correct."
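A sketch of the GRM bookkeeping with made-up thresholds: cumulative curves from the thresholds, category probabilities by adjacent differences, then the normalized expected score.

```python
# A sketch of Samejima's GRM arithmetic; the thresholds below are made up.
import numpy as np
from scipy.special import expit

def grm_expected_score(theta, a, thresholds):
    """thresholds: ordered b_{j1} < ... < b_{jC}; categories run 0..C."""
    b = np.asarray(thresholds, dtype=float)
    C = len(b)
    # P(Y >= k) for k = 0..C, with P(Y >= 0) = 1 by convention.
    cum = np.concatenate(([1.0], expit(a * (theta - b))))
    # Category probabilities by adjacent differences, with P(Y >= C+1) = 0.
    probs = cum - np.concatenate((cum[1:], [0.0]))
    expected = np.sum(np.arange(C + 1) * probs)
    return expected / C   # normalized to [0, 1], the "probability correct" analogue

print(grm_expected_score(theta=1.2, a=1.5, thresholds=[-1.0, 0.0, 1.5]))  # ~0.74
```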

Information curves add the second voice to our narrative: reliability. For the 2PL, the item Fisher information at ability \theta is (van der Linden & Hambleton, 1997):

I_j(\theta) = a_j^2\, P(\theta)\,(1 - P(\theta))

with P(\theta) = \sigma(a_j(\theta - b_j)). Summing across items gives the test information function, showing where our dataset truly measures ability well, and where it goes quiet. This is how we design stable "Easy/Medium/Hard" splits without guessing.
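A sketch of item and test information for a small 2PL item bank over an ability grid; the item parameters are made up.

```python
# A sketch of 2PL item/test information curves; item parameters are made up.
import numpy as np
from scipy.special import expit

a = np.array([0.6, 1.0, 1.8, 2.2])        # discriminations
b = np.array([-1.5, 0.0, 0.5, 2.0])       # difficulties
theta_grid = np.linspace(-4, 4, 161)

# P has shape (items, grid); I_j(theta) = a_j^2 * P * (1 - P).
P = expit(a[:, None] * (theta_grid[None, :] - b[:, None]))
item_info = (a[:, None] ** 2) * P * (1 - P)
test_info = item_info.sum(axis=0)         # test information function

print(theta_grid[test_info.argmax()])     # ability region the test measures best
```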

An end-to-end pass looks like this: assemble responses from L models over M items, choose a family (1PL/2PL/3PL for binary, GRM for rubrics), fit with identifiability constraints, then audit model-item fit and report calibrated parameters with uncertainties (Chen et al., 2021; van der Linden & Hambleton, 1997). The upshot is a dataset that doesn't just score models; it locates them and tells us what to add next.

Simulating IRT Curves

Below we can adjust item parameters (difficulty b; discrimination a for 2PL/3PL/GRM; guessing c for 3PL), switch among 1PL/2PL/3PL and GRM, and paste a list of model abilities to see expected performance. Watch how steeper discrimination slices the pack more sharply, how c lifts the floor, and how GRM turns categories into a smooth expected score.
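The interactive widget isn't reproduced in this text version, so here is a rough static stand-in: plot a few item characteristic curves under different parameter settings and mark a pasted-in list of model abilities. All values are made up.

```python
# A rough static stand-in for the interactive simulator described above.
# Item parameters and abilities are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

theta = np.linspace(-4, 4, 200)
items = [                                  # (label, a, b, c)
    ("1PL: b=0",                 1.0, 0.0, 0.00),
    ("2PL: a=2, b=0.5",          2.0, 0.5, 0.00),
    ("3PL: a=1.5, b=1, c=0.25",  1.5, 1.0, 0.25),
]
abilities = [-1.0, 0.3, 1.2]               # e.g. three models on the theta scale

for label, a, b, c in items:
    plt.plot(theta, c + (1 - c) * expit(a * (theta - b)), label=label)
for th in abilities:
    plt.axvline(th, linestyle="--", alpha=0.4)

plt.xlabel("ability theta")
plt.ylabel("P(correct)")
plt.legend()
plt.show()
```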

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical Theories of Mental Test Scores. Addison-Wesley.
Chen, Y., Li, X., Liu, J., & Ying, Z. (2021). Item Response Theory: A Statistical Framework for Educational and Psychological Measurement. arXiv preprint arXiv:2108.08604.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Routledge.
Rasch, G. (1980). Probabilistic Models for Some Intelligence and Attainment Tests. The University of Chicago Press.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2).
van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of Modern Item Response Theory. Springer.