Scorio.jl
Scorio — Module
A Julia package implementing the Bayes@N framework for evaluating Large Language Models.
Based on the paper: "Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation" https://arxiv.org/abs/2510.04265
Main APIs
Evaluation Functions (from eval submodule)
bayes(R, w, R0=nothing): Bayesian performance evaluation with uncertainty quantification
avg(R): Simple average of outcomes
pass_at_k(R, k): Unbiased Pass@k estimator
pass_hat_k(R, k): Pass^k (Pass-hat@k) estimator
g_pass_at_k(R, k): Alias for pass_hat_k
g_pass_at_k_tao(R, k, tao): Generalized Pass@k with threshold τ
mg_pass_at_k(R, k): Mean Generalized Pass@k
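As a rough illustration of what the Pass@k and Pass^k entries above compute, the sketch below uses the standard combinatorial estimators on a binary outcome matrix (rows are questions, columns are samples). The names `pass_at_k_single`, `pass_hat_k_single`, and `mean_over_questions` are hypothetical helpers written for this note, not part of Scorio's API, and the package's own implementation may differ in details such as input validation.

```julia
# Hypothetical helpers for illustration only; not Scorio's implementation.

# Unbiased Pass@k for one question with n samples, c of them correct:
# the probability that a random size-k subset contains at least one success.
function pass_at_k_single(n::Integer, c::Integer, k::Integer)
    1 <= k <= n || throw(ArgumentError("k must satisfy 1 <= k <= n"))
    return 1 - binomial(n - c, k) / binomial(n, k)
end

# Pass^k for one question: the probability that a random size-k subset
# consists only of successes.
function pass_hat_k_single(n::Integer, c::Integer, k::Integer)
    1 <= k <= n || throw(ArgumentError("k must satisfy 1 <= k <= n"))
    return binomial(c, k) / binomial(n, k)
end

# Average a per-question estimator over the rows of a 0/1 matrix R (M x N).
function mean_over_questions(f, R::AbstractMatrix{<:Integer}, k::Integer)
    n = size(R, 2)
    return sum(f(n, sum(row), k) for row in eachrow(R)) / size(R, 1)
end

Rbin = [1 0 0 1;
        0 0 0 0]
println(mean_over_questions(pass_at_k_single, Rbin, 2))   # 5/12
println(mean_over_questions(pass_hat_k_single, Rbin, 2))  # 1/12
```

In the first row, two of four samples are correct, so Pass@2 is 1 - 1/6 and Pass^2 is 1/6; the all-zero second row contributes 0 to both averages.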
Ranking Functions (from rank submodule)
elo(): Elo ranking (not yet implemented)
Utility Functions
competition_ranks_from_scores(scores): Compute competition ranks from scores
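For readers unfamiliar with competition ranking, the sketch below shows the usual "1224" convention: tied items share a rank and the next rank is skipped. It assumes that higher scores rank better and that ties are detected by exact equality; whether `competition_ranks_from_scores` makes the same choices is not stated here. The name `competition_ranks_sketch` is hypothetical.

```julia
# Hypothetical illustration of "1224" competition ranking; not Scorio's code.
# Assumption: a higher score is better, and ties use exact equality.
function competition_ranks_sketch(scores::AbstractVector{<:Real})
    # The rank of an item is 1 plus the number of items with a strictly higher score.
    return [1 + count(>(s), scores) for s in scores]
end

println(competition_ranks_sketch([0.9, 0.7, 0.9, 0.5]))  # [1, 3, 1, 4]
```

The two items scoring 0.9 share rank 1, so the next item receives rank 3 rather than 2. This version is quadratic in the number of items, which is fine for small leaderboards.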
Example Usage
using Scorio
# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = [0 1 2 2 1;
     1 1 0 2 2]
# Rubric weights w: length C+1. Here: 0=incorrect, 1=partial(0.5), 2=correct(1.0)
w = [0.0, 0.5, 1.0]
# Optional prior outcomes R0: shape (M, D). If omitted, D=0.
R0 = [0 2;
      1 2]
# With prior (D=2 → T=10)
mu, sigma = bayes(R, w, R0)
println("μ = $mu, σ = $sigma") # Expected: μ ≈ 0.575, σ ≈ 0.084275
# Without prior (D=0 → T=8)
mu2, sigma2 = bayes(R, w)
println("μ = $mu2, σ = $sigma2") # Expected: μ ≈ 0.5625, σ ≈ 0.091998
# Simple average
accuracy = avg(R)
println("Average: $accuracy")
Installation
Once registered, install with:
using Pkg
Pkg.add("Scorio")
For development installation:
using Pkg
Pkg.develop(path="/path/to/Scorio.jl")
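Once the package is installed, the expected values quoted in the example usage can be cross-checked independently. The sketch below is one calculation that reproduces them, assuming a uniform Dirichlet prior with one pseudo-count per category, so that each question has T = 1 + C + D + N total counts. The function `bayes_sketch` is a hypothetical stand-in written for this check; it is not the package's `bayes` and may differ from it in validation and edge cases.

```julia
# Hypothetical re-derivation of the example values; not Scorio's implementation.
# Assumption: one uniform Dirichlet pseudo-count per category, plus counts
# from the prior outcomes R0 (if any) and the observed outcomes R.
function bayes_sketch(R::AbstractMatrix{<:Integer}, w::AbstractVector{<:Real},
                      R0::Union{Nothing,AbstractMatrix{<:Integer}}=nothing)
    M, N = size(R)
    C = length(w) - 1
    D = R0 === nothing ? 0 : size(R0, 2)
    T = 1 + C + D + N
    delta = w .- w[1]                    # weights relative to category 0
    mean_sum = 0.0
    var_sum = 0.0
    for a in 1:M
        nu = ones(Float64, C + 1)        # uniform pseudo-counts
        for j in 1:N
            nu[R[a, j] + 1] += 1         # categories are 0-based, arrays 1-based
        end
        for j in 1:D
            nu[R0[a, j] + 1] += 1
        end
        p = nu ./ T                      # posterior mean category probabilities
        m1 = sum(p .* delta)
        m2 = sum(p .* delta .^ 2)
        mean_sum += m1
        var_sum += m2 - m1^2
    end
    mu = w[1] + mean_sum / M
    sigma = sqrt(var_sum / (M^2 * (T + 1)))
    return mu, sigma
end

R = [0 1 2 2 1;
     1 1 0 2 2]
w = [0.0, 0.5, 1.0]
R0 = [0 2;
      1 2]
println(bayes_sketch(R, w, R0))  # approximately (0.575, 0.084275)
println(bayes_sketch(R, w))      # approximately (0.5625, 0.091998)
```

If the installed `bayes(R, w, R0)` and `bayes(R, w)` return values close to these, the installation is behaving as the example usage describes.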