This repository contains two packages:
scorio (Python) - Python implementation
Scorio.jl (Julia) - Julia implementation
# Install from PyPI
pip install scorio
# Install from repository
pip install -e .
import numpy as np
from scorio import eval
# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = np.array([[0, 1, 2, 2, 1],
[1, 1, 0, 2, 2]])
# Rubric weights w: length C+1
# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)
w = np.array([0.0, 0.5, 1.0])
# Optional prior outcomes R0: shape (M, D)
R0 = np.array([[0, 2],
[1, 2]])
# Bayesian evaluation with prior
mu, sigma = eval.bayes(R, w, R0)
print(f"μ = {mu:.6f}, σ = {sigma:.6f}")
# Expected: μ ≈ 0.575, σ ≈ 0.084275
# Bayesian evaluation without prior
mu2, sigma2 = eval.bayes(R, w)
print(f"μ = {mu2:.6f}, σ = {sigma2:.6f}")
# Expected: μ ≈ 0.5625, σ ≈ 0.091998
# Simple average
accuracy = eval.avg(R)
print(f"Average: {accuracy:.6f}")
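The expected values quoted above are consistent with a simple model: each row of R gets a uniform Dirichlet(1, ..., 1) prior over the C+1 categories, prior outcomes R0 are counted as extra trials, and the weighted score is averaged over rows. The sketch below is a re-derivation under that assumption, not scorio's actual implementation; `bayes_sketch` is a hypothetical name used only for illustration.

```python
import numpy as np

def bayes_sketch(R, w, R0=None):
    """Hypothetical re-derivation of (mu, sigma); not scorio's own code.

    Assumes an independent uniform Dirichlet(1, ..., 1) prior per row.
    """
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)
    if R0 is not None:
        # Assumption: prior outcomes simply count as additional trials.
        R = np.hstack([np.asarray(R0), R])
    M = R.shape[0]
    K = w.size  # number of categories, C + 1

    # counts[i, k] = number of times category k appears in row i
    counts = np.stack([np.bincount(row, minlength=K) for row in R])
    nu = counts + 1.0          # Dirichlet posterior parameters
    T = nu.sum(axis=1)         # K + number of trials per row

    mean_row = (nu @ w) / T                 # E[w . pi] for each row
    second_row = (nu @ w**2) / T            # sum_k w_k^2 E[pi_k]
    var_row = (second_row - mean_row**2) / (T + 1.0)

    mu = mean_row.mean()                    # average over the M rows
    sigma = np.sqrt(var_row.sum()) / M      # rows treated as independent
    return float(mu), float(sigma)

R = np.array([[0, 1, 2, 2, 1],
              [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])
R0 = np.array([[0, 2],
               [1, 2]])

print(bayes_sketch(R, w, R0))  # approx (0.575, 0.084275)
print(bayes_sketch(R, w))      # approx (0.5625, 0.091998)
```

Agreement on this one example does not show that the library computes the result the same way. Treat the sketch as a sanity check rather than a specification.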
using Pkg
# From local development
Pkg.develop(path="./julia/Scorio.jl")
# Or from Julia General Registry (coming soon)
# Pkg.add("Scorio")
using Scorio
# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = [0 1 2 2 1;
1 1 0 2 2]
# Rubric weights w: length C+1
# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)
w = [0.0, 0.5, 1.0]
# Optional prior outcomes R0: shape (M, D)
R0 = [0 2;
1 2]
# Bayesian evaluation with prior
mu, sigma = bayes(R, w, R0)
println("μ = $mu, σ = $sigma")
# Expected: μ ≈ 0.575, σ ≈ 0.084275
# Bayesian evaluation without prior
mu2, sigma2 = bayes(R, w)
println("μ = $mu2, σ = $sigma2")
# Expected: μ ≈ 0.5625, σ ≈ 0.091998
# Simple average
accuracy = avg(R)
println("Average: $accuracy")
bayes(R, w, R0=None)
Bayesian performance evaluation with uncertainty quantification using the Bayes@N framework.
R: M × N integer matrix with entries in {0, ..., C} (outcomes for M systems over N trials)
w: length C+1 float vector of rubric weights mapping categories to scores
R0 (optional): M × D integer matrix of prior outcomes
Returns (mu, sigma) - posterior estimate and uncertainty
Outcomes are integer categories in {0, ..., C}
w is a weight vector of length C+1 (e.g., [0, 1] for binary outcomes)
R is M × N (M systems, N trials)
R0 is M × D (M systems, D prior trials)
R0 must share the same M and category set as R
Full documentation is available at: https://mohsenhariri.github.io/bayes-kit/
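The shape and category requirements listed above can be checked before calling `bayes`. The helper below is a minimal sketch: `validate_inputs` is a hypothetical function that is not part of scorio, and the rule that R0 uses the same categories as R is an assumption read from the requirements.

```python
import numpy as np

def validate_inputs(R, w, R0=None):
    """Hypothetical pre-flight check; not part of the scorio API."""
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)

    if R.ndim != 2:
        raise ValueError("R must be 2-D with shape (M, N)")
    if w.ndim != 1 or w.size < 2:
        raise ValueError("w must be 1-D with length C+1 >= 2")
    if not np.issubdtype(R.dtype, np.integer):
        raise ValueError("R must contain integer categories")

    C = w.size - 1
    if R.min() < 0 or R.max() > C:
        raise ValueError(f"entries of R must lie in {{0, ..., {C}}}")

    if R0 is not None:
        R0 = np.asarray(R0)
        if R0.ndim != 2 or R0.shape[0] != R.shape[0]:
            raise ValueError("R0 must be 2-D with shape (M, D), same M as R")
        if not np.issubdtype(R0.dtype, np.integer):
            raise ValueError("R0 must contain integer categories")
        # Assumption: R0 uses the same category set as R.
        if R0.min() < 0 or R0.max() > C:
            raise ValueError(f"entries of R0 must lie in {{0, ..., {C}}}")

    return R, w, R0

# Binary outcomes with weights [0, 1]: 2 systems, 3 trials each
R_bin, w_bin, _ = validate_inputs([[0, 1, 1], [1, 0, 1]], [0, 1])
print(R_bin.shape, w_bin)
```

Raising `ValueError` early gives a clearer message than a shape or indexing error from deeper inside the computation.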
If you use Bayes-Kit in your research, please cite:
@article{bayeskit2025,
title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
author={Hariri, Mohsen and Samandar, Amirhossein},
journal={arXiv preprint arXiv:2504.11651},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE file for details.