Global health model benchmark

Live OpenRouter benchmark results with numeric scoring, LLM-judge rubric grades, token usage, and cost metadata.

Best model: Qwen3 235B A22B 2507 (85% mean score)
Overall mean: 78% (30 tasks across 5 model views)
Numeric pass: 73% (12 exact-answer questions in scope)
Avg tokens: 310 (mean total tokens per model/question attempt)
Leaderboard: mean score under the active filters, 30 tasks (chart not reproduced)
Token breakdown: average generated tokens by model (chart not reproduced)
Task family profile: grouped scores by question abstraction (chart not reproduced)
Models: Gpt 4O Mini, Gemini 2.0 Flash Lite 001, Claude 3 Haiku, Mistral Small 3.2 24B Instruct, Qwen3 235B A22B 2507
Per-question breakdown
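Each row lists the best- and worst-scoring model for a question, the spread between them, and the average token count; open the question page for the full model-by-model breakdown. Spread is the best score minus the worst and appears to be computed from unrounded scores, so it can differ by a point from the displayed percentages.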
Columns: Question, Type, Best, Worst, Spread, Avg tokens

gh_numeric_001 · pkpd fundamentals (Numeric)
Using AUC = F * Dose / CL, compute AUC when bioavailability F = 0.8, dose = 500 mg, and c...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 225

gh_numeric_002 · pkpd fundamentals (Numeric)
Using t1/2 = ln(2) * V / CL, compute the half-life when V = 50 L and CL = 5 L/hour. Use l...
Best: Gpt 4O Mini, 100% · Worst: Claude 3 Haiku, 0% · Spread: 100% · Avg tokens: 271

gh_numeric_003 · pkpd response models (Numeric)
For an EMAX model E = E0 + (EMAX * C) / (EC50 + C), compute E when E0 = 0, EMAX = 90, EC5...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 255

gh_numeric_004 · pkpd response models (Numeric)
For a sigmoid EMAX model E = E0 + (EMAX * C^gamma) / (EC50^gamma + C^gamma), compute E wh...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 329

gh_numeric_005 · population pk (Numeric)
Clearance follows CL_i = CL_typ * (WT_i / 70)^0.75. If CL_typ = 8 L/hour and WT_i = 35 kg...
Best: Gemini 2.0 Flash Lite 001, 100% · Worst: Qwen3 235B A22B 2507, 0% · Spread: 100% · Avg tokens: 291

gh_numeric_006 · population pk (Numeric)
Volume follows V_i = V_typ * (WT_i / 70)^1. If V_typ = 50 L and WT_i = 14 kg, what is V_i?
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 270

gh_numeric_007 · covariate modelling (Numeric)
In a one-parameter stepwise covariate screen, use a forward-inclusion threshold of Delta...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 223

gh_numeric_008 · maximum likelihood (Numeric)
For X ~ Binomial(n = 100, p = 0.3), compute P(X = 30). Give the probability to six decima...
Best: Gpt 4O Mini, 0% · Worst: Qwen3 235B A22B 2507, 0% · Spread: 0% · Avg tokens: 366

gh_numeric_009 · maximum likelihood (Numeric)
For X ~ Binomial(n = 100, p = 0.3), compute P(X >= 30). Give the probability to six decim...
Best: Gpt 4O Mini, 0% · Worst: Qwen3 235B A22B 2507, 0% · Spread: 0% · Avg tokens: 373

gh_numeric_010 · model comparison (Numeric)
AIC is defined as 2 * NLL + 2 * k. Compute AIC when the minimized NLL is 48.3 and the mod...
Best: Gpt 4O Mini, 100% · Worst: Claude 3 Haiku, 0% · Spread: 100% · Avg tokens: 189

gh_numeric_011 · model comparison (Numeric)
For nested models, compute the likelihood ratio statistic 2 * (NLL_restricted - NLL_full)...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 220

gh_numeric_012 · uncertainty (Numeric)
For a one-parameter likelihood confidence interval, use minimum NLL + 1.92 as the 95% cut...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 222

gh_subjective_013 · model fitting (Short answer)
Define model fitting in the context of mechanistic or phenomenological models and observe...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 276

gh_subjective_014 · least squares (Short answer)
Explain the iterative logic shared by least squares and maximum likelihood fitting.
Best: Gemini 2.0 Flash Lite 001, 100% · Worst: Gpt 4O Mini, 90% · Spread: 10% · Avg tokens: 353

gh_subjective_015 · maximum likelihood (Short answer)
Explain the difference between probability P(D | theta) and likelihood L(theta | D).
Best: Gemini 2.0 Flash Lite 001, 100% · Worst: Claude 3 Haiku, 60% · Spread: 40% · Avg tokens: 292

gh_subjective_016 · maximum likelihood (Short answer)
Why do likelihood-based workflows often use log likelihoods or negative log likelihoods?
Best: Gpt 4O Mini, 100% · Worst: Claude 3 Haiku, 70% · Spread: 30% · Avg tokens: 322

gh_subjective_017 · identifiability (Short answer)
What does an identifiability problem mean in model fitting, and why does it matter for in...
Best: Gpt 4O Mini, 90% · Worst: Qwen3 235B A22B 2507, 90% · Spread: 0% · Avg tokens: 292

gh_subjective_018 · bayesian inference (Short answer)
State Bayes' theorem in words and explain the roles of prior, likelihood, posterior, and...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 347

gh_subjective_019 · bayesian inference (Short answer)
Give two advantages and two disadvantages or cautions of Bayesian methods for infectious...
Best: Gpt 4O Mini, 100% · Worst: Qwen3 235B A22B 2507, 100% · Spread: 0% · Avg tokens: 290

gh_subjective_020 · bayesian workflow (Short answer)
What is a prior predictive check, and what question is it designed to answer?
Best: Gemini 2.0 Flash Lite 001, 100% · Worst: Claude 3 Haiku, 70% · Spread: 30% · Avg tokens: 247

gh_subjective_021 · bayesian workflow (Short answer)
What is a posterior predictive check, and how does it differ from a prior predictive check?
Best: Qwen3 235B A22B 2507, 100% · Worst: Mistral Small 3.2 24B Instruct, 90% · Spread: 10% · Avg tokens: 324

gh_subjective_022 · hmc diagnostics (Short answer)
List key MCMC or HMC diagnostics used in Bayesian workflows and explain what values or pa...
Best: Qwen3 235B A22B 2507, 100% · Worst: Mistral Small 3.2 24B Instruct, 70% · Spread: 30% · Avg tokens: 372

gh_subjective_023 · hmc diagnostics (Short answer)
What can cause divergent transitions in HMC, and what are reasonable responses?
Best: Qwen3 235B A22B 2507, 83% · Worst: Mistral Small 3.2 24B Instruct, 67% · Spread: 17% · Avg tokens: 351

gh_subjective_024 · pkpd strategy (Short answer)
Compare concentration-response, exposure-response, and dose-response strategies in PK-PD...
Best: Gemini 2.0 Flash Lite 001, 100% · Worst: Mistral Small 3.2 24B Instruct, 70% · Spread: 30% · Avg tokens: 367

gh_subjective_025 · population pk (Short answer)
What are the main components of a nonlinear mixed effects population PK model?
Best: Claude 3 Haiku, 100% · Worst: Mistral Small 3.2 24B Instruct, 90% · Spread: 10% · Avg tokens: 352

gh_applied_026 · dose optimisation (Applied)
A malaria dose-optimisation study finds that small children have lower drug exposure unde...
Best: Qwen3 235B A22B 2507, 69% · Worst: Mistral Small 3.2 24B Instruct, 54% · Spread: 15% · Avg tokens: 384

gh_applied_027 · model diagnostics (Applied)
A population PK model has plausible parameter estimates, but CWRES versus time shows a cl...
Best: Qwen3 235B A22B 2507, 100% · Worst: Claude 3 Haiku, 58% · Spread: 42% · Avg tokens: 359

gh_design_028 · bayesian workflow (Critique/design)
Sketch a Bayesian workflow for fitting an epidemic model that will be used for decision s...
Best: Qwen3 235B A22B 2507, 88% · Worst: Gpt 4O Mini, 50% · Spread: 38% · Avg tokens: 378

gh_design_029 · individual based models (Critique/design)
In an individual-based SEIR simulation, you want to add a hospitalised status for infecti...
Best: Gpt 4O Mini, 47% · Worst: Claude 3 Haiku, 40% · Spread: 7% · Avg tokens: 378

gh_design_030 · model selection (Critique/design)
You compare two infectious disease models and one has lower AIC, while the other has bett...
Best: Qwen3 235B A22B 2507, 88% · Worst: Mistral Small 3.2 24B Instruct, 56% · Spread: 31% · Avg tokens: 382
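
A few of the numeric questions listed above show all of their inputs in the visible prompt text, so their reference values can be recomputed independently. The following is a minimal sketch under stated assumptions, not the benchmark's grading code: the function names are hypothetical, `math.log(2)` is used for ln(2) because the instruction at the end of gh_numeric_002 is truncated, and prompts whose parameters are cut off (for example gh_numeric_001) are deliberately left out. The two binomial questions are included because every model scored 0% on them.

```python
import math


def half_life(v_l: float, cl_l_per_h: float) -> float:
    """gh_numeric_002: t1/2 = ln(2) * V / CL, in hours.

    Assumption: exact ln(2); the prompt's own instruction is truncated.
    """
    return math.log(2) * v_l / cl_l_per_h


def scaled_volume(v_typ: float, wt_kg: float, exponent: float = 1.0) -> float:
    """gh_numeric_006: V_i = V_typ * (WT_i / 70)^exponent, in litres."""
    return v_typ * (wt_kg / 70.0) ** exponent


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


if __name__ == "__main__":
    print(f"gh_numeric_002 half-life: {half_life(50, 5):.3f} h")
    print(f"gh_numeric_006 volume:    {scaled_volume(50, 14):.1f} L")
    print(f"gh_numeric_008 P(X = 30): {binom_pmf(30, 100, 0.3):.6f}")
    print(f"gh_numeric_009 P(X >= 30): {binom_upper_tail(30, 100, 0.3):.6f}")
```

The binomial values are computed with exact integer binomial coefficients from `math.comb`, which avoids the overflow and rounding problems of evaluating factorials in floating point. Whether the benchmark accepts a tolerance around the six-decimal answer is not stated on this page.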