Global health model benchmark
Live OpenRouter benchmark results with numeric scoring, LLM-judge rubric grades, token usage, and cost metadata.
Best model: Qwen3 235B A22B 2507 (85% mean score)
Overall mean: 78% (30 tasks across 5 model views)
Numeric pass: 73% (12 exact-answer questions in scope)
Avg tokens: 310 (mean total tokens per model/question attempt)
Leaderboard: mean score under the active filters (30 tasks)
Token breakdown: average generated tokens by model
Task family profile: grouped scores by question abstraction
Models shown: GPT-4o Mini, Gemini 2.0 Flash Lite 001, Claude 3 Haiku, Mistral Small 3.2 24B Instruct, Qwen3 235B A22B 2507
Per-question breakdown
Each row shows the best and worst model scores for one question and the spread between them; open the question page for the full model-by-model breakdown.
| Question | Type | Best | Worst | Spread | Avg tokens |
|---|---|---|---|---|---|
| gh_numeric_001 · pkpd fundamentals · Using AUC = F * Dose / CL, compute AUC when bioavailability F = 0.8, dose = 500 mg, and c... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 225 |
| gh_numeric_002 · pkpd fundamentals · Using t1/2 = ln(2) * V / CL, compute the half-life when V = 50 L and CL = 5 L/hour. Use l... | Numeric | GPT-4o Mini 100% | Claude 3 Haiku 0% | 100% | 271 |
| gh_numeric_003 · pkpd response models · For an EMAX model E = E0 + (EMAX * C) / (EC50 + C), compute E when E0 = 0, EMAX = 90, EC5... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 255 |
| gh_numeric_004 · pkpd response models · For a sigmoid EMAX model E = E0 + (EMAX * C^gamma) / (EC50^gamma + C^gamma), compute E wh... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 329 |
| gh_numeric_005 · population pk · Clearance follows CL_i = CL_typ * (WT_i / 70)^0.75. If CL_typ = 8 L/hour and WT_i = 35 kg... | Numeric | Gemini 2.0 Flash Lite 001 100% | Qwen3 235B A22B 2507 0% | 100% | 291 |
| gh_numeric_006 · population pk · Volume follows V_i = V_typ * (WT_i / 70)^1. If V_typ = 50 L and WT_i = 14 kg, what is V_i? | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 270 |
| gh_numeric_007 · covariate modelling · In a one-parameter stepwise covariate screen, use a forward-inclusion threshold of Delta... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 223 |
| gh_numeric_008 · maximum likelihood · For X ~ Binomial(n = 100, p = 0.3), compute P(X = 30). Give the probability to six decima... | Numeric | GPT-4o Mini 0% | Qwen3 235B A22B 2507 0% | 0% | 366 |
| gh_numeric_009 · maximum likelihood · For X ~ Binomial(n = 100, p = 0.3), compute P(X >= 30). Give the probability to six decim... | Numeric | GPT-4o Mini 0% | Qwen3 235B A22B 2507 0% | 0% | 373 |
| gh_numeric_010 · model comparison · AIC is defined as 2 * NLL + 2 * k. Compute AIC when the minimized NLL is 48.3 and the mod... | Numeric | GPT-4o Mini 100% | Claude 3 Haiku 0% | 100% | 189 |
| gh_numeric_011 · model comparison · For nested models, compute the likelihood ratio statistic 2 * (NLL_restricted - NLL_full)... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 220 |
| gh_numeric_012 · uncertainty · For a one-parameter likelihood confidence interval, use minimum NLL + 1.92 as the 95% cut... | Numeric | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 222 |
| gh_subjective_013 · model fitting · Define model fitting in the context of mechanistic or phenomenological models and observe... | Short answer | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 276 |
| gh_subjective_014 · least squares · Explain the iterative logic shared by least squares and maximum likelihood fitting. | Short answer | Gemini 2.0 Flash Lite 001 100% | GPT-4o Mini 90% | 10% | 353 |
| gh_subjective_015 · maximum likelihood · Explain the difference between probability P(D \| theta) and likelihood L(theta \| D). | Short answer | Gemini 2.0 Flash Lite 001 100% | Claude 3 Haiku 60% | 40% | 292 |
| gh_subjective_016 · maximum likelihood · Why do likelihood-based workflows often use log likelihoods or negative log likelihoods? | Short answer | GPT-4o Mini 100% | Claude 3 Haiku 70% | 30% | 322 |
| gh_subjective_017 · identifiability · What does an identifiability problem mean in model fitting, and why does it matter for in... | Short answer | GPT-4o Mini 90% | Qwen3 235B A22B 2507 90% | 0% | 292 |
| gh_subjective_018 · bayesian inference · State Bayes' theorem in words and explain the roles of prior, likelihood, posterior, and... | Short answer | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 347 |
| gh_subjective_019 · bayesian inference · Give two advantages and two disadvantages or cautions of Bayesian methods for infectious... | Short answer | GPT-4o Mini 100% | Qwen3 235B A22B 2507 100% | 0% | 290 |
| gh_subjective_020 · bayesian workflow · What is a prior predictive check, and what question is it designed to answer? | Short answer | Gemini 2.0 Flash Lite 001 100% | Claude 3 Haiku 70% | 30% | 247 |
| gh_subjective_021 · bayesian workflow · What is a posterior predictive check, and how does it differ from a prior predictive check? | Short answer | Qwen3 235B A22B 2507 100% | Mistral Small 3.2 24B Instruct 90% | 10% | 324 |
| gh_subjective_022 · hmc diagnostics · List key MCMC or HMC diagnostics used in Bayesian workflows and explain what values or pa... | Short answer | Qwen3 235B A22B 2507 100% | Mistral Small 3.2 24B Instruct 70% | 30% | 372 |
| gh_subjective_023 · hmc diagnostics · What can cause divergent transitions in HMC, and what are reasonable responses? | Short answer | Qwen3 235B A22B 2507 83% | Mistral Small 3.2 24B Instruct 67% | 17% | 351 |
| gh_subjective_024 · pkpd strategy · Compare concentration-response, exposure-response, and dose-response strategies in PK-PD... | Short answer | Gemini 2.0 Flash Lite 001 100% | Mistral Small 3.2 24B Instruct 70% | 30% | 367 |
| gh_subjective_025 · population pk · What are the main components of a nonlinear mixed effects population PK model? | Short answer | Claude 3 Haiku 100% | Mistral Small 3.2 24B Instruct 90% | 10% | 352 |
| gh_applied_026 · dose optimisation · A malaria dose-optimisation study finds that small children have lower drug exposure unde... | Applied | Qwen3 235B A22B 2507 69% | Mistral Small 3.2 24B Instruct 54% | 15% | 384 |
| gh_applied_027 · model diagnostics · A population PK model has plausible parameter estimates, but CWRES versus time shows a cl... | Applied | Qwen3 235B A22B 2507 100% | Claude 3 Haiku 58% | 42% | 359 |
| gh_design_028 · bayesian workflow · Sketch a Bayesian workflow for fitting an epidemic model that will be used for decision s... | Critique/design | Qwen3 235B A22B 2507 88% | GPT-4o Mini 50% | 38% | 378 |
| gh_design_029 · individual based models · In an individual-based SEIR simulation, you want to add a hospitalised status for infecti... | Critique/design | GPT-4o Mini 47% | Claude 3 Haiku 40% | 7% | 378 |
| gh_design_030 · model selection · You compare two infectious disease models and one has lower AIC, while the other has bett... | Critique/design | Qwen3 235B A22B 2507 88% | Mistral Small 3.2 24B Instruct 56% | 31% | 382 |
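
Several numeric questions in the table state their formula and inputs in full, so their reference values can be recomputed independently. The sketch below does this in Python for gh_numeric_002, gh_numeric_005, gh_numeric_006, gh_numeric_008, and gh_numeric_009. The function names are hypothetical, and the question text is truncated on this page, so any rounding or constant the benchmark prescribes (for example a fixed value of ln(2)) is not reproduced here; treat the results as a sanity check rather than the benchmark's answer key.

```python
import math


def half_life(volume_l: float, clearance_l_per_h: float) -> float:
    """t1/2 = ln(2) * V / CL, using the exact natural log of 2."""
    return math.log(2) * volume_l / clearance_l_per_h


def allometric(typical: float, weight_kg: float, exponent: float,
               ref_weight_kg: float = 70.0) -> float:
    """Scale a typical parameter by (WT / 70)^exponent."""
    return typical * (weight_kg / ref_weight_kg) ** exponent


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# Inputs are taken from the visible question text in the table.
print(f"gh_numeric_002 half-life (h): {half_life(50, 5):.3f}")
print(f"gh_numeric_005 CL_i (L/hour): {allometric(8, 35, 0.75):.3f}")
print(f"gh_numeric_006 V_i (L):       {allometric(50, 14, 1.0):.3f}")
print(f"gh_numeric_008 P(X = 30):     {binom_pmf(30, 100, 0.3):.6f}")
print(f"gh_numeric_009 P(X >= 30):    {binom_upper_tail(30, 100, 0.3):.6f}")
```

Only the standard library is used, so the two binomial questions, on which every model scored 0%, can be checked without SciPy.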