Global health model benchmark

Best model

Qwen3 235B A22B 2507

85% mean score

Overall mean

78%

30 tasks across 5 model views

Numeric pass

73%

12 exact-answer questions in scope

Avg tokens

310

Mean total tokens per model/question attempt

Leaderboard

Mean score under the active filters

30 tasks

Token breakdown

Average generated tokens by model

Task family profile

Grouped scores by question abstraction

Models shown: Gpt 4O Mini, Gemini 2.0 Flash Lite 001, Claude 3 Haiku, Mistral Small 3.2 24B Instruct, Qwen3 235B A22B 2507

Per-question breakdown

Each row shows the score range; open the question page for the full model-by-model breakdown.
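The Spread column appears to be the best score minus the worst score for each question. A minimal sketch of that calculation follows, assuming per-model scores are available as percentages. The function name `score_spread` and the dictionary layout are hypothetical and not taken from the benchmark's code. The example uses only the two scores visible for gh_applied_027. Because the dashboard presumably computes spread from unrounded scores, a displayed spread can differ by a point from the difference of the displayed percentages.

```python
def score_spread(scores):
    """Return (best_model, worst_model, spread) for one question.

    `scores` maps a model name to its score in percent.
    Assumption: spread = best score - worst score.
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]


# Only the best and worst scores are visible on this page for gh_applied_027;
# the other three models' scores are not shown, so they are left out here.
row_027 = {"Qwen3 235B A22B 2507": 100, "Claude 3 Haiku": 58}
best, worst, spread = score_spread(row_027)
print(f"{best} {row_027[best]}% | {worst} {row_027[worst]}% | spread {spread}%")
```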

Question	Type	Best	Worst	Spread	Avg tokens
gh_numeric_001 · pkpd fundamentals · Using AUC = F * Dose / CL, compute AUC when bioavailability F = 0.8, dose = 500 mg, and c...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	225
gh_numeric_002 · pkpd fundamentals · Using t1/2 = ln(2) * V / CL, compute the half-life when V = 50 L and CL = 5 L/hour. Use l...	Numeric	Gpt 4O Mini 100%	Claude 3 Haiku 0%	100%	271
gh_numeric_003 · pkpd response models · For an EMAX model E = E0 + (EMAX * C) / (EC50 + C), compute E when E0 = 0, EMAX = 90, EC5...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	255
gh_numeric_004 · pkpd response models · For a sigmoid EMAX model E = E0 + (EMAX * C^gamma) / (EC50^gamma + C^gamma), compute E wh...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	329
gh_numeric_005 · population pk · Clearance follows CL_i = CL_typ * (WT_i / 70)^0.75. If CL_typ = 8 L/hour and WT_i = 35 kg...	Numeric	Gemini 2.0 Flash Lite 001 100%	Qwen3 235B A22B 2507 0%	100%	291
gh_numeric_006 · population pk · Volume follows V_i = V_typ * (WT_i / 70)^1. If V_typ = 50 L and WT_i = 14 kg, what is V_i?	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	270
gh_numeric_007 · covariate modelling · In a one-parameter stepwise covariate screen, use a forward-inclusion threshold of Delta...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	223
gh_numeric_008 · maximum likelihood · For X ~ Binomial(n = 100, p = 0.3), compute P(X = 30). Give the probability to six decima...	Numeric	Gpt 4O Mini 0%	Qwen3 235B A22B 2507 0%	0%	366
gh_numeric_009 · maximum likelihood · For X ~ Binomial(n = 100, p = 0.3), compute P(X >= 30). Give the probability to six decim...	Numeric	Gpt 4O Mini 0%	Qwen3 235B A22B 2507 0%	0%	373
gh_numeric_010 · model comparison · AIC is defined as 2 * NLL + 2 * k. Compute AIC when the minimized NLL is 48.3 and the mod...	Numeric	Gpt 4O Mini 100%	Claude 3 Haiku 0%	100%	189
gh_numeric_011 · model comparison · For nested models, compute the likelihood ratio statistic 2 * (NLL_restricted - NLL_full)...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	220
gh_numeric_012 · uncertainty · For a one-parameter likelihood confidence interval, use minimum NLL + 1.92 as the 95% cut...	Numeric	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	222
gh_subjective_013 · model fitting · Define model fitting in the context of mechanistic or phenomenological models and observe...	Short answer	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	276
gh_subjective_014 · least squares · Explain the iterative logic shared by least squares and maximum likelihood fitting.	Short answer	Gemini 2.0 Flash Lite 001 100%	Gpt 4O Mini 90%	10%	353
gh_subjective_015 · maximum likelihood · Explain the difference between probability P(D | theta) and likelihood L(theta | D).	Short answer	Gemini 2.0 Flash Lite 001 100%	Claude 3 Haiku 60%	40%	292
gh_subjective_016 · maximum likelihood · Why do likelihood-based workflows often use log likelihoods or negative log likelihoods?	Short answer	Gpt 4O Mini 100%	Claude 3 Haiku 70%	30%	322
gh_subjective_017 · identifiability · What does an identifiability problem mean in model fitting, and why does it matter for in...	Short answer	Gpt 4O Mini 90%	Qwen3 235B A22B 2507 90%	0%	292
gh_subjective_018 · bayesian inference · State Bayes' theorem in words and explain the roles of prior, likelihood, posterior, and...	Short answer	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	347
gh_subjective_019 · bayesian inference · Give two advantages and two disadvantages or cautions of Bayesian methods for infectious...	Short answer	Gpt 4O Mini 100%	Qwen3 235B A22B 2507 100%	0%	290
gh_subjective_020 · bayesian workflow · What is a prior predictive check, and what question is it designed to answer?	Short answer	Gemini 2.0 Flash Lite 001 100%	Claude 3 Haiku 70%	30%	247
gh_subjective_021 · bayesian workflow · What is a posterior predictive check, and how does it differ from a prior predictive check?	Short answer	Qwen3 235B A22B 2507 100%	Mistral Small 3.2 24B Instruct 90%	10%	324
gh_subjective_022 · hmc diagnostics · List key MCMC or HMC diagnostics used in Bayesian workflows and explain what values or pa...	Short answer	Qwen3 235B A22B 2507 100%	Mistral Small 3.2 24B Instruct 70%	30%	372
gh_subjective_023 · hmc diagnostics · What can cause divergent transitions in HMC, and what are reasonable responses?	Short answer	Qwen3 235B A22B 2507 83%	Mistral Small 3.2 24B Instruct 67%	17%	351
gh_subjective_024 · pkpd strategy · Compare concentration-response, exposure-response, and dose-response strategies in PK-PD...	Short answer	Gemini 2.0 Flash Lite 001 100%	Mistral Small 3.2 24B Instruct 70%	30%	367
gh_subjective_025 · population pk · What are the main components of a nonlinear mixed effects population PK model?	Short answer	Claude 3 Haiku 100%	Mistral Small 3.2 24B Instruct 90%	10%	352
gh_applied_026 · dose optimisation · A malaria dose-optimisation study finds that small children have lower drug exposure unde...	Applied	Qwen3 235B A22B 2507 69%	Mistral Small 3.2 24B Instruct 54%	15%	384
gh_applied_027 · model diagnostics · A population PK model has plausible parameter estimates, but CWRES versus time shows a cl...	Applied	Qwen3 235B A22B 2507 100%	Claude 3 Haiku 58%	42%	359
gh_design_028 · bayesian workflow · Sketch a Bayesian workflow for fitting an epidemic model that will be used for decision s...	Critique/design	Qwen3 235B A22B 2507 88%	Gpt 4O Mini 50%	38%	378
gh_design_029 · individual based models · In an individual-based SEIR simulation, you want to add a hospitalised status for infecti...	Critique/design	Gpt 4O Mini 47%	Claude 3 Haiku 40%	7%	378
gh_design_030 · model selection · You compare two infectious disease models and one has lower AIC, while the other has bett...	Critique/design	Qwen3 235B A22B 2507 88%	Mistral Small 3.2 24B Instruct 56%	31%	382
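
Several numeric questions state their formula and all inputs in the visible text, so the arithmetic can be sketched directly. The snippet below covers gh_numeric_002, gh_numeric_006, gh_numeric_008 and gh_numeric_009 using only the Python standard library. The function names are hypothetical, and these are not the benchmark's reference answers or its grading code. For gh_numeric_002 the instruction after "Use l..." is cut off on this page, so the sketch assumes the exact value of ln(2) from `math.log(2)`; a rounded constant would shift the result slightly.

```python
import math


def half_life(v, cl):
    """t1/2 = ln(2) * V / CL (gh_numeric_002); exact ln(2) is an assumption."""
    return math.log(2) * v / cl


def allometric_volume(v_typ, wt, exponent=1.0, ref_wt=70.0):
    """V_i = V_typ * (WT_i / 70)^exponent (gh_numeric_006)."""
    return v_typ * (wt / ref_wt) ** exponent


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p) (gh_numeric_008)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the pmf (gh_numeric_009)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


print(f"gh_numeric_002: t1/2 = {half_life(50, 5):.2f} hours")
print(f"gh_numeric_006: V_i = {allometric_volume(50, 14):.1f} L")
print(f"gh_numeric_008: P(X = 30) = {binom_pmf(30, 100, 0.3):.6f}")
print(f"gh_numeric_009: P(X >= 30) = {binom_upper_tail(30, 100, 0.3):.6f}")
```

Summing the exact pmf avoids a normal approximation, which would not be accurate to the six decimal places that the two binomial questions ask for.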