Playground v3 · CounterFact

Single Policy Evaluation

Evaluate one candidate policy against historical logs. Get detailed diagnostics, confidence intervals, and a comprehensive audit passport.

Detailed single-policy diagnostics
Gate & CI analysis
Stress testing & robustness checks
Full audit passport

Start Single Policy Evaluation →
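
For readers new to this kind of evaluation, the sketch below shows one common way to estimate a candidate policy's value from logs with a confidence interval: inverse propensity scoring (IPS) with a normal-approximation interval. It is an illustration only; this page does not say which estimator CounterFact uses, and the function name `ips_estimate` and its inputs are hypothetical.

```python
import math
from statistics import NormalDist


def ips_estimate(rewards, logging_propensities, target_probs, confidence=0.95):
    """Inverse propensity scoring estimate with a normal-approximation CI.

    Hypothetical helper for illustration; not CounterFact's API.
    rewards[i]              : observed reward for logged action i
    logging_propensities[i] : probability the logging policy chose that action
    target_probs[i]         : probability the candidate policy would choose it
    """
    n = len(rewards)
    if not (n == len(logging_propensities) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    if n < 2:
        raise ValueError("need at least two logged records")
    if any(p <= 0 for p in logging_propensities):
        raise ValueError("logging propensities must be positive")

    # Reweight each logged reward by how much more (or less) likely the
    # candidate policy is to take the logged action.
    weighted = [
        r * p_target / p_log
        for r, p_log, p_target in zip(rewards, logging_propensities, target_probs)
    ]
    mean = sum(weighted) / n
    variance = sum((w - mean) ** 2 for w in weighted) / (n - 1)
    std_error = math.sqrt(variance / n)

    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_error, mean + z * std_error
```

A call such as `ips_estimate(rewards, logging_propensities, target_probs)` returns the point estimate followed by the lower and upper interval bounds. The normal approximation can be poor when importance weights are heavy-tailed, which is one reason diagnostics and stress tests matter.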

Multi-Policy Comparison (Leaderboard)

Compare multiple candidate policies side-by-side. Rank them by estimated uplift, inspect pairwise differences, and identify the best policy.

Rank multiple policies by uplift
Side-by-side comparison
Pairwise difference analysis
Per-policy diagnostics

Start Leaderboard Comparison →

Candidate File Generation

Generate candidate policy files from your historical logs. Score items with precomputed scores, an external API, or an ONNX model, then download evaluation-ready files.

Upload precomputed item scores
Score via external API
Score via ONNX model
Download ready-to-evaluate files

Generate Candidate File →

Logging Readiness

Audit your historical logs before evaluation. Check whether your data supports reliable off-policy evaluation (OPE).

Auto-detect scenario (bandit vs ranking)
Propensity health & coverage checks
Availability & candidate-set logging audit
Actionable recommendations

Check Logging Readiness →
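
As a rough illustration of what a propensity health check can involve, the sketch below computes a few standard summaries: the smallest logged propensity, the share of records below a floor, the largest importance weight, and the Kish effective sample size. The function name `propensity_health` and the default floor of 0.01 are assumptions made for this example, not CounterFact's actual checks or thresholds.

```python
def propensity_health(logging_propensities, target_probs, floor=0.01):
    """Summarise propensity and importance-weight health for logged data.

    Hypothetical helper for illustration; not CounterFact's API.
    The `floor` default is an arbitrary example threshold.
    """
    n = len(logging_propensities)
    if n == 0 or n != len(target_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(p <= 0 for p in logging_propensities):
        raise ValueError("logging propensities must be positive")

    # Importance weight: candidate probability over logging probability.
    weights = [t / l for l, t in zip(logging_propensities, target_probs)]
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)

    # Kish effective sample size: equals n when all weights are equal and
    # shrinks toward 1 as a few records dominate the estimate.
    ess = (sum_w ** 2) / sum_w2 if sum_w2 > 0 else 0.0

    return {
        "n": n,
        "min_propensity": min(logging_propensities),
        "share_below_floor": sum(p < floor for p in logging_propensities) / n,
        "max_weight": max(weights),
        "effective_sample_size": ess,
        "ess_ratio": ess / n,
    }
```

A low `ess_ratio` or a high `share_below_floor` suggests the logs give little support for the candidate policy's actions, so estimates from them are likely to be noisy.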
