Overview
The Playground helps you evaluate candidate recommendation or ranking policies using historical logs. The workflow is designed so you can answer a simple question quickly: "Is this candidate policy safe and worth shipping, based on offline evidence?" You will typically do three things: confirm your logs are usable, prepare a candidate file, then run evaluation and review the decision and diagnostics.
What You Need Before You Start
You will work with two file types (CSV or Parquet). A historical logs file contains what happened under your current (logging) policy. A candidate policy file contains what your new policy would recommend, expressed as user_id, item_id, and score columns. If you do not have a candidate file yet, the Playground can generate one for you.
Your logs should include user and item identifiers and an outcome column (reward). If you have exposure probabilities (propensity), include them, as they are required for confidence intervals, ESS, and stress tests. If your system is ranking-based, logging decision_id (or a similar grouping key) and position improves what the evaluator can tell you, but the Playground will guide you based on what it detects. Optional columns also include timestamp for time-based diagnostics.
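Before uploading, you can sanity-check the header locally. A minimal sketch, assuming the default column names mentioned above (your files may use different names and rely on the column mapping controls instead):

```python
import csv

REQUIRED = ["user_id", "item_id", "reward"]
RECOMMENDED = ["propensity"]  # needed for confidence intervals, ESS, stress tests
OPTIONAL = ["decision_id", "position", "timestamp"]

def audit_columns(path: str) -> dict:
    """Report which expected columns are missing from a CSV header."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))  # read only the header row
    cols = set(header)
    return {
        "missing_required": [c for c in REQUIRED if c not in cols],
        "missing_optional": [c for c in RECOMMENDED + OPTIONAL if c not in cols],
    }
```

If missing_required is non-empty, fix the mapping before running Readiness; missing optional columns only limit which diagnostics will be available.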
Step 1: Logging Readiness
Start with Logging Readiness when you are unsure whether a logs file is suitable for off-policy evaluation. Upload your historical logs and run the audit. The Readiness report is meant to be fast and practical. It tells you whether the file looks like serving logs (as opposed to a candidate score file), whether the core ingredients for value estimation are present, and what limitations apply given the shape of the data.
If required columns are missing or named differently, use the column mapping controls and re-run the audit. The report distinguishes between row-level logs (one row per decision, typical of bandit systems) and grouped ranking logs (multiple items per decision, detected via position column or multi-item decision grouping). It will also call out when certain analyses are limited, for example when retrieval availability is not logged in a ranking system.
Treat Readiness as a gate that prevents wasting time on an evaluation that cannot be trusted or cannot be computed. A "Blocked" status means reward or propensity is missing or unusable. "Limitations" are surfaced separately and do not prevent evaluation.
Step 2: Candidate File Generation
You need a candidate policy file before running evaluation. If you already have one, you can skip to Single Policy Evaluation. If not, Candidate File Generation helps you produce an evaluation-ready candidates file from your model and your logs.
Candidate generation currently supports two paths. You can upload precomputed scores you already produced offline, or you can score with a model you have exported to ONNX format. In the latter case, the system runs inference locally via ONNX Runtime (CPU) without executing arbitrary code. A third mode (scoring via an external API) is planned but not yet available.
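For the precomputed-scores path, an evaluation-ready candidate file is just user_id,item_id,score rows aligned to your logs. A minimal sketch of producing one offline; score_fn here is a placeholder for your own model, not part of the Playground:

```python
import csv

def write_candidates(logs_path: str, out_path: str, score_fn) -> int:
    """Score each distinct (user_id, item_id) pair from the logs into a candidate file."""
    seen = set()
    with open(logs_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["user_id", "item_id", "score"])
        for row in csv.DictReader(src):
            pair = (row["user_id"], row["item_id"])
            if pair in seen:
                continue  # one candidate row per distinct pair
            seen.add(pair)
            writer.writerow([pair[0], pair[1], score_fn(*pair)])
    return len(seen)
```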
In all cases, the output is a candidate file aligned to your logs and packaged with lightweight artifacts that make the run reproducible and auditable:
- candidates.csv: the scored candidate file (user_id,item_id,score)
- availability_manifest.json: detected availability source and set-size statistics
- candidate_alignment.json: overlap between logs and candidates
- candidate_passport.yaml: audit document with file hashes, alignment metrics, and warnings
Candidate generation typically scores the user-item pairs present in your logs. This is often sufficient for evaluating a re-ranking policy on logged items. If your production system has a separate retrieval stage and your logs do not include retrieval availability, candidate generation cannot invent the items that would have been retrieved. In those cases, you can still evaluate within the logged support, but you should not interpret the result as a full end-to-end retrieval+ranking replacement unless you have the availability signals needed to justify that claim.
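The overlap recorded in candidate_alignment.json can be approximated locally. A sketch under the assumption that both files use the default user_id/item_id column names:

```python
import csv

def logged_pairs(path: str) -> set:
    """Collect distinct (user_id, item_id) pairs from a CSV file."""
    with open(path, newline="") as f:
        return {(r["user_id"], r["item_id"]) for r in csv.DictReader(f)}

def pair_overlap(logs_path: str, candidates_path: str) -> float:
    """Fraction of logged user-item pairs covered by the candidate file."""
    logged = logged_pairs(logs_path)
    if not logged:
        return 0.0
    return len(logged & logged_pairs(candidates_path)) / len(logged)
```

A low value here is the same signal as the Playground's low-overlap warning: evaluation is then restricted to a small logged support.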
Step 3: Single Policy Evaluation
Single Policy Evaluation is the fastest path from "I have logs and a candidate" to a decision. Upload your logs and a single candidate file and let the Playground auto-detect columns. If mapping is needed, configure it once and proceed. You can keep defaults for most runs and only adjust advanced settings (gating thresholds, propensity estimation) when you have a specific reason.
Run the evaluation and focus on the Summary Board first. The summary communicates the decision state (SHIP / BLOCK / INCONCLUSIVE), the point estimate, confidence interval, uplift, and effective sample size. Use the Gate & Diagnostics tab next to understand whether the estimate is well-supported. When confidence intervals or robustness checks are unavailable, the page explains the specific missing prerequisite rather than implying the policy is bad.
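The Playground chooses its estimator internally; to make the summary numbers concrete, here is a plain inverse-propensity-scoring (IPS) sketch of what the point estimate and ESS roughly measure. cand_probs is the candidate policy's probability of the logged item (a 0/1 match indicator for a deterministic policy):

```python
def ips_summary(rewards, propensities, cand_probs):
    """Plain IPS value estimate plus effective sample size of the weights."""
    weights = [c / p for c, p in zip(cand_probs, propensities)]
    value = sum(w * r for w, r in zip(weights, rewards)) / len(rewards)
    # ESS shrinks when a few heavy weights dominate the estimate.
    ess = sum(weights) ** 2 / sum(w * w for w in weights)
    return value, ess
```

An ESS far below the row count means the estimate rests on few effective samples, which is why the Gate & Diagnostics tab reports it alongside the point estimate.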
Other tabs provide deeper detail: CI & Bounds for statistical intervals, Passport for the full audit document (estimator choice, badges, reproducibility identifiers), Stress Tests for robustness under simulated constraints, and Power & MDE for sample-size adequacy. If you have configured multiple metrics, a multi-metric summary and constraint recommendation card will appear above the standard summary.
Step 4: Multi-Policy Leaderboard
Use the Leaderboard when you want to compare multiple candidates side-by-side on the same logs. Upload one logs file and two or more candidate files. The Playground evaluates them consistently using the same bootstrap seed and ranks them by lower confidence bound (LCB), a conservative criterion intended to reduce false wins.
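The ranking criterion can be reproduced on your own per-row value estimates. A percentile-bootstrap sketch (the Playground's exact bootstrap procedure is internal; the shared seed mirrors its consistent-evaluation behavior):

```python
import random

def bootstrap_lcb(values, seed=0, n_boot=1000, alpha=0.05):
    """Lower confidence bound of the mean via a percentile bootstrap."""
    rng = random.Random(seed)  # same seed for every policy being compared
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    return means[int(alpha * n_boot)]

def rank_by_lcb(per_policy_values, seed=0):
    """Rank policies by LCB, best first: a conservative guard against false wins."""
    scored = [(name, bootstrap_lcb(v, seed)) for name, v in per_policy_values.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

Ranking by LCB rather than the point estimate penalizes policies whose apparent advantage rests on noisy, wide intervals.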
Use the focus controls to inspect any candidate in detail, then use Pairwise Diffs when you need a direct head-to-head view. If the confidence interval for the pairwise difference excludes zero, you have evidence of a real difference between the two policies. Treat leaderboard results as a shortlist generator and use the single-policy view for deeper inspection of the winner.
All tabs from single-policy evaluation (Gate & Diagnostics, CI & Bounds, Passport, Stress Tests, Power & MDE) are available per-policy in the leaderboard. Switch between policies using the Focus Policy dropdown or by clicking a row in the ranking table.
Optional: Exploration and Segmentation Diagnostics
If you provide segment columns (via the "Segment by" field in evaluation settings), the Playground can show where OPE is weak and which segments are driving uncertainty. This is most useful when you are deciding whether to collect more data, broaden exploration, or constrain the policy rollout to safer regions.
The output highlights the segments that limit reliability rather than producing a long statistical report. You will see a "Where OPE is weak" table with per-segment match rate, ESS, and weight-tail statistics, and a "Suggested exploration targets" list with ranked segments. If the segment columns are not present in your logs, the evaluation still runs normally but skips the exploration output with a non-blocking warning.
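The per-segment statistics can be sketched directly from importance weights. Assuming you already have a weight per logged row, ESS within a segment is (Σw)² / Σw²:

```python
from collections import defaultdict

def segment_ess(segments, weights):
    """Effective sample size per segment: (sum w)^2 / sum(w^2)."""
    by_seg = defaultdict(list)
    for seg, w in zip(segments, weights):
        by_seg[seg].append(w)
    return {s: sum(ws) ** 2 / sum(w * w for w in ws) for s, ws in by_seg.items()}
```

Segments whose ESS sits far below their row count are dominated by a few heavy weights, which is exactly what the "Where OPE is weak" table surfaces.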
Outputs and Reproducibility
Each run produces downloadable artifacts so you can reproduce results and share them with teammates. Every evaluation includes a config hash, bootstrap seed, and run identifier. The Passport tab provides a structured record of the configuration, estimator choice, and quality signals. Use these artifacts when you need to revisit a decision later or compare results across candidates with consistent settings.
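The Passport's exact fingerprinting scheme is internal to the Playground; as an illustration of why a config hash makes runs comparable, here is one conventional approach (canonical JSON plus SHA-256 is an assumption, not the Playground's implementation):

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable short fingerprint of a run configuration."""
    # sort_keys makes the hash independent of dict insertion order
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Two runs with the same hash were configured identically, so their estimates are directly comparable; any settings change produces a new hash.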
Troubleshooting
If the Playground cannot compute a confidence interval or refuses to run a check, the cause is usually missing prerequisites in the logs, mismatched column mapping, or insufficient support overlap. Start by revisiting Logging Readiness and the diagnostics section of the run, and confirm that reward and propensity are mapped as intended.
If you see "CI: N/A" or "ESS: unavailable," check whether propensities are present or enable "Estimate propensities if missing" in advanced settings. If the decision is INCONCLUSIVE, the Gate & Diagnostics tab will explain which check failed. If you suspect key collisions because the same user can see the same item multiple times, add a decision_id column to your logs and candidates when possible.
Two issues are common with candidate generation. ONNX mode requires onnxruntime to be installed. Low-overlap warnings after upload usually mean that user/item IDs are formatted differently between the two files, or that the candidate file covers a different time period than the logs.
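A quick way to confirm an ID-format mismatch is to compare coarse signatures of the ID columns in both files. A sketch (zero-padded strings versus plain integers is the classic culprit):

```python
def id_signature(values):
    """Coarse format signature of an ID column: kind plus length range."""
    as_str = [str(v) for v in values]
    kinds = {"numeric" if s.isdigit() else "string" for s in as_str}
    lengths = [len(s) for s in as_str]
    return kinds, (min(lengths), max(lengths))
```

If the logs give ({'numeric'}, (3, 3)) and the candidates give ({'numeric'}, (1, 4)), the candidate IDs are probably unpadded and the join silently fails.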
Best Practices
Run Logging Readiness on new datasets before spending time on candidate generation or multi-policy comparisons. Prefer true logging propensities when you have them. Keep candidate files aligned in schema and time range when comparing multiple policies. Use the summary to decide, and the diagnostics to justify why the result should be trusted.