CounterFact Playground Tutorial

Overview

The CounterFact Playground is an interactive tool for evaluating recommendation and ranking policies using historical logs. The playground supports two evaluation modes: Single Policy Evaluation (a deep dive into one candidate policy) and Multi-Policy Leaderboard (a ranked comparison of multiple candidate policies). This platform helps you:

  • Evaluate candidate policies before deploying them to production
  • Compare multiple policies side-by-side to identify the best performer
  • Understand statistical confidence in your evaluation results
  • Validate robustness through stress testing and diagnostics
  • Make data-driven decisions with comprehensive audit trails

Getting Started

Prerequisites

Before using the playground, you will need two types of files: a Historical Logs File which contains user interactions with your current (logging) policy, and one or more Candidate Policy Files which contain recommendations from your new candidate policy. Currently we support both .csv and .parquet formats.

The Historical Logs file should contain the following columns: user_id, item_id, reward, and optionally propensity (exposure probability), position (ranking position), and timestamp (for temporal analysis). Candidate Policy Files should contain: user_id, item_id, score (prediction/score), and optionally timestamp (for temporal analysis).

Note that the reward column can represent different metrics depending on your system. For instance, binary metrics (click/no-click, purchase/no-purchase) use 0 or 1, while continuous values like ratings (1-5 stars, 0-10 scale) use numeric ranges. The playground will auto-detect your metric type, but you can specify it manually if needed.
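
If you are preparing files programmatically, a minimal sketch with pandas might look like the following; the values and file names are purely illustrative and simply follow the standard column names listed above:

  # Illustrative only: build tiny example files with the standard column names.
  import pandas as pd

  logs = pd.DataFrame({
      "user_id":    [1, 1, 2, 3],
      "item_id":    [10, 42, 10, 7],
      "reward":     [1, 0, 0, 1],                    # binary click/no-click; could also be a rating
      "propensity": [0.25, 0.10, 0.25, 0.05],        # optional: exposure probability under the logging policy
  })
  logs.to_csv("historical_logs.csv", index=False)

  candidate = pd.DataFrame({
      "user_id": [1, 2, 3],
      "item_id": [42, 10, 7],
      "score":   [0.91, 0.63, 0.77],                 # candidate policy's prediction/recommendation score
  })
  candidate.to_csv("candidate_policy.csv", index=False)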

Accessing the Playground

Navigate to /playground in your CounterFact instance. You will see two options: Single Policy Evaluation for evaluating one candidate, and Multi-Policy Comparison (Leaderboard) for comparing 2 or more candidates.

Single Policy Evaluation

Step 1: Upload Your Data

Click "Choose File" under "Historical Logs (required)" and select your logs file. This file contains interactions from your current policy. Then click "Choose File" under "Candidate Policy (required)" and select your candidate policy file. This file contains recommendations from your new policy. Wait for auto-detection, as the system automatically detects column names and data types. A "Configure Column Mapping" button appears if mapping is needed.

Step 2: Configure Column Mapping (If Needed)

If your column names do not match standard names, click "Configure Column Mapping". The system will present two sections: one for Historical Logs and one for Candidate Policy. For Historical Logs, map your columns to user_id (user identifier), item_id (item/product identifier), reward (outcome metric), and optionally propensity (exposure probability) and position (ranking position). For Candidate Policy, map your columns to user_id, item_id, score (prediction or recommendation score), and optionally timestamp. The auto-mapping system suggests mappings based on common synonyms (for example, movieId → item_id, rating → reward). Review the suggestions and click "Save" to apply mappings.

Step 3: Configure Advanced Settings (Optional)

Expand the collapsible sections to adjust settings if needed. Gating Thresholds control when the system blocks a policy or requests more data. The default settings work well for most cases: Min Overlap (20%) sets the minimum required overlap between your candidate policy's recommendations and the logged data, Min ESS (1000) sets the minimum effective sample size for reliable confidence intervals, Min Uplift (0.5%) specifies the minimum improvement needed to recommend shipping, and Harm Threshold (0.5%) sets the maximum acceptable harm before blocking.
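
As an illustration of what the overlap gate measures, the sketch below computes the fraction of candidate (user_id, item_id) pairs that also appear in the logs; the exact definition used by the playground may differ, and the file names are the hypothetical ones from the earlier example:

  # Hypothetical sketch: fraction of candidate recommendations supported by the logs.
  import pandas as pd

  logs = pd.read_csv("historical_logs.csv")
  candidate = pd.read_csv("candidate_policy.csv")

  logged_pairs = set(zip(logs["user_id"], logs["item_id"]))
  candidate_pairs = list(zip(candidate["user_id"], candidate["item_id"]))

  overlap = sum(pair in logged_pairs for pair in candidate_pairs) / len(candidate_pairs)
  print(f"Support overlap: {overlap:.1%}")  # compare against the Min Overlap gate (default 20%)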

Propensity Estimation helps when your logs do not include exposure probabilities. If you check "Estimate propensities if missing," the system will calculate them from your data using softmax over scores (from the score/rating column) or uniform per-user distribution (fallback). This enables confidence intervals and statistical tests, but estimated propensities are less reliable than true ones. Use this option only if you do not have propensity data from your logging system. For now, the default settings are recommended.
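
To make the estimation option concrete, here is a minimal sketch of per-user softmax propensities with a uniform fallback; it assumes the logs expose a score column and is not the playground's exact implementation:

  # Illustrative sketch of per-user propensity estimation (assumed behavior, not the exact internals).
  import numpy as np
  import pandas as pd

  def estimate_propensities(logs: pd.DataFrame) -> pd.Series:
      if "score" in logs.columns:
          # Per-user softmax over scores: exp(score) normalized within each user's rows.
          shifted = logs["score"] - logs.groupby("user_id")["score"].transform("max")
          weights = np.exp(shifted)
          return weights / weights.groupby(logs["user_id"]).transform("sum")
      # Fallback: uniform probability over each user's logged items.
      return 1.0 / logs.groupby("user_id")["user_id"].transform("size")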

Step 4: Run Evaluation

Click the "Run Selected" button at the bottom. A progress bar appears showing the evaluation status. Depending on your data size, this typically takes few seconds to few minutes. When processing completes, the page updates with your results across multiple tabs.

Step 5: Understanding Results

After evaluation completes, you will see results in several tabs and one Summary Board.

Summary Board displays the Decision (SHIP, BLOCK, or INCONCLUSIVE), Point Estimate (expected reward value), CI (95% Confidence Interval [lower, upper]), ESS (Post-clip) (Effective Sample Size after importance weighting), Baseline Mean (average reward from logging policy), Uplift Point (difference between candidate and baseline), Uplift LCB (Lower Confidence Bound for uplift), and Config Hash (reproducibility identifier). The Decision is what you came for: SHIP means your candidate policy shows significant positive improvement, BLOCK means it would harm performance, and INCONCLUSIVE in yellow means the evidence is not strong enough to decide yet.

Gate & Diagnostics Tab shows whether evaluation passed quality gates. You will see Overlap Metrics (how much the candidate policy overlaps with logged data), ESS (effective sample size, where higher means more reliable), Gate Status (which gates passed or failed), and Thresholds Used (the actual threshold values applied). If your decision was INCONCLUSIVE, this tab tells you why (e.g. overlap was only 12% where you needed 20%, or ESS was 750 where you needed 1000).

CI & Bounds Tab shows confidence intervals and statistical bounds. You will see the Point Estimate (expected reward), Confidence Interval (95% CI for reward), Uplift (difference between candidate and baseline), and Bounds (Uplift) (conservative bounds when CI unavailable). If CI shows as "N/A," the system explains why. Common reasons include propensities were not available, overlap was too low, or ESS was too small. When CI is not available, the system provides Observed-Support Bounds instead, where conservative estimates are based on the overlapping region between your candidate and logged data.

Passport Tab is your comprehensive audit document showing every aspect of the evaluation. Base Evaluation summarizes the core numbers: point estimate, CI, LCB, and sample size. Estimator Details explains which statistical method was used (IPS, SNIPS, DR) and why it was chosen based on your data characteristics. Badges provide quality signals: Propensity Confidence (PC) indicates whether your propensity scores are reliable (OK means good coverage and low tail risk, while Limited or None suggests caution), and Sensitivity shows how sensitive results are to the clipping threshold. CI Method documents the bootstrap details: how many resamples (B=600 by default), what scheme (paired), and the random seed. Stress Test Results show how your candidate performs under adverse conditions. Reproducibility provides the config hash, bootstrap seed, and data counts for recreating the exact evaluation later.
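
For intuition about the estimators named above, the following simplified sketch shows IPS and SNIPS on clipped importance weights (illustrative formulas only; DR additionally requires a learned reward model and is omitted here):

  # Simplified IPS / SNIPS sketch with clipped importance weights (for illustration).
  import numpy as np

  def ips_snips(rewards, candidate_prob, logging_prob, clip=10.0):
      w = np.clip(candidate_prob / logging_prob, 0.0, clip)  # clipped importance weights
      ips = np.mean(w * rewards)                 # inverse propensity scoring estimate
      snips = np.sum(w * rewards) / np.sum(w)    # self-normalized variant, lower variance
      return ips, snips

  rewards = np.array([1, 0, 0, 1])
  candidate_prob = np.array([0.30, 0.05, 0.20, 0.10])   # P(item | user) under the candidate
  logging_prob = np.array([0.25, 0.10, 0.25, 0.05])     # logged propensities
  print(ips_snips(rewards, candidate_prob, logging_prob))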

Stress Tests Tab validates robustness by simulating realistic constraints. Inventory Caps test what happens if popular items run out of stock. The system identifies your most popular items based on logged interactions, caps their availability, and recalculates performance. For each stress test, you will see Status (PASS for robust, FAIL for fragile, or INCONCLUSIVE), Metrics showing how LCB changes under stress, and a note that stress tests automatically use propensities from main evaluation. If your candidate passes stress tests, you can ship with more confidence.
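
As a rough illustration of the inventory-cap idea (an assumption about the mechanics, not the playground's exact procedure), the sketch below caps how many logged interactions of the most popular items are kept before the estimator is re-run:

  # Hypothetical inventory-cap stress: keep at most `cap` interactions per top item.
  import pandas as pd

  def apply_inventory_cap(logs: pd.DataFrame, top_k: int = 10, cap: int = 100) -> pd.DataFrame:
      popular = logs["item_id"].value_counts().head(top_k).index
      capped_parts = []
      for item_id, group in logs.groupby("item_id"):
          capped_parts.append(group.head(cap) if item_id in popular else group)
      # Re-run the estimator on the capped logs and compare the stressed LCB to the original.
      return pd.concat(capped_parts).sort_index()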

Power & MDE Tab helps you understand whether your sample size is adequate for detecting meaningful effects. MDE (Uplift) is the Minimum Detectable Effect, which is the smallest uplift you can reliably detect with 80% power. Power for Δmin shows your statistical power for detecting your configured minimum uplift threshold (aim for at least 80% power). Uplift CI Width indicates precision. While narrower intervals mean more precise estimates, wide intervals suggest high variance in your importance weights. ESS shows both your current effective sample size and what you would need for adequate power.

Multi-Policy Leaderboard

Step 1: Upload Your Data

Upload your Historical Logs file once. This single logs file is used for all policy comparisons. Then click "Choose File" under "Candidate Policies (≥2 required)" and select your first candidate policy file. Click "Choose File" again and add your second candidate. Repeat for as many policies as you want to compare (minimum 2). You will see a list of selected files below the input. All files should have the same schema.

Step 2: Configure Settings

Column mapping works the same way as in single policy evaluation. You only need to configure it once, and it applies to all candidate files (they should share the same schema). Configure gating thresholds and propensity estimation as needed, using the same options as single policy evaluation.

Step 3: Run Evaluation

Click "Run Selected" to evaluate all policies. The system evaluates all policies using the same logs and bootstrap seed, ensuring fair comparison. This takes slightly longer since multiple policies are being processed, but you will see progress for each one. Policies are automatically ranked by performance.

Step 4: Understanding Leaderboard Results

When evaluation completes, the Summary Board shows the top-ranked policy by default. Use the "Focus Policy" dropdown to view any policy's summary, or toggle to "Aggregate view" to see overall statistics across all policies.

Leaderboard Tab displays a ranked table showing Rank (performance ranking where 1 is best), Policy (the filename like policy1.csv, policy2.csv), LCB (Lower Confidence Bound used for ranking), Point (point estimate), CI Status (ok, skipped_gates, or error), Overlap (support overlap percentage), ESS (Effective Sample Size), and Badges (CI status indicators). Policies are ranked by LCB (lower confidence bound for uplift), not point estimates. This is conservative and it ranks policies by the improvement you can confidently expect. Click any row to focus the Summary Board on that policy.

Pairwise Diffs Tab lets you compare two policies directly. You will see side-by-side policy cards with key metrics for each policy (point, LCB, overlap, ESS), and a Difference Card showing the statistical difference between them: Point Diff (expected difference in reward), Uplift Diff (difference in uplift over baseline), and CI for Diff (confidence interval for the difference using paired bootstrap). If the CI for the difference does not include zero, there is strong evidence that one policy is truly better. Note that pairwise comparisons use paired bootstrap resampling but do not apply multiple-testing corrections, so treat these as exploratory.

Other Tabs show the same information as single policy evaluation, but on a per-policy basis: Gate & Diagnostics (per-policy gate status), CI & Bounds (per-policy confidence intervals), Passport (per-policy audit documents), Stress Tests (per-policy robustness validation), and Power & MDE (per-policy power analysis). Use the "Focus Policy" dropdown in the Summary Board to switch between policies, or click a row in the Leaderboard table.

Understanding Key Concepts

Decision States

Decision States reflect the gating logic. SHIP means your candidate shows significant positive uplift (uplift LCB ≥ Δmin), with statistical evidence it is better than baseline. BLOCK means significant negative impact (uplift LCB ≤ -Δharm), where deploying would likely harm performance. INCONCLUSIVE means the evidence is not strong enough. This happens when gates fail (e.g. low overlap, low ESS) or when uplift LCB falls between -Δharm and Δmin.
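
In pseudocode terms, the gating logic described above reduces to something like the following simplified sketch (the real system also evaluates the overlap and ESS gates before reaching this step):

  # Simplified decision sketch based on the description above (not the exact internals).
  def decide(uplift_lcb: float, delta_min: float = 0.005, delta_harm: float = 0.005,
             gates_passed: bool = True) -> str:
      if not gates_passed:
          return "INCONCLUSIVE"      # e.g. low overlap or low ESS
      if uplift_lcb >= delta_min:
          return "SHIP"              # confidently better than baseline
      if uplift_lcb <= -delta_harm:
          return "BLOCK"             # confidently worse than baseline
      return "INCONCLUSIVE"          # evidence not strong enough either way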

Confidence Intervals (CI)

Confidence Intervals (CI) quantify uncertainty. A 95% confidence interval means that if you repeated this evaluation many times with different bootstrap samples, roughly 95% of the resulting intervals would contain the true value. CI Lower is the conservative lower bound, CI Upper is the conservative upper bound, and LCB (Lower Confidence Bound) is the value used for decision-making (the CI Lower for uplift). The width tells you about precision: narrow intervals mean precise estimates, wide intervals mean high uncertainty.
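
To illustrate how such an interval is formed, here is a minimal percentile-bootstrap sketch; the playground's paired bootstrap (B=600, fixed seed) follows the same general idea, though the details may differ:

  # Minimal percentile-bootstrap CI sketch (illustrative; the playground's paired bootstrap may differ).
  import numpy as np

  def bootstrap_ci(values, b: int = 600, alpha: float = 0.05, seed: int = 0):
      rng = np.random.default_rng(seed)
      n = len(values)
      estimates = [np.mean(rng.choice(values, size=n, replace=True)) for _ in range(b)]
      return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

  values = np.random.default_rng(1).normal(loc=0.01, scale=0.05, size=2000)
  print(bootstrap_ci(values))  # [lower, upper] 95% interval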

Effective Sample Size (ESS)

Effective Sample Size (ESS) measures how much information your weighted sample contains. Higher ESS means more reliable results. ESS (Post-clip) is calculated after importance weighting and clipping. The Min ESS Gate requires ESS ≥ threshold (default: 1000). If ESS is too low, results may be unreliable. ESS adjusts for the fact that importance weighting does not treat all samples equally. If your raw sample has 10000 observations but ESS is 1000, only 1000 "effective" observations are contributing meaningfully after weighting.
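
A common definition of ESS for importance weights w is (Σw)² / Σw², illustrated below under the assumption that the playground uses this standard formula:

  # Common ESS formula for importance weights: (sum w)^2 / sum w^2.
  import numpy as np

  def effective_sample_size(weights: np.ndarray) -> float:
      return weights.sum() ** 2 / np.sum(weights ** 2)

  w = np.array([1.0] * 9000 + [30.0] * 1000)   # a few large weights dominate the sample
  print(f"n = {len(w)}, ESS = {effective_sample_size(w):.0f}")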

Uplift

Uplift equals Candidate Performance minus Baseline Performance. Uplift Point is the expected difference, and Uplift LCB is the conservative lower bound for that difference. Positive uplift means the candidate is better than baseline, while negative uplift means it is worse. Decisions focus on the uplift LCB because it is a conservative measure: you are 95% confident the uplift is at least this large.

Propensities

Propensities represent the probability that an item was shown to a user under the logging policy. They are required for confidence intervals, ESS calculation, and stress tests. If missing, you can enable "Estimate propensities if missing" (though this is less reliable). Best practice is to use true propensities from your logging system. Without propensities, you cannot compute confidence intervals, ESS, or run stress tests.

Stress Tests

Stress Tests validate robustness by asking "what if?" They simulate realistic constraints like popular items running out of stock, and check whether your policy still performs well. Inventory Caps simulate limited inventory availability. The result shows how LCB changes under stress, with a status of PASS (robust), FAIL (fragile), or INCONCLUSIVE. A policy that passes stress tests is more likely to succeed in production where conditions are messier than historical logs suggest.

Advanced Features

Column Mapping

Column Mapping allows you to work with non-standard column names. If your files use different naming conventions, click "Configure Column Mapping" after uploading files. The system will map your columns to required fields and auto-suggest mappings based on common synonyms (for example, movieId → item_id, rating → reward).

Propensity Estimation

Propensity Estimation helps when true propensities are not available. Enable "Estimate propensities if missing" in Advanced settings. The system uses softmax over scores (if score/rating column exists) or uniform per-user distribution (fallback). Warning: estimated propensities are less reliable than true ones, so use with caution.

Gating Thresholds

Gating Thresholds can be adjusted based on your risk tolerance. Lower thresholds are more permissive (easier to SHIP), while higher thresholds are more conservative (harder to SHIP). Default values work for most use cases.

Reproducibility

Reproducibility is ensured through several identifiers. Every evaluation includes a Config Hash (unique identifier for the configuration), Bootstrap Seed (ensures deterministic results), and Run ID (unique identifier for this evaluation run). Use these to reproduce results or share with your team.
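
For intuition about how a config hash supports reproducibility, hashing a canonical serialization of the settings yields a stable identifier, as in this hypothetical sketch (the playground's actual hashing scheme may differ):

  # Hypothetical illustration of a config hash: identical settings always yield the same identifier.
  import hashlib
  import json

  config = {"min_overlap": 0.20, "min_ess": 1000, "min_uplift": 0.005,
            "harm_threshold": 0.005, "bootstrap_b": 600, "seed": 42}
  config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
  print(config_hash)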

Troubleshooting

"CI: N/A" or "ESS: — (unavailable)"

This indicates that propensities are missing or weights are not being used. Check the Gate & Diagnostics tab. If it says "Weights used: No," you need propensities. Solutions include:

  • Mapping a propensity column in Column Mapping if your logs have one
  • Enabling "Estimate propensities if missing" in Advanced Settings (less reliable but better than nothing)
  • If using estimated propensities, checking that your logs contain sufficient information (scores, ratings) for estimation to work

"Decision: INCONCLUSIVE"

This has several possible causes:

  • Gates failed: Check the Gate & Diagnostics tab. If overlap is 12% but minimum is 20%, you need more overlap; if ESS is 750 but minimum is 1000, you need more effective samples. Consider lowering thresholds if they are too conservative for your use case, or collect more data if they are reasonable.
  • CI unavailable: Goes back to the propensity issue described above.
  • Uplift LCB between thresholds: Your uplift LCB is above -Δharm but below Δmin. Here, the evidence does not strongly point either way. This is often legitimate when the policy has a small positive effect that is hard to detect with current sample size. Check Power & MDE to see if you are underpowered.

"Stress test not run"

This means propensities are required for stress tests. If your main evaluation does not have propensities (either mapped or estimated), stress tests cannot run. Fix the propensity issue first, then stress tests will execute automatically using the same propensities as the main evaluation.

Column mapping not working

Ensure column names match exactly (they are case-sensitive). If auto-detection is not working, use "Configure Column Mapping" to manually select the correct columns from the dropdowns. The system recognizes common synonyms, but custom column names need manual mapping.

Pairwise diff shows "CACHE_MISS"

This means bootstrap replicates were not cached during evaluation. Re-run the leaderboard evaluation. Check that both policies have valid CI (ci_status == "ok"). If one policy has skipped gates, pairwise comparisons will not work.

Files not uploading

Check that:

  • The file format is CSV or Parquet
  • The file size is not too large for your system

If uploads still hang, try a different browser or clear your browser cache.

Best Practices

  • Always use true propensities when available. Estimated propensities may introduce additional uncertainty. If your logging system tracks exposure probabilities, include them in your logs file.
  • Start with default thresholds (20% overlap, 1000 ESS, 0.5% min uplift) as they are calibrated for typical recommendation systems. Only adjust them if you have strong reasons based on your specific context.
  • Check Gate & Diagnostics first before diving into detailed tabs. Understanding which gates passed and which failed guides where to focus your attention.
  • Review the Passport even if the decision seems clear. Passport contains critical context that future you (or your teammates) will appreciate when documenting what was checked and why a decision was made.
  • Run stress tests before shipping. Even if your policy passes all gates and shows good uplift, stress tests reveal fragility. A policy that breaks under inventory constraints might fail in production during high-demand periods.
  • Use the leaderboard for exploration when you have multiple candidates, and compare them all; sometimes the third-ranked policy has better robustness characteristics or interpretability, making it a safer choice than the narrow winner.
  • Save Run IDs and Config Hashes for reproducibility. If someone questions your decision six months later, you can recreate the exact evaluation that informed your choice.
  • When using the leaderboard, make sure all candidate files use the same structure and represent comparable policies (same feature set, same user population). Do not compare policies trained on different user populations using the same logs, as the overlap will be poor and results misleading.

For questions or issues, please contact us at [email protected].