Off-Policy Evaluation for Recommenders: Foundations, Logging, Gating, and Validation
Abstract
Off-Policy Evaluation (OPE) estimates how a new ranking policy would have performed using historical interaction logs, without a live rollout. This practical guide covers the complete implementation pipeline: mathematical foundations, logging requirements, gating rules (overlap ≥ 0.8, ESS ≥ 1,000, confidence intervals), and validation workflows. We use OPE as pre-A/B triage to block obvious failures and shortlist promising variants before consuming experiment resources. The guide includes code examples, benchmark datasets (KuaiRand, Open Bandit Dataset), troubleshooting tips, and production implementation strategies. Only trust OPE when support and overlap are adequate. Verify that your logging policy explored enough of the action space to provide information about your target policy's decisions. In practice, we block any variant with overlap < 0.8 or ESS < 1,000, even if the point estimate looks great. For broader context on how OPE transforms experimentation beyond traditional A/B testing, see our article on Beyond A/B Testing.
The Mathematical Foundation: Reweighting Historical Data
OPE builds on importance sampling, a method that estimates expected values under a target distribution by reweighting samples drawn from another distribution (Hesterberg, 1995). The fundamental challenge is straightforward: you want to estimate the expected reward (clicks, purchases, or engagement time) under your target policy, but you only have rewards observed under your logging policy. Importance sampling solves this by reweighting observed rewards to correct for the difference between policies.
The intuition becomes clear through example. Suppose your target policy would show recommendation $X$ with 30% probability, but the logging policy only showed it 10% of the time. When $X$ appears in the logs, that observation needs heavier weighting when estimating the target policy's performance, specifically by a factor of 3 (calculated as 30% divided by 10%). Conversely, if the target policy would show recommendation $Y$ with 5% probability but the logging policy showed it 20% of the time, those observations get down-weighted by 0.25. Reweighting across all observations provides an unbiased estimate of the target policy's expected reward, assuming the logging policy had non-zero probability of showing every action the target policy might take.
Let me pause here. Before diving into the math, I want to emphasize something important: OPE is not magic. It amplifies whatever is in your data, good or bad, in equal measure. If your logging policy never explored, OPE will not save you!
Inverse Propensity Scoring
Inverse Propensity Scoring (IPS) implements this reweighting directly. The estimator is:
$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} r_i$$
where $\hat{V}_{\text{IPS}}(\pi)$ is the IPS estimate of the policy value, that is, the expected reward you would get if you deployed policy $\pi$; $x_i$ is the context, $a_i$ the action (an item or item-position pair), $r_i$ the observed reward, $\mu$ the logging policy, $\pi$ the target policy, and $w_i = \pi(a_i|x_i)/\mu(a_i|x_i)$ the importance weight.
The critical assumption underlying IPS is the support requirement: IPS provides unbiased estimates only when $\mu(a|x) > 0$ for every action $a$ that $\pi$ takes with positive probability in context $x$. Deterministic logging policies that always show the same items violate this assumption. You need exploration: randomization, Thompson sampling, epsilon-greedy, or any mechanism that ensures your logging policy occasionally shows recommendations outside its top predictions. Without exploration, you have no information about how users would respond to different recommendations, and off-policy evaluation becomes either impossible or unreliable.
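To make this concrete, here is a minimal IPS sketch. It assumes you already hold per-impression arrays of target propensities (pi_probs), logging propensities (mu_probs), and observed rewards (rewards); the names are illustrative, not a fixed API. Note that a true support violation, an action $\pi$ would take that $\mu$ never logged, does not show up in these arrays at all, which is exactly why the overlap gate later in this guide exists.
# Minimal IPS sketch; pi_probs, mu_probs, rewards are per-impression NumPy arrays
# holding pi(a_i|x_i), mu(a_i|x_i), and r_i (illustrative names).
import numpy as np
def ips_estimate(pi_probs, mu_probs, rewards):
    if np.any(mu_probs <= 0):
        # A logged action with zero propensity contradicts the stated logging
        # policy and would make the weights undefined.
        raise ValueError("logging propensities must be positive for logged actions")
    w = pi_probs / mu_probs              # importance weights pi/mu
    return float(np.mean(w * rewards))   # unbiased when mu(a|x) > 0 wherever pi(a|x) > 0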
Controlling Variance Through Advanced Estimators
Importance sampling can have high variance, especially when logging and target policies diverge significantly. If your target policy wants to show recommendations that your logging policy rarely showed, you have few observations of those recommendations, and those few observations receive very high weights. A single unusual user response can dominate the entire estimate, making it unstable. This is OPE's fundamental trade-off: more exploration in your logging policy means lower variance and more accurate policy evaluation, but exploration typically costs immediate performance.
Self-Normalized Importance Sampling (SNIPS) addresses variance by normalizing the weights so they sum to one rather than to the number of samples (Swaminathan & Joachims, 2015):
$$\hat{V}_{\text{SNIPS}}(\pi) = \frac{\sum_i w_i r_i}{\sum_i w_i}$$
This trades a bit of bias for lower variance. Actually, let me be more precise: SNIPS introduces bias when the sum of weights differs from the sample size, but this bias is typically small and decreases as sample size grows. In practice, this bias-variance trade-off usually works in your favor. SNIPS is more stable than IPS and often provides better estimates, particularly when importance weights have high variance.
Doubly Robust (DR) estimation combines importance weighting with model-based prediction:
$$\hat{V}^{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} (r_i - \hat{q}(x_i, a_i)) + \hat{E}_{a \sim \pi(\cdot|x_i)}[\hat{q}(x_i, a)] \right]$$
where $\hat{q}(x,a)$ is a learned model predicting expected reward for action $a$ in context $x$. You use models (often the same ones powering your target policies) to predict expected rewards, then use importance weighting to correct any bias in those predictions. When your models are accurate, you get low variance similar to pure model-based estimation. When models are wrong, importance weighting corrects them. That's why DR stays usable when either the model or the weights misbehave (Dudík et al., 2011).
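A sketch of how the pieces fit together, assuming you have the importance weights w, observed rewards, the model's prediction for the logged action q_logged, and, for each context, the target probabilities pi_cand and model predictions q_cand over the candidate set (all illustrative names):
# Doubly robust sketch. Assumed inputs (illustrative names):
#   w         : importance weights pi(a_i|x_i)/mu(a_i|x_i), shape (n,)
#   rewards   : observed rewards r_i, shape (n,)
#   q_logged  : q_hat(x_i, a_i) for the logged action, shape (n,)
#   pi_cand   : target probabilities over each context's candidate set, shape (n, A)
#   q_cand    : q_hat(x_i, a) for every candidate action, shape (n, A)
import numpy as np
def dr_estimate(w, rewards, q_logged, pi_cand, q_cand):
    direct = np.sum(pi_cand * q_cand, axis=1)    # E_{a ~ pi}[q_hat(x_i, a)]
    correction = w * (rewards - q_logged)        # importance-weighted model error
    return float(np.mean(direct + correction))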
Marginal Importance Sampling exploits the structure of recommendation systems where you typically show multiple items simultaneously. Instead of weighting based on the full slate of recommendations, you weight based on individual item-position pairs (Gilotte et al., 2018). This dramatically reduces variance because individual items appear more frequently than specific slates. For slate-level evaluation, you sum contributions across positions:
$$\hat{V}^{\text{MIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K} \frac{\pi(a_{ij}|x_i, j)}{\mu(a_{ij}|x_i, j)} r_{ij}$$
where $a_{ij}$ is the item shown at position $j$ and $r_{ij}$ is the position-specific reward (typically 1 if the user clicked that position, 0 otherwise).
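A sketch of the slate-level computation, assuming pi_pos and mu_pos hold target and logging propensities for the item shown at each position and rewards_pos holds the position-level rewards (illustrative names):
# Marginal importance sampling over item-position pairs (sketch).
#   pi_pos, mu_pos : propensities for the shown item at each position, shape (n, K)
#   rewards_pos    : position-level rewards, e.g. click at that position, shape (n, K)
import numpy as np
def mis_estimate(pi_pos, mu_pos, rewards_pos):
    w = pi_pos / mu_pos                                        # per item-position weights
    return float(np.mean(np.sum(w * rewards_pos, axis=1)))     # sum positions, average slates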
We have introduced IPS, SNIPS, DR, and MIS for slates (which weights item–position pairs); for further variants, see Switch-DR (which falls back to the model when weights are large), MRDR (which trains the reward model to minimize DR's MSE), and DROS (DR with optimized shrinkage to stabilize extreme weights).
What to Log: Building the Foundation for Evaluation
Implementation requires careful logging infrastructure. At minimum, you need recommendations shown to each user, user actions in response, and enough information to compute propensity scores for every action taken. For position-based recommenders, log scores or probabilities for all recommended items, and ideally for a broader candidate set beyond what was actually shown. For contextual bandits or reinforcement learning approaches, log context features, actions taken, and reward signals.
Most companies discover their existing logging infrastructure captures much of what is needed but may be missing pieces like the full candidate set or algorithm uncertainty estimates. Common logging pitfalls include recording model scores or ranks instead of the actual exposure probability used by the serving policy. For OPE, you must log $\mu(a|x)$ for the shown item–position (and the candidate set at decision time), otherwise IPS/SNIPS/DR weights are invalid. Fortunately, logging infrastructure is usually easier to improve than experimentation infrastructure.
The essential logging schema for ranking systems should include the fields below; a sketch of one logged record follows the lists.
Required fields:
- user_id or session identifier to track individual interactions
- item_id for each recommended item
- position indicating rank in the slate (1 through K)
- score_or_prob capturing the ranker's score or exposure probability $\mu(a|x)$
- shown indicator (1 if displayed, 0 if candidate only)
- clicked or other binary/continuous reward signal
- timestamp for temporal analysis
- policy_id identifying which algorithm version generated these logs
Recommended fields:
- candidate_set_size documenting how many items were eligible for recommendation
- experiment_bucket if running concurrent tests
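One lightweight way to pin this down is a typed record per (request, position); the sketch below simply mirrors the fields above and is not a prescription for your storage layer.
# One logged row per (request, position), mirroring the schema above (a sketch).
from dataclasses import dataclass
from typing import Optional
@dataclass
class ImpressionLog:
    user_id: str                  # or session identifier
    item_id: str
    position: int                 # rank in the slate, 1 through K
    score_or_prob: float          # ranker score or exposure probability mu(a|x)
    shown: int                    # 1 if displayed, 0 if candidate only
    clicked: float                # binary or continuous reward
    timestamp: int                # e.g. epoch milliseconds
    policy_id: str                # algorithm version that generated this slate
    candidate_set_size: Optional[int] = None    # recommended
    experiment_bucket: Optional[str] = None     # recommended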
If you only have scores rather than probabilities, convert scores to propensities using softmax over the candidate set, then multiply by a position-bias model. The conversion is:
$$\mu(a|x) = \frac{\exp(s(a|x))}{\sum_{a' \in \mathcal{A}} \exp(s(a'|x))},$$
where $s(a|x)$ is the score for item $a$ in context $x$ and $\mathcal{A}$ is the candidate set. For position-aware propensities, multiply by the examination probability:
$$\mu(a, j|x) = \mu(a|x) \cdot p_{\text{exam}}(j),$$
where $j$ is the position. If position bias is unknown, it can be estimated from historical data. A simple examination model assumes that the probability of a user examining position $j$ decays exponentially: $p_{\text{exam}}(j) = \gamma^{j-1}$ where $\gamma$ is often around 0.7. Users examine the first item with high probability, the second item less often, and deeper positions with progressively lower probability. More sophisticated models can be learned directly from interaction data.
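A sketch of the conversion for a single request, with scores as the array of $s(a|x)$ over the candidate set; the default $\gamma = 0.7$ and the slate depth are assumptions that should come from your own position-bias estimates:
# Convert raw scores to (item, position) propensities via softmax and an
# exponential examination model (sketch; gamma and n_positions are assumptions).
import numpy as np
def scores_to_propensities(scores, gamma=0.7, n_positions=10):
    s = scores - np.max(scores)                   # stabilize the softmax
    item_probs = np.exp(s) / np.sum(np.exp(s))    # mu(a|x) over the candidate set
    exam = gamma ** np.arange(n_positions)        # p_exam(j) = gamma^(j-1), j = 1..K
    return item_probs[:, None] * exam[None, :]    # mu(a, j|x) for every (item, position)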
If your logging policy is deterministic (always showing the same recommendations to similar users), IPS is not valid because $\mu(a|x) = 0$ for most actions. You have three options in that case: introduce randomization going forward and wait for exploratory data to accumulate, use SNIPS or DR as diagnostics only while validating everything through online tests, or run interleaving experiments that generate propensities by comparing pairs of algorithms.
Position Bias and Slate-Level Evaluation
One practical consideration deserves emphasis: position bias and other contextual effects significantly impact recommendation evaluation. Users click more on recommendations in prominent positions regardless of content quality (Joachims et al., 2017). When computing propensities, you must account for these factors. If your logging policy placed item $X$ in position 1 and the target policy would place it in position 3, the relevant propensity concerns the item-position combination, not just the item alone. Sophisticated OPE implementations model these effects explicitly, either treating position as part of the context or learning position bias models from data and incorporating them into evaluation.
Ranking systems show multiple items simultaneously, and position matters substantially. Standard item-level IPS overestimates performance if it ignores position bias, the tendency for users to click higher-ranked items simply because they are more visible. The proper approach treats (item, position) as the action. Instead of computing $\pi(\text{item}|x)$, compute $\pi(\text{item, position}|x)$. This requires position-aware propensities: log not just which items were shown but where they appeared in the slate. If your ranker scores items independently, multiply item propensities by position-click probabilities to get joint probabilities.
Slate-aware estimators account for item interactions within slates. Pseudoinverse estimators and marginal importance sampling handle cases where showing item $A$ at position 1 affects the value of showing item $B$ at position 2. These methods recognize that slate-level performance depends on both individual item quality and the composition of the entire recommendation set. If you cannot model position effects reliably, the conservative approach restricts OPE claims to top-1 recommendations only, or validates ranking changes through online interleaving experiments rather than relying solely on off-policy estimates.
Gating Rules: Knowing When to Trust Your Estimates
Off-policy evaluation should serve as a screening mechanism rather than a final arbiter. The criteria below address distinct failure modes. Proceed to online testing only when all criteria are satisfied.
1. Support and Overlap
Failure mode: If the target policy proposes actions that the logging policy almost never took, importance weights become extreme and the estimate becomes unreliable.
Computation: Let $\mathcal{A}_{\pi}(x)$ denote the top $K$ actions the target policy would take for context $x$, and $\mathcal{A}_{\mu}(x)$ the actions the logging policy actually took. The overlap is:
$$\text{overlap} = \frac{\sum_x |\mathcal{A}_{\pi}(x) \cap \mathcal{A}_{\mu}(x)|}{\sum_x |\mathcal{A}_{\pi}(x)|}.$$
Treat the action as the pair of item and position. Compute per business segment when appropriate, then aggregate.
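A sketch of this computation, assuming per-context sets of (item, position) actions for both policies (the data structures are illustrative):
# Overlap sketch: target_by_ctx maps each context to the set of (item, position)
# actions the target policy would take; logged_by_ctx to the actions actually logged.
def overlap(target_by_ctx, logged_by_ctx):
    hits, total = 0, 0
    for ctx, target_actions in target_by_ctx.items():
        logged_actions = logged_by_ctx.get(ctx, set())
        hits += len(target_actions & logged_actions)
        total += len(target_actions)
    return hits / total if total else 0.0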
Threshold and rationale: Require overlap of at least 0.80. Below 0.80, at least one in five target actions is unsupported, which destabilizes IPS, SNIPS, and DR (Thomas et al., 2015). In practice, overlap below roughly 0.7–0.8 tends to produce estimation variance that grows rapidly as the share of unsupported actions increases, making reliable confidence intervals impractical. For high-risk launches, prefer 0.90. The 80% threshold works well for most recommendation scenarios, but it can be relaxed to 70% for large catalogs with long-tail items where perfect overlap is impractical, or tightened to 90% for high-stakes financial or safety-critical applications where false positives are costly.
Remedy when failing: Introduce exploration, expand candidate logging, or restrict scope to top-1 recommendations.
2. Effective Sample Size
Failure mode: A small number of very large weights can dominate the estimate and inflate variance.
Computation: Effective Sample Size (ESS) measures how many "effective" observations remain after reweighting:
$$\text{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2},$$
computed on the exact weights used in the estimator, after clipping if clipping is applied.
Threshold and rationale: Require ESS at least 1,000 as a baseline. The standard error of a mean scales with the inverse square root of ESS. Around 1,000, intervals are typically tight enough for a decision to test or block. This threshold ensures sufficient effective observations for the central limit theorem to provide reasonably accurate normal approximations for confidence intervals. With ESS < 1,000, bootstrap confidence intervals become unreliable, and the empirical variance of importance-weighted estimators can fluctuate significantly across resamples. For fine ranking among similar variants, aim for at least 5,000.
Remedy when failing: Increase exploration or logging window, adopt lower variance estimators such as SNIPS, DR, or MIS, or narrow the evaluation scope.
3. Weight Magnitude and Clipping
Failure mode: Modest mismatches between target and logging policies can create extreme weights and allow a small set of rows to determine the outcome.
Computation: Inspect maximum and high percentiles such as the 95th and 99th percentiles. Run sensitivity with clipping caps $\tau \in \{5, 10, 20, 50\}$. Define clipped weights $w_i^{(\tau)} = \min(w_i, \tau)$.
Threshold and rationale: Use $\tau$ between 10 and 20. The choice of $\tau \in \{10, 20\}$ is a practical heuristic derived from empirical studies where these values effectively bound the variance of clipped IPS estimates while keeping bias manageable (Su et al., 2020). Require that clipping affects at most 2% of the total weight mass and that conclusions remain unchanged across reasonable $\tau$. This controls variance while introducing limited bias. If results change dramatically across clipping thresholds, your estimate is fragile, highly dependent on a few extreme observations that may not be representative.
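A small diagnostic sketch over a weight array w and matching rewards; the cap grid mirrors the sensitivity sweep described above:
# Weight diagnostics and clipping sensitivity (sketch).
import numpy as np
def clipping_report(w, rewards, taus=(5, 10, 20, 50)):
    print("max weight:", np.max(w), "p95/p99:", np.percentile(w, [95, 99]))
    for tau in taus:
        wc = np.minimum(w, tau)                        # clipped weights w^(tau)
        clipped_mass = 1.0 - np.sum(wc) / np.sum(w)    # share of weight mass removed
        ips = np.mean(wc * rewards)
        ess = np.sum(wc) ** 2 / np.sum(wc ** 2)        # ESS on the clipped weights
        print(f"tau={tau}: ips={ips:.4f} ess={ess:.0f} clipped_mass={clipped_mass:.2%}")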
Remedy when failing: Revisit propensity estimation, improve overlap, or reject the estimate as fragile.
4. Confidence Interval Width
Failure mode: A strong point estimate can mask large uncertainty.
Computation: Bootstrap resampling or influence functions provide 95% confidence intervals around point estimates. Use a stratified or block bootstrap that respects time or session dependence, or influence function approximations. Report absolute width and relative half-width.
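A sketch of a session-level bootstrap for SNIPS, which resamples whole sessions so that within-session dependence is preserved; session_ids, w, and rewards are per-impression arrays (illustrative names):
# Session-level bootstrap CI for SNIPS (sketch). Resampling whole sessions keeps
# impressions from the same session together.
import numpy as np
def session_bootstrap_ci(session_ids, w, rewards, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    sessions, inv = np.unique(session_ids, return_inverse=True)
    num = np.bincount(inv, weights=w * rewards)   # sum of w*r per session
    den = np.bincount(inv, weights=w)             # sum of w per session
    idx = rng.integers(0, len(sessions), size=(n_boot, len(sessions)))
    estimates = num[idx].sum(axis=1) / den[idx].sum(axis=1)   # SNIPS per resample
    return np.percentile(estimates, [2.5, 97.5])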
Threshold and rationale: For sending a variant to an online test, require a relative half-width of at most 10–20% of the point estimate. For critical deployment or promotion decisions, require at most 5–10%. Separately, apply standard significance logic to the estimated uplift: when a 95% confidence interval on the uplift crosses zero, you cannot reject the null hypothesis of no treatment effect, meaning there is insufficient evidence to distinguish the policy from baseline at conventional significance levels. If the 95% CI crosses zero, we do not test. No exceptions.
Field note: When CI width is borderline, we still run an A/B test, but at smaller traffic and shorter duration.
Remedy when failing: Collect more data, reduce variance with SNIPS, DR, or MIS, increase exploration, or limit scope to top-1.
5. Distribution Drift
Failure mode: If the evaluation window and the decision window differ materially, historical logs may not represent current behavior.
Computation: Compare reward distributions between the logging period and current period. Quantities include reward shift and covariate shift: Wasserstein distance for rewards, Kullback–Leibler divergence for binned click-through rates, population stability index (PSI) for key features, item churn rate, and policy shift in positions and scores.
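A sketch of the population stability index for a single feature or score, with bins fixed on the logging (reference) window:
# Population stability index between the logging window and the current window (sketch).
import numpy as np
def psi(reference, current, n_bins=10, eps=1e-6):
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))   # bins from the reference period
    lo, hi = edges[0], edges[-1]
    # Clip both windows into the reference range so every value falls in a bin.
    p_ref = np.histogram(np.clip(reference, lo, hi), bins=edges)[0] / len(reference) + eps
    p_cur = np.histogram(np.clip(current, lo, hi), bins=edges)[0] / len(current) + eps
    return float(np.sum((p_cur - p_ref) * np.log(p_cur / p_ref)))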
Threshold and rationale: Flag when the population stability index exceeds 0.25, or when reward shift leaves the historically safe band calibrated from retrospective comparisons of OPE and online results. Distribution shift violates the fundamental assumption of OPE that historical and current data distributions are similar; when this assumption fails, importance sampling estimators become biased because the reweighting scheme no longer corrects for distributional differences (Kallus & Zhou, 2018). User preferences shift over time, content catalogs change, seasonal effects influence engagement, and market conditions evolve. Drift weakens the mapping from offline estimates to online performance.
Remedy when failing: Re-estimate on fresher data, segment by regime such as season, or tighten other gates by requiring higher overlap, higher ESS, and narrower intervals.
6. Stability Across Methodological Choices
Failure mode: Conclusions that depend on a single modeling choice are brittle.
Computation: Check agreement across IPS, SNIPS, and DR; across clipping values; and across adjacent calendar slices. For each estimate, report how it varies with clipping thresholds, propensity model choices, and confidence interval methods.
Threshold and rationale: Require directional agreement and similar magnitudes across reasonable methodological choices. If the sign changes or magnitudes vary by more than about 30% across reasonable settings (clipping at 10 versus 20, IPS versus SNIPS, 100 versus 1,000 bootstrap samples), treat the estimate as unstable. Stability across methodological choices suggests the signal is robust; instability suggests the signal is weak and dominated by noise or a few influential observations.
Remedy when failing: Diagnose propensity estimation, position bias modeling, and overlap. If instability persists, block.
Decision Rule
When all information quality gates pass (overlap, ESS, CI width, stability), apply harm/benefit tests using the one-sided 95% lower confidence bound (LCB) of mean-centered uplift.
- NO_SHIP: If LCB < $-\Delta_{\text{harm}}$ (typically −0.5% to −1.0% of baseline), block deployment due to evidence of harm.
- SHIP: If LCB ≥ $\Delta_{\text{min}}$ (typically 0.5% to 1.0% of baseline), advance to A/B testing due to evidence of benefit.
- INCONCLUSIVE: If LCB falls between $-\Delta_{\text{harm}}$ and $\Delta_{\text{min}}$, or if any information quality gate fails, mark as inconclusive.
Calibrate $\Delta_{\text{harm}}$ and $\Delta_{\text{min}}$ to your business context: tighter for high-stakes applications (0.2–0.5%), more relaxed for rapid iteration (1.0–2.0%). Block deployment if any information quality gate fails, regardless of point estimates.
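A sketch of the full rule, with gate thresholds and deltas exposed as parameters so they can be calibrated as described above; lcb is the one-sided 95% lower confidence bound on uplift expressed as a fraction of baseline, and the names and defaults are illustrative:
# Decision-rule sketch; defaults mirror the thresholds discussed in this guide
# and should be calibrated to your own context.
def ope_decision(lcb, overlap, ess, rel_half_width, stable,
                 delta_harm=0.005, delta_min=0.005,
                 min_overlap=0.8, min_ess=1000, max_rel_half_width=0.2):
    gates_ok = (overlap >= min_overlap and ess >= min_ess
                and rel_half_width <= max_rel_half_width and stable)
    if not gates_ok:
        return "INCONCLUSIVE"      # an information quality gate failed
    if lcb < -delta_harm:
        return "NO_SHIP"           # evidence of harm
    if lcb >= delta_min:
        return "SHIP"              # evidence of benefit; advance to A/B testing
    return "INCONCLUSIVE"          # LCB falls in the gray zone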
We treat OPE as a high-quality filter that raises the hit rate of online experiments, not a replacement for validation. When estimates are reliable by these standards, they provide strong evidence about algorithm performance. When estimates fail these checks, they remain informative in ruling out obviously poor performers but should not be trusted for fine-grained distinctions.
Building Trust: Validation Against A/B Tests
Before relying on OPE for high-stakes decisions, validation is essential. Validation establishes that offline estimates match online results in your specific domain, with your specific logging infrastructure, and under your specific business constraints. This is not a one-time exercise but an ongoing practice that builds institutional knowledge about when OPE can be trusted.
Retrospective validation: Choose 5-10 algorithms you have already A/B tested in the past. Run OPE on logged data from before those tests were conducted, using only the data that would have been available when making the testing decision. Plot OPE estimates against actual A/B test lifts. Compute correlation and mean absolute error. This reveals how well OPE predictions match reality in your environment. Perfect correlation is not necessary; what matters is the relationship between offline estimates and online performance. If high OPE scores consistently translate to high A/B test lifts, OPE provides actionable information.
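The bookkeeping is simple; a sketch, assuming parallel arrays of OPE-estimated lifts and realized A/B lifts for the same past variants:
# Retrospective validation metrics (sketch): how well do OPE-predicted lifts
# track the lifts actually observed in past A/B tests?
import numpy as np
def validation_metrics(ope_lifts, ab_lifts):
    ope_lifts, ab_lifts = np.asarray(ope_lifts), np.asarray(ab_lifts)
    corr = float(np.corrcoef(ope_lifts, ab_lifts)[0, 1])
    mae = float(np.mean(np.abs(ope_lifts - ab_lifts)))
    return corr, mae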
Prospective validation: Use OPE to rank 5 candidate algorithms, then A/B test all 5 simultaneously. Check whether OPE ranking matches online ranking. While agreement provides confidence, disagreement triggers diagnosis: was overlap insufficient for certain algorithms? Has user behavior shifted since the logging data was collected? Did position bias or slate effects distort estimates? Each discrepancy teaches you something about your data quality, exploration strategy, or evaluation methodology.
Continuous monitoring: Each time you A/B test a new algorithm, log the OPE estimate beforehand. Track the ratio of OPE-predicted lift to actual lift over time. If this ratio drifts (OPE consistently over-predicts or under-predicts), your logging policy or user behavior has changed in ways that affect estimate quality. This early warning system catches problems before they lead to bad decisions. Consistent monitoring also identifies which types of algorithms OPE evaluates well and which types require more cautious interpretation.
Sensitivity analysis: For each estimate, report how it varies with clipping thresholds, propensity model choices, and confidence interval methods. If estimates remain stable across perturbations (similar predictions whether you clip at 10 or 20, whether you use IPS or SNIPS, whether you bootstrap with 100 or 1000 samples), trust increases. If estimates vary wildly, treat them as uninformative. Stability across methodological choices suggests the signal is robust; instability indicates a weak signal dominated by noise or a few influential observations.
Documentation of both successes and failures builds organizational learning. Share cases where OPE correctly predicted A/B results: which algorithms it ranked accurately, what gating checks passed, what characteristics made those estimates reliable. Share cases where OPE was misleading: which algorithms it misjudged, which checks should have warned you, what you learned about your data or evaluation process. This institutional memory helps your team develop calibrated intuitions about when to trust OPE and when to require online validation. It also makes OPE adoption less about faith in the methodology and more about empirical track record.
When to Use OPE: A Decision Funnel
Where does OPE actually pay for itself? Off-policy evaluation fits into recommendation development pipelines at specific stages where its speed and safety advantages provide the most value. The key is not just matching the method to the decision, but understanding what OPE can and cannot tell you. This distinction matters more than most people realize.
Rapid prototyping explores broad spaces of possibilities. When you have dozens or hundreds of algorithm variations (different architectures, hyperparameters, feature sets, or algorithmic approaches), OPE eliminates poor performers quickly. Algorithms that score poorly in OPE almost certainly will fail online. This is valuable even when estimates are noisy, since the goal is not to find the best variant yet but to remove the worst. If your team generates 100 ideas per quarter and can only A/B test 10, OPE lets you confidently discard 80 weak ideas and focus testing resources on the 20 most promising.
Shortlisting narrows to the best candidates for online validation. After filtering out obvious failures, you have 10-20 promising candidates and need to choose which merit the cost of A/B testing. Use OPE with strict gating (overlap ≥ 0.9, ESS ≥ 5,000, tight confidence intervals) to rank candidates. Select the top 3-5 for online validation. This phase demands higher estimate quality than rapid prototyping because you are making finer distinctions between competitive algorithms.
Hyperparameter tuning presents a perfect use case for OPE. You have one algorithm and need to tune regularization parameters, learning rates, embedding dimensions, or other hyperparameters. The combinatorial space often contains thousands of configurations. Grid-searching this space with A/B tests would take years. With OPE, evaluate all configurations offline in days, identify the Pareto frontier of options balancing different objectives, then validate the frontier online. You might then select a few configurations from that frontier for final validation, or simply deploy the configuration that looks best with high confidence. This is where we see the biggest wins from OPE: evaluating 1,000 configurations offline instead of A/B testing 10 online is an enormous difference, and the marginal cost is essentially zero once you have the infrastructure.
Pre-deployment validation assesses major algorithmic changes before investing in production deployment. When considering switching from collaborative filtering to neural networks, changing your content corpus, or modifying your ranking function, OPE estimates expected lift and downside risk. This helps teams make informed decisions about which changes merit the engineering investment required for production deployment. For strategic changes that touch multiple parts of the recommendation system, OPE evaluates the combined effect without deploying complex multi-factor experiments.
Post-deployment analysis often gets overlooked but provides valuable learning. After deploying a new algorithm, use OPE on production data to evaluate variations you did not deploy. This reveals whether you made the optimal choice or whether a slightly different hyperparameter setting would have worked even better, informing your next iteration. You can also monitor for concept drift by periodically evaluating your current policy and alternatives on recent data, providing early warning if your algorithm's relative performance is degrading.
Do not use OPE for new product features that fundamentally change user experience, long-term behavioral shifts that manifest over weeks or months, network effects where one user's recommendations influence others, or when your logging policy has been purely deterministic without any exploration. These scenarios require online testing because OPE cannot capture the dynamics involved.
Before deploying OPE in production, validate your implementation on public datasets with known online results. This builds confidence in your code and helps you understand what factors affect estimate quality. Start with datasets like KuaiRand-1K or Open Bandit Dataset that have high-exploration logging policies, which provide excellent support for OPE evaluation.
Limitations (What OPE Cannot Do):
Off-policy evaluation has fundamental limits that determine when it can and cannot be trusted:
- Support requirement: OPE can only evaluate policies that take actions your logging policy also took. Requires exploration through randomization, Thompson sampling, or epsilon-greedy.
- Policy divergence: Radically different algorithms have high uncertainty. OPE works best for evaluating variations of your existing approach.
- Distribution shift: User preferences, content availability, or market conditions may change between data collection and evaluation.
- Long-term effects: OPE measures immediate rewards but cannot capture delayed impacts that unfold weeks or months later.
- Network effects: OPE's assumption of user independence breaks down when one user's experience is influenced by other users' recommendations.
- Model misspecification: Doubly robust methods can be biased if your reward model is systematically wrong.
OPE is not a replacement for A/B testing. It is a high-quality filter that increases the hit rate of online experiments by identifying promising candidates and eliminating obvious failures before they consume testing resources.
Implementation: Building the Infrastructure
Implementing off-policy evaluation in production requires three components: logging infrastructure, computation pipeline, and validation workflow. Each piece builds on the others to create a system that reliably estimates algorithm performance.
Logging Infrastructure and Exploration
If your current system is deterministic (always showing the same recommendations for similar users), the first step is introducing exploration. Without exploration, your data contains no information about how users respond to recommendations outside your current top predictions, making OPE impossible. The simplest approach is epsilon-greedy exploration: with probability $\epsilon$, replace one recommendation with a random item. Even $\epsilon = 0.01$ or $\epsilon = 0.02$ can provide valuable data for OPE while having minimal impact on immediate performance; a 1–2% exploration rate usually costs less than 1% on engagement metrics. This 1% cost estimate assumes typical e-commerce or content recommendation scenarios with moderate user engagement; the actual cost may be higher for low-engagement domains (news, job recommendations) or lower for high-engagement domains (social media, entertainment) where users are more tolerant of exploration.
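A sketch of this scheme with the propensity bookkeeping included; ranked_items is the ranker's ordering for one request and catalog the eligible item list (illustrative names; duplicate items across slots are ignored for simplicity):
# Epsilon-greedy slate exploration with propensity logging (sketch): with
# probability epsilon, one uniformly chosen slot is replaced by a random item.
import random
def recommend_with_exploration(ranked_items, catalog, k=10, epsilon=0.02, rng=random):
    slate = list(ranked_items[:k])
    if rng.random() < epsilon:
        pos = rng.randrange(k)                    # slot to explore
        slate[pos] = rng.choice(catalog)          # uniform random replacement
    propensities = []
    for j, item in enumerate(slate):
        p_random_here = (epsilon / k) * (1.0 / len(catalog))   # explored this slot and drew this item
        if item == ranked_items[j]:
            propensities.append(1.0 - epsilon / k + p_random_here)   # greedy item survives or is redrawn
        else:
            propensities.append(p_random_here)
    return slate, propensities                    # log propensities alongside the slate and clicks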
More sophisticated exploration strategies can actually improve online performance while providing richer data for evaluation. Thompson sampling samples recommendations according to your uncertainty about their quality, naturally balancing exploration and exploitation. Items you are confident about get recommended often; items you are uncertain about get explored proportionally to their potential. Upper confidence bound methods explicitly favor items you are uncertain about, encouraging exploration of potentially good but under-explored recommendations. These approaches often improve online metrics relative to pure exploitation while simultaneously providing better support for off-policy evaluation.
Log propensities, not just which items were shown. For scoring-based rankers, log scores for all candidates or at least the top 100. For probabilistic policies, log sampling probabilities. For contextual bandits, log context features and action probabilities. The goal is to be able to reconstruct, for any recommendation that was shown, how likely your logging policy was to show it in that context. This reconstructability enables computing importance weights accurately.
Computation Pipeline
Start with a minimal IPS implementation, validate it thoroughly, then add SNIPS and DR. The basic code structure:
# Minimal IPS with clipping; pi_target, mu_logging, r are per-impression NumPy
# arrays holding pi(a_i|x_i), mu(a_i|x_i), and r_i, and tau is the clipping cap.
import numpy as np
w = pi_target / mu_logging             # importance weights
w = np.minimum(w, tau)                 # clip at tau; if this line changes the rank order, your weights are too spiky
ips = np.mean(w * r)                   # IPS estimate
snips = np.sum(w * r) / np.sum(w)      # self-normalized IPS
ess = np.sum(w) ** 2 / np.sum(w ** 2)  # effective sample size
# Bootstrap 95% confidence interval on the IPS estimate
rng = np.random.default_rng(0)
idx = rng.integers(0, len(w), size=(1000, len(w)))   # resample impressions with replacement
ips_samples = np.mean(w[idx] * r[idx], axis=1)
ci_lower, ci_upper = np.percentile(ips_samples, [2.5, 97.5])
Report the point estimate, confidence interval, effective sample size, maximum weight, overlap, and a ship/block decision based on gating rules. This provides actionable information: not just an estimate but a judgment about whether that estimate is reliable enough to inform decisions.
For doubly robust estimation, you need a reward model $\hat{q}(x,a)$ that predicts expected reward. This is often the same model powering your target policy: a neural network, matrix factorization, or gradient boosting model trained to predict user engagement. The DR estimator combines this model with importance weighting to get the best of both worlds: low variance when the model is accurate, bias correction when the model is wrong.
Validation Workflow
Choose a past A/B test where you have both logged data from before the test and known online results. Run OPE on the pre-test data and compare your estimate to the actual lift observed in the A/B test. If they match within acceptable error, your implementation is correct. If not, debug systematically: check propensity calculations, verify position bias handling, confirm reward definitions match between offline and online evaluation, and examine data filtering for discrepancies.
Repeat this validation for 5-10 past tests covering different types of algorithms and different time periods. Measure the correlation between OPE estimates and online lifts. Use this correlation to calibrate your confidence in future OPE estimates. If correlation is 0.8 or higher, you can trust OPE for fine-grained algorithm selection. If correlation is 0.5-0.7, OPE is useful for filtering but requires online validation for close decisions. If correlation is below 0.5, investigate whether insufficient exploration, poor propensity estimation, or distribution drift is degrading estimate quality.
Integration into development workflows makes OPE effective. When someone trains a new model, automatically run OPE and include results in the model card alongside offline metrics like AUC or NDCG. Treat it as routine infrastructure, not a special tool requiring manual intervention. The easier you make it to get OPE estimates, the more consistently your team will use them, and the more value you will extract from your historical data.
The Path Forward
Off-policy evaluation changes the economics of innovation in recommendation systems by decoupling evaluation from production risk. The marginal cost of evaluating one more algorithm is essentially just computation time, orders of magnitude cheaper than A/B testing. Companies building new recommendation systems should design for OPE from the start rather than retrofitting it later, making exploration and propensity logging first-class architectural concerns from day one.
References
Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. Proceedings of the 28th International Conference on Machine Learning, 1097-1104.
Gilotte, A., Calauzènes, C., Nedelec, T., Abraham, A., & Dollé, S. (2018). Offline A/B testing for recommender systems. Proceedings of the 11th ACM International Conference on Web Search and Data Mining, 198-206.
Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2), 185-194.
Joachims, T., Swaminathan, A., & Schnabel, T. (2017). Unbiased learning-to-rank with biased feedback. Proceedings of the 10th ACM International Conference on Web Search and Data Mining, 781-789.
Kallus, N., & Zhou, A. (2018). Policy evaluation and optimization with continuous treatments. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 1243-1251.
Su, Y., Dimakopoulou, M., Krishnamurthy, A., & Dudík, M. (2020). Doubly robust off-policy evaluation with shrinkage. Proceedings of the 37th International Conference on Machine Learning, 9167-9176.
Swaminathan, A., & Joachims, T. (2015). The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems, 28, 3231-3239.
Thomas, P., Theocharous, G., & Ghavamzadeh, M. (2015). High confidence policy improvement. Proceedings of the 32nd International Conference on Machine Learning, 2380-2388.