Beyond the limitation of A/B Testing using Causal Inference

Hoang Dang
12 min read · Jun 24, 2024

Introduction

https://primo.ai/

In the realm of product management and development, understanding the impact of new campaign (aka treatment) releases on user behavior is crucial. With the recent release of a campaign, assessing its effect on key performance indicators, such as retention metrics (specifically, Day 1 retention or D1), becomes a pivotal task.

However, this task presents several challenges. While A/B testing is commonly employed to measure such impacts, it is not always a viable option due to ethical, practical, or financial constraints.

We might look at historical data; however, simply comparing the average retention rates of those who received the campaign and those who didn’t can lead to selection bias.

Example:

Imagine you run a retention campaign aimed at high-value customers:

  • Treatment Group (Received Campaign): High-value customers who are more engaged and have been with the company for a long time.
  • Control Group (Did Not Receive Campaign): A mix of high-value and low-value customers with varying engagement levels.

If you simply compare the average retention rates, you might find that the treatment group has higher retention. However, this higher retention might not be due to the campaign but rather because high-value, engaged customers are naturally more likely to stay with the company.
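To make the selection-bias problem concrete, here is a minimal simulation sketch. All numbers (30% high-value customers, the targeting and retention probabilities) are assumptions made up for illustration, and the campaign is given a true effect of exactly zero. The naive comparison still reports a large uplift, while comparing within each customer segment does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical setup: 30% of customers are high-value.
high_value = rng.random(n) < 0.3

# The campaign targets high-value customers (80% vs 10% chance of receiving it).
treated = rng.random(n) < np.where(high_value, 0.8, 0.1)

# Retention depends ONLY on customer value: the campaign's true effect is zero.
retained = rng.random(n) < np.where(high_value, 0.7, 0.3)

# Naive comparison: difference in raw retention rates.
naive_uplift = retained[treated].mean() - retained[~treated].mean()

# Comparing like with like: difference within each customer segment.
uplift_high = (retained[treated & high_value].mean()
               - retained[~treated & high_value].mean())
uplift_low = (retained[treated & ~high_value].mean()
              - retained[~treated & ~high_value].mean())

print(f"naive uplift:        {naive_uplift:.3f}")
print(f"uplift | high-value: {uplift_high:.3f}")
print(f"uplift | low-value:  {uplift_low:.3f}")
```

The naive uplift is driven entirely by who was selected for the campaign, not by the campaign itself.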

Consequently, alternative methodologies must be explored to estimate the effect of the new campaign, accounting for other factors and considering counterfactual scenarios — essentially, what the retention metrics would look like if the campaign had not been released.

Problem Statement

Estimating the effect of a campaign on retention metrics involves navigating several complex issues:

1. Causal Relationship: What is the impact of the new campaign on the retention metric, and by how much? Determining this requires robust causal inference methods to establish a clear cause-and-effect relationship between the release and any observed changes in retention.

2. Attribution Analysis: Could other factors also impact the retention metric, and how can we isolate their effects? Answering this requires statistical techniques that control for these variables, so that changes in retention attributed to the new campaign are genuinely caused by it.

Addressing these points is essential for accurately assessing the impact of treatment on retention metrics and making informed decisions based on these insights.

Question 1: What is the impact of the new campaign on the retention metric, and by how much?

Identifying the Factors Influencing Retention

Understanding retention involves recognizing that multiple factors can influence whether a user stays or churns.

For example:

  • Reasons for Churn: Users may leave due to dissatisfaction with the product, a poor user experience, or external factors unrelated to the product itself.
  • Reasons for Retention: Conversely, users may stay because they find the product valuable, enjoy new features and user acquisition sources, or have a high level of engagement with the platform.

To gain a more comprehensive understanding of which factors influence retention, it is essential to go beyond simple metrics and correlations. This involves delving into more sophisticated analytical techniques to isolate the true impact of the new campaign and account for various confounding factors. By doing so, we can better understand the drivers of retention and make more informed decisions about product improvements and campaign releases.

Steps to Determine the Average Treatment Effect (ATE)

1. Define Treatment and Outcome Variables:

  • Treatment: treatment (received the campaign = 1, control = 0)
  • Outcome: R1 (retention on Day 1, return = 1, churn = 0)

2. Estimate Causal Effect:
ATE Calculation: The ATE helps us understand the average effect of being exposed to the treatment compared to the control on the Day 1 retention rate.

Example:

Naive uplift = Average(retentionD1 | treatment = 1) − Average(retentionD1 | treatment = 0) = 0.33
Naive uplift = Avg(B) − Avg(C) = 0.33

We then predict, for each user, whether they would return both with and without the treatment (these predicted values are the expected outcomes). For example, user A0002 was treated in the real world and returned. In the counterfactual world where they were not treated, they would have churned.

The key points are:

  • We predict each user’s outcome under both scenarios: treated and not treated.
  • User A0002 returned in the real world, where they were treated, but would have churned in the hypothetical scenario where they were not.

ATE = Avg(D) − Avg(E) = 0.167

This means that, on average, the treatment (the new campaign) increases the Day 1 retention rate by 0.167, i.e. about 16.7 percentage points.
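The arithmetic behind the two numbers can be sketched with a tiny table. The six users below are made up for illustration (this is not the data behind the figures above); the values are chosen so that the calculation lands on the same 0.33 and 0.167. The last two columns play the role of the expected outcomes with and without treatment.

```python
# Hypothetical rows: (user, treated, retained, outcome_if_treated, outcome_if_untreated)
users = [
    ("A0001", 1, 1, 1, 1),
    ("A0002", 1, 1, 1, 0),  # returned when treated, would have churned otherwise
    ("A0003", 1, 0, 0, 0),
    ("A0004", 0, 1, 1, 1),
    ("A0005", 0, 0, 0, 0),
    ("A0006", 0, 0, 0, 0),
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive uplift: observed retention of treated users minus that of control users.
naive_uplift = (mean(r for _, t, r, _, _ in users if t == 1)
                - mean(r for _, t, r, _, _ in users if t == 0))

# ATE: average over ALL users of (outcome if treated) minus (outcome if untreated).
ate = (mean(y1 for _, _, _, y1, _ in users)
       - mean(y0 for _, _, _, _, y0 in users))

print(round(naive_uplift, 3))  # 0.333
print(round(ate, 3))           # 0.167
```

In this toy table only A0002 actually changes behavior because of the treatment, which is why the ATE is one user out of six.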

Importance of Counterfactual Approach

Avoiding Overly Optimistic Estimates:

  • Naive calculations, such as taking the raw uplift of 0.33 at face value, can lead to overly optimistic estimates of the treatment effect.
  • This raw number does not account for what would have happened if users had not received the treatment.

Accurate Estimation:

  • Using a counterfactual approach allows for a more accurate estimation of the true effect of the treatment.
  • This method considers both the observed outcomes and the hypothetical scenarios, providing a more rigorous assessment.

Isolating the Treatment Impact:

  • The counterfactual approach isolates the impact of the treatment by comparing the actual outcome with the predicted outcome in the absence of treatment.
  • This helps to understand the causal effect of the treatment more precisely.

Principled four-step interface for causal inference

https://www.pywhy.org/

The image provides a flowchart that illustrates the process of using the DoWhy library for causal inference. This process involves four main steps: modeling causal mechanisms, identifying the target estimand, estimating the causal effect, and refuting the estimate.

1. Model Causal Mechanisms:
Define the causal graph in which the treatment (the campaign) and other variables affect the outcome (Day 1 retention). The causal graph represents the relationships between the treatment, the outcome, and other variables. It provides a visual and mathematical framework to model and analyze these relationships.

In my opinion, this is the most important and challenging stage. Furthermore, the analysis relies on several causal assumptions:

  • No Unmeasured Confounders: All variables that causally affect both the treatment and the outcome are included in the graph.
  • Positivity: Every unit (user) has a non-zero probability of receiving each treatment level.
  • Consistency: The potential outcome under the observed treatment is the same as the observed outcome.
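Of these assumptions, positivity is the one that can be partly checked from the data. A minimal sketch, assuming users can be grouped into segments (the segment names and rows below are hypothetical): compute the treated share per segment and flag segments where it is 0 or 1.

```python
from collections import defaultdict

def positivity_violations(rows, min_share=0.0):
    """Return segments whose treated share is <= min_share or >= 1 - min_share."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [n_treated, n_total]
    for segment, treated in rows:
        counts[segment][0] += treated
        counts[segment][1] += 1
    shares = {seg: n_t / n for seg, (n_t, n) in counts.items()}
    return {seg: s for seg, s in shares.items()
            if s <= min_share or s >= 1 - min_share}

# Hypothetical (segment, treated) pairs; "new_user" never receives the campaign.
rows = [
    ("high_value", 1), ("high_value", 0),
    ("low_value", 1), ("low_value", 0),
    ("new_user", 0), ("new_user", 0),
]
print(positivity_violations(rows))  # {'new_user': 0.0}
```

A flagged segment has no comparable treated (or untreated) users, so any effect estimated for it is pure extrapolation. Raising `min_share` (for example to 0.05) also flags segments with very thin overlap.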

2. Identify the Target Estimand:
According to the causal graph, there are several methods for identifying a desired causal effect based on the graphical model. These methods leverage graph-based criteria and do-calculus to find potential expressions that can isolate the causal effect of interest. Some common techniques used for this purpose include:

  • Back-door criterion
  • Front-door criterion
  • Instrumental Variables
  • Mediation (Direct and indirect effect identification)

3. Estimate the Causal Effect:
Use a suitable estimation method, such as propensity score matching, to estimate the ATE. This involves matching treated users with similar untreated users based on observed characteristics.
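As a rough sketch of what propensity score matching does, the example below uses simulated data with a single hypothetical confounder (`engagement`) and a true effect fixed at 0.15 by construction. The propensity model is a plain logistic regression fitted with Newton's method, and matching is 1-nearest-neighbor with replacement. Note that matching treated users to controls estimates the effect on the treated; here it coincides with the ATE only because the simulated effect is the same for everyone.

```python
import numpy as np

def propensity_scores(x, t, n_iter=25):
    """Fit P(t=1 | x) with logistic regression via Newton's method."""
    X = np.column_stack([np.ones(len(t)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (t - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hessian, gradient)
    return 1 / (1 + np.exp(-X @ beta))

def matched_effect(ps, t, y):
    """Match each treated user to the control user with the nearest
    propensity score (with replacement) and average the outcome gaps."""
    treated, control = t == 1, t == 0
    order = np.argsort(ps[control])
    ctrl_ps, ctrl_y = ps[control][order], y[control][order]
    pos = np.searchsorted(ctrl_ps, ps[treated])
    left = np.clip(pos - 1, 0, len(ctrl_ps) - 1)
    right = np.clip(pos, 0, len(ctrl_ps) - 1)
    use_left = (np.abs(ps[treated] - ctrl_ps[left])
                <= np.abs(ctrl_ps[right] - ps[treated]))
    nearest = np.where(use_left, left, right)
    return (y[treated] - ctrl_y[nearest]).mean()

# Simulated data: engaged users are more likely to be treated AND to return.
rng = np.random.default_rng(4)
n = 20_000
engagement = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-engagement))).astype(int)
y = (rng.random(n) < 0.4 + 0.2 * np.tanh(engagement) + 0.15 * t).astype(int)

naive = y[t == 1].mean() - y[t == 0].mean()
att = matched_effect(propensity_scores(engagement, t), t, y)
print(f"naive: {naive:.3f}  matched: {att:.3f}  (true effect: 0.15)")
```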

4. Refute the Estimate:
Conduct robustness checks to ensure the estimated ATE is not sensitive to violations of causal assumptions. This might involve checking for hidden biases or performing placebo tests.

For illustration, here is the process of causal inference as a cooking example.

Step 1: Construct the causal graph
Imagine you are planning to cook a dish. To make this dish, you need to identify all the ingredients and the relationships between them. Suppose the dish you want to make is “Italian beef ragu pasta”.

The nodes in your causal graph will include ingredients such as: pasta, beef, tomatoes, onions, garlic, spices, olive oil.

The edges represent how these ingredients influence each other and the final outcome. For example, the beef and tomatoes affect the flavor of the sauce, while the olive oil and garlic affect the aroma.

Step 2: Identify the estimand
This is the step where you determine the specific recipe for the Italian beef ragu pasta.

You need to define your recipe (the estimand), for example “how to make the pasta taste the best”.

There are different recipes for Italian beef ragu pasta, each giving a different flavor.

Similarly, in causal inference, you choose an appropriate identification strategy for the causal effect, such as back-door adjustment, front-door adjustment, or instrumental variables.

Step 3: Estimate the causal effect
This is the step where you execute the cooking according to the chosen recipe.

You add the ingredients into the pot, following the order and method specified in the recipe.

In causal inference, you fit the chosen model to the data to compute the identified estimand.

The cooking tools can be pots, pans, and a stove; similarly, the tools in causal inference can be regression models or machine-learning-based methods.

Step 4: Refute the Estimate
After cooking the dish, you need to check if it meets the requirements.

You will taste the dish to check the doneness and flavor.

In causal inference, you will perform tests to verify the validity of the model and the causal effect estimates, such as sensitivity analysis or other randomization-based methods.

Through these four steps, you not only complete the dish, but also gain a deeper understanding of the causal inference process.

Now, let’s step into Causal Inference

Naive Approach

When we initially analyze the data using straightforward calculations, we observe that the retention rate for treatment is 0.69, while the retention rate for the control is 0.36. This yields an uplift of roughly 0.34, which is a remarkable result on the surface.

The correlation coefficient between the new campaign and Day 1 retention is only 0.31, which by some standards is considered weak. Can this number be trusted? The correlation coefficient alone does not tell us the size of the uplift.

More importantly, correlation does not imply causality, so neither the raw uplift nor the correlation coefficient gives a complete and accurate picture of the new campaign’s impact.
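One point worth making explicit: for a binary treatment and a binary outcome, the correlation coefficient and the naive uplift are the same association expressed on different scales. The sketch below checks this on simulated data. The retention rates 0.69 and 0.36 are taken from the text above; the 30% exposure rate is an assumption made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical exposure: 30% of users receive the campaign.
treatment = (rng.random(n) < 0.3).astype(float)
retained = (rng.random(n) < np.where(treatment == 1, 0.69, 0.36)).astype(float)

uplift = retained[treatment == 1].mean() - retained[treatment == 0].mean()
corr = np.corrcoef(treatment, retained)[0, 1]

# For two binary variables: corr = uplift * sd(treatment) / sd(outcome)
rescaled = uplift * treatment.std() / retained.std()

print(f"uplift: {uplift:.3f}  corr: {corr:.3f}  rescaled uplift: {rescaled:.3f}")
```

So a "weak" correlation does not contradict a large uplift, and neither number says anything about causation on its own.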

Causal “AI” 🙂

Step 1. Model Causal Mechanisms:

To better understand the impact of the new campaign (treatment) on retention metrics, we turn to Causal AI, which allows us to uncover causal relationships and quantify the true effect of interventions. After collecting a wide range of variables and performing feature engineering, we identified 5 key variables that significantly impact Day 1 retention (R1) based on expert knowledge and causal discovery.

The key variables are:

  • Treatment: Indicates whether the user received the campaign (1) or not (0).
  • r1: Represents Day 1 retention, where a user returning on Day 1 is marked as 1, and a user who churns is marked as 0.

In causal diagrams, an arrow from A to B (A→B) signifies that variable A impacts variable B.
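A causal graph of this kind can be written down as a plain edge list. In the sketch below only `treatment` and `r1` come from this analysis; `engagement`, `tenure_days`, and `acquisition_source` are placeholder confounders invented for the example. The helper serializes the edges to a DOT string (a format DoWhy has accepted for the `graph` argument; check the version you have installed) and verifies that the graph is acyclic.

```python
# Hypothetical edges: each pair (a, b) means "a affects b".
edges = [
    ("engagement", "treatment"), ("engagement", "r1"),
    ("tenure_days", "treatment"), ("tenure_days", "r1"),
    ("acquisition_source", "treatment"), ("acquisition_source", "r1"),
    ("treatment", "r1"),
]

def to_dot(edges):
    """Serialize an edge list as a DOT digraph string."""
    body = " ".join(f"{a} -> {b};" for a, b in edges)
    return f"digraph {{ {body} }}"

def is_dag(edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, child in edges:
        indegree[child] += 1
    queue = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
    return visited == len(nodes)

dot = to_dot(edges)
print(dot)
print("acyclic:", is_dag(edges))
```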

To quantify the effect of treatment on retention, we will use the Average Treatment Effect (ATE).

Step 2: Identify the Target Estimand:

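from dowhy import CausalModel
# df_normalized: prepared user-level data; G: the causal graph from Step 1
model = CausalModel(data=df_normalized, treatment='treatment', outcome='r1', graph=G, missing_nodes_as_confounders=True)
# Derive the estimand implied by the graph
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)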

In our case, we will use the back-door criterion to identify the causal effect of the treatment (campaign) on the outcome (Day 1 retention) using the DoWhy library. The back-door criterion is a method for controlling confounding variables to estimate the causal effect accurately.

The back-door criterion involves selecting a set of variables (confounders) that, when conditioned upon, block all back-door paths from the treatment to the outcome. This ensures that the association between the treatment and the outcome is not confounded by other variables.
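Written as a formula, back-door adjustment looks as follows. The notation is mine, not DoWhy's output: T is the treatment, Y the outcome (Day 1 retention), and Z a set of confounders that satisfies the back-door criterion.

```latex
P\bigl(Y \mid do(T = t)\bigr) = \sum_{z} P(Y \mid T = t, Z = z)\, P(Z = z)

\mathrm{ATE} = \sum_{z} \bigl( \mathbb{E}[Y \mid T = 1, Z = z] - \mathbb{E}[Y \mid T = 0, Z = z] \bigr)\, P(Z = z)
```

In words: compare treated and untreated users within each confounder stratum, then average those differences using the share of each stratum in the whole population.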

Step 3: Estimate the Causal Effect:

In this case, I use a simple approach: regression. However, there are many other powerful methods, such as propensity score methods and doubly robust estimators, among others.
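The idea of regression adjustment can be sketched with NumPy alone. This is not the model fitted in this analysis: the data is simulated, `engagement` is a hypothetical confounder, and the true effect is set to 0.17 by construction so that we can see whether the method recovers it. The estimate is the coefficient on `treatment` in a linear probability model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical confounder: engaged users are more likely to get the campaign.
engagement = rng.normal(size=n)
treatment = (rng.random(n) < 1 / (1 + np.exp(-1.5 * engagement))).astype(float)

# Day 1 return probability: depends on engagement, plus a true effect of 0.17.
p_return = np.clip(0.35 + 0.12 * engagement + 0.17 * treatment, 0, 1)
r1 = (rng.random(n) < p_return).astype(float)

# Naive estimate: raw difference in retention rates.
naive = r1[treatment == 1].mean() - r1[treatment == 0].mean()

# Regression adjustment: r1 ~ 1 + treatment + engagement (ordinary least squares).
X = np.column_stack([np.ones(n), treatment, engagement])
coef, *_ = np.linalg.lstsq(X, r1, rcond=None)
adjusted = coef[1]

print(f"naive: {naive:.3f}  adjusted: {adjusted:.3f}  (true effect: 0.17)")
```

The adjustment only works for confounders that are actually in the regression, which is why the causal graph from Step 1 matters so much.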

Based on the DoWhy library’s analysis, we have obtained the following estimand and causal effect estimate:

Realized Estimand
The realized estimand describes the regression model used to estimate the effect of the treatment on the outcome. In this case, Day 1 retention is a binary outcome, so the model is:

Mean value: 0.17
95.0% confidence interval: (0.15, 0.2)

The estimate of 0.17 is consistent across different categorical values of these variables in the dataset.

This estimate indicates that increasing the treatment variable (treatment) from 0 to 1 causes an average increase of approximately 0.17 in the expected value of the outcome (Day 1 retention, r1), based on the data distribution/population represented by the dataset.

Interpretation: The positive increase in Day 1 retention suggests that the campaign is effective in retaining users. This is a more refined and realistic estimate compared to the naive approach, which suggested a higher increase of 0.34. Let’s delve into why the causal approach provides a more reliable estimate and how it compares to the naive approach.

Why the Difference?

Control for Confounders:

  • The naive approach does not control for confounding variables, which can inflate the perceived effect of the campaign.
  • The causal approach controls for these variables, providing a more accurate estimate of the true effect.

Realistic Estimate:

  • The naive estimate of 0.34 may include the effects of other factors that are not related to the campaign itself.
  • The causal estimate of 0.17 isolates the effect of the campaign, leading to a lower but more realistic measure of its impact.

Final step: Refuting the Estimate: Ensuring Robustness of the Causal Effect

To ensure the robustness of our estimated causal effect, we need to refute the estimate by using placebo treatments or adding random common causes. This helps verify that the estimated effect is not driven by random chance or hidden biases.

Refutation Techniques and Results

Placebo Treatment Test:

  • Question: What happens to the estimated causal effect when we replace the true treatment variable with an independent random variable? (Hint: the effect should go to zero)
  • Purpose: To check if the observed effect could be obtained by chance.
  • Method: Introduce a placebo treatment that should have no causal effect on the outcome.
  • Results:
  • The new effect is very close to zero, indicating that the original effect is unlikely to be due to random chance.
  • A high p-value (0.94) suggests that the placebo treatment has no significant effect, reinforcing the validity of the original estimate.

Random Common Cause Test:

  • Question: Does the estimation method change its estimate after we add an independent random variable as a common cause to the dataset? (Hint: It should not)
  • Purpose: To check that the estimate is stable when an irrelevant variable is added as a common cause. Note that this is a sanity check; it cannot rule out real unobserved confounders.
  • Method: Introduce a random variable as a common cause and check its impact on the estimated effect.
  • Results:
  • The new effect is almost identical to the original effect, suggesting that the estimated effect is robust to the addition of a random common cause.
  • A high p-value (0.98) indicates that the new estimate is not significantly different from the original one, further supporting the reliability of the original estimate.
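Both checks can be sketched by hand to show what they do. This is not DoWhy's implementation: the data is simulated with a hypothetical confounder (`engagement`) and a true effect of 0.17 set by construction, and the estimator is a plain least-squares regression.

```python
import numpy as np

def adjusted_effect(treatment, outcome, covariates):
    """Coefficient on treatment from OLS: outcome ~ 1 + treatment + covariates."""
    X = np.column_stack([np.ones(len(outcome)), treatment, covariates])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef[1]

rng = np.random.default_rng(2)
n = 20_000
engagement = rng.normal(size=n)
treatment = (rng.random(n) < 1 / (1 + np.exp(-1.5 * engagement))).astype(float)
p_return = np.clip(0.35 + 0.12 * engagement + 0.17 * treatment, 0, 1)
r1 = (rng.random(n) < p_return).astype(float)

original = adjusted_effect(treatment, r1, engagement)

# Placebo treatment: shuffle the treatment column so it carries no causal
# signal, re-estimate, and average over several shuffles.
placebo = np.mean([
    adjusted_effect(rng.permutation(treatment), r1, engagement)
    for _ in range(50)
])

# Random common cause: add an independent noise column as an extra covariate.
noise = rng.normal(size=n)
with_random_cause = adjusted_effect(
    treatment, r1, np.column_stack([engagement, noise])
)

print(f"original estimate:        {original:.3f}")
print(f"placebo treatment:        {placebo:.3f}  (should be near 0)")
print(f"with random common cause: {with_random_cause:.3f}  (should be unchanged)")
```

Passing both checks does not prove the estimate is right; failing either one is a strong sign that something is wrong.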

Conclusion

By employing a rigorous causal inference approach, we obtained a realistic and reliable estimate of the campaign’s impact on Day 1 retention. This method addresses the limitations of naive calculations and ensures that the findings are robust and actionable. The campaign has a positive effect on retention, and these insights can be used to make informed decisions and enhance marketing effectiveness.

Next Step: Attribution Analysis

To address the second question of our analysis (Attribution Analysis: Could other factors also impact the retention metric, and how can we isolate their effects?), we will introduce and apply inference based on graphical causal models (GCM). This method will help us understand and attribute the impact of various factors on retention metrics more precisely.
