For ML Researchers
SuNaAI Lab
Technical Guide Series
Understanding causation in machine learning
Traditional machine learning excels at finding patterns and correlations in data. However, many real-world problems require understanding causation: What happens if we change X? Will treatment Y improve outcome Z? Causal inference provides the toolkit to answer these questions.
In many ML applications, we need to predict the effect of interventions and understand mechanisms. Causal inference enables us to move beyond associations to true cause-and-effect relationships.
Understanding the fundamental difference
Ice cream sales and drowning deaths are correlated (both increase in summer), but ice cream doesn't cause drowning. Correlation measures association; causation requires understanding mechanisms.
Confounders: Factors that affect both treatment and outcome, creating spurious associations (a short simulation follows these definitions).
Example: Age confounds the relationship between exercise (treatment) and health (outcome) - older people exercise less and have worse health.
Selection bias: When individuals are not randomly assigned to treatment, systematic differences arise between groups.
Example: Comparing job training programs between volunteers and non-volunteers is biased because volunteers might be more motivated.
Mediators: Variables on the causal path that may mediate or modify the treatment effect.
Example: Education → Job Skills → Income. Job skills mediate the effect of education on income.
Endogeneity: When the treatment variable is correlated with the error term in the model.
Example: Price and demand are endogenous - price affects demand, but high demand also affects price.
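To see how a confounder manufactures a spurious association, here is a minimal simulation sketch of the ice cream and drowning example (hypothetical numbers, using numpy and pandas), with temperature as the common cause:

import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
n = 1000
# Common cause: summer temperature drives both variables
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.5 * temperature + rng.normal(0, 2, n)  # ice cream plays no causal role
sim = pd.DataFrame({'temperature': temperature,
                    'ice_cream_sales': ice_cream_sales,
                    'drownings': drownings})
# Strong, but entirely spurious, correlation
print(sim['ice_cream_sales'].corr(sim['drownings']))
# Partial out the confounder: the residual correlation is near zero
resid_ice = ice_cream_sales - np.poly1d(np.polyfit(temperature, ice_cream_sales, 1))(temperature)
resid_drown = drownings - np.poly1d(np.polyfit(temperature, drownings, 1))(temperature)
print(np.corrcoef(resid_ice, resid_drown)[0, 1])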
The Rubin causal model
The potential outcomes framework, also known as the Rubin causal model, is a powerful way to think about causal effects. It defines treatment effects as the difference between potential outcomes under different treatment conditions.
For each individual i, we define a treatment indicator T_i (1 if treated, 0 if not), a potential outcome Y_i(1) under treatment, and a potential outcome Y_i(0) under control.
For each individual, we can only observe one potential outcome. We see Y(1) if they receive treatment, or Y(0) if they don't—never both.
Observed: Y_i = T_i × Y_i(1) + (1 - T_i) × Y_i(0)
Never observed: both Y_i(1) and Y_i(0) for the same individual (the fundamental problem of causal inference)
Consistency: The observed outcome under a given treatment is the potential outcome under that treatment.
Ignorability (unconfoundedness): Treatment assignment is independent of the potential outcomes, conditional on covariates.
(Y(0), Y(1)) ⟂ T | X
Positivity (overlap): Every individual has a positive probability of receiving either treatment or control, i.e., 0 < P(T = 1 | X) < 1.
SUTVA (no interference): No spillover effects; one person's treatment doesn't affect another person's outcome.
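These assumptions can be made concrete with a small simulation: under random assignment, ignorability holds by design, so a simple difference in observed means recovers the average treatment effect even though only one potential outcome is ever observed per unit. A minimal sketch with a hypothetical true effect of 2.0:

import numpy as np
rng = np.random.default_rng(42)
n = 100_000
# Potential outcomes: the true effect is 2.0 on average
y0 = rng.normal(10, 3, n)         # outcome without treatment
y1 = y0 + rng.normal(2.0, 1, n)   # outcome with treatment
# Random assignment: T is independent of (Y(0), Y(1))
t = rng.integers(0, 2, n)
# Only one potential outcome is observed per unit
y_obs = t * y1 + (1 - t) * y0
# Difference in observed means estimates E[Y(1) - Y(0)]
ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(f"True ATE = 2.0, estimated ATE = {ate_hat:.2f}")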
Balancing observed confounders
Propensity scores help balance treatment and control groups on observed covariates, reducing selection bias in observational studies. They're the probability of receiving treatment given observed characteristics.
e(x) = P(T = 1 | X = x)

where e(x) is the propensity score, T is the treatment indicator (0 or 1), and X are the observed covariates.
Nearest-neighbor (1:1) matching: Match each treated unit with the closest control unit on the propensity score.
Pros: Simple, preserves sample size
Cons: May leave good matches unused
k-nearest-neighbor matching: Match each treated unit with the k closest control units.
Pros: Better variance estimation
Cons: Requires multiple controls
Caliper matching: Only match units whose propensity scores fall within a certain distance threshold.
Pros: Ensures quality matches
Cons: May leave some units unmatched
Stratification: Group units into strata based on propensity score quintiles and compare outcomes within strata (a minimal sketch follows this list).
Pros: Uses all data
Cons: May have poor balance within strata
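Here is a minimal sketch of the stratification strategy, assuming the dataframe df already holds estimated propensity scores in a 'propensity' column alongside 'treatment' and 'outcome' columns (exactly what the full walkthrough below produces):

import numpy as np
import pandas as pd
# Assign each unit to a propensity score quintile
df['stratum'] = pd.qcut(df['propensity'], q=5, labels=False)
stratum_effects, stratum_sizes = [], []
for _, stratum in df.groupby('stratum'):
    treated_y = stratum.loc[stratum['treatment'] == 1, 'outcome']
    control_y = stratum.loc[stratum['treatment'] == 0, 'outcome']
    if len(treated_y) > 0 and len(control_y) > 0:
        stratum_effects.append(treated_y.mean() - control_y.mean())
        stratum_sizes.append(len(stratum))
# Average the stratum-level effects, weighting by stratum size
ate_stratified = np.average(stratum_effects, weights=stratum_sizes)
print(f"Stratified ATE estimate: {ate_stratified:.2f}")

The walkthrough below shows the nearest-neighbor variant end to end, starting from propensity score estimation.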
import pandas as pd
from sklearn.linear_model import LogisticRegression
# Step 1: Estimate propensity scores
X = df[['age', 'education', 'income']] # covariates
T = df['treatment'] # treatment indicator
y = df['outcome'] # outcome
# Fit logistic regression to estimate propensity scores
model = LogisticRegression()
model.fit(X, T)
propensity_scores = model.predict_proba(X)[:, 1]
df['propensity'] = propensity_scores  # store the scores for matching
# Step 2: Match on propensity score
from sklearn.neighbors import NearestNeighbors
treated = df[df['treatment'] == 1]
control = df[df['treatment'] == 0]
# Find nearest neighbors
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[['propensity']])
# Match each treated unit to nearest control
distances, indices = nn.kneighbors(treated[['propensity']])
# Step 3: Estimate the effect on the treated (ATT) using the matched sample
matched_control = control.iloc[indices.flatten()]
ate = treated['outcome'].mean() - matched_control['outcome'].mean()
print(f"Average Treatment Effect: {ate:.2f}")Dealing with unobserved confounders
When unobserved confounders bias our treatment effect estimates, instrumental variables provide a way to identify causal effects. A valid instrument affects the treatment but influences the outcome only through the treatment, and is itself independent of the unobserved confounders.
# Two-stage least squares (2SLS)
# Stage 1: Regress treatment on instrument: T_hat = a + b*Z + e
# Stage 2: Regress outcome on predicted treatment: Y = alpha + beta*T_hat + u
# The coefficient beta is the causal effect estimate.
# One option is the linearmodels package (pip install linearmodels):
from linearmodels.iv import IV2SLS
# y: outcome, X: exogenous controls (include a constant column),
# T: endogenous treatment, Z: instrument(s)
result = IV2SLS(dependent=y, exog=X, endog=T, instruments=Z).fit()
print(result.summary)
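To make the two stages explicit, here is a by-hand 2SLS sketch with scikit-learn, assuming arrays Z (instrument), T (treatment), and y (outcome) are already defined. It is illustrative only: the manual second stage does not give valid standard errors, so a dedicated IV package is preferable in practice.

from sklearn.linear_model import LinearRegression
# Assumed shapes: Z is (n, 1), T and y are (n,)
# Stage 1: predict the treatment from the instrument
stage1 = LinearRegression().fit(Z, T)
T_hat = stage1.predict(Z)
# Stage 2: regress the outcome on the predicted treatment
stage2 = LinearRegression().fit(T_hat.reshape(-1, 1), y)
print(f"2SLS estimate of the treatment effect: {stage2.coef_[0]:.3f}")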
In RCTs, randomization creates a natural instrument. Random assignment affects treatment but is independent of outcomes.
Policy changes that vary by region can serve as instruments for local treatment exposure.
Natural experiments like policy changes over time can create instrumental variation.
Legal changes that affect treatment but not outcomes directly can serve as instruments.
Learning causal structures from data
Causal discovery aims to learn the causal structure underlying observed data. This involves identifying directed edges in a causal graph and understanding which variables cause which.
Constraint-based methods: Use conditional independence tests to identify causal structure.
Example: PC algorithm, FCI (Fast Causal Inference)
Strengths: Grounded in well-understood conditional independence tests
Limitations: Requires faithfulness assumption
Score-based methods: Search over candidate causal graphs and score them with a likelihood or information criterion.
Example: GES (Greedy Equivalence Search) with the BIC score
Strengths: Can handle complex structures
Limitations: Computationally expensive
Functional causal models: Assume specific functional forms for the causal mechanisms and use them to infer causal direction.
Example: ANM (Additive Noise Models), LiNGAM
Strengths: Identifiable under certain assumptions
Limitations: Requires specific model assumptions
Deep learning approaches: Use neural networks to learn causal structures from observational data.
Example: Neural Causal Models, DAG-GNN
Strengths: Can capture complex nonlinear relationships
Limitations: Requires large datasets
# Using py-causal (Python); schematic sketch, since the exact search API
# varies across py-causal/Tetrad versions (consult the library docs)
import pydotplus
from pycausal.pycausal import pycausal
pc = pycausal()
pc.start_vm()  # py-causal runs the Java Tetrad library on the JVM
# Learn a causal graph from the dataframe df
graph = pc.search(df, depth=2, verbose=True)  # illustrative call; see the docs for the current search interface
print(graph.getNodes())
# Using DoWhy (Python)
from dowhy import CausalModel
model = CausalModel(
    data=df,
    treatment='treatment',
    outcome='outcome',
    graph=causal_graph
)
# Identify estimand
identified_estimand = model.identify_effect()
print(identified_estimand)
# Estimate effect
causal_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_matching"
)

Applications and implementations
Uplift modeling: Identify customers who respond to a treatment (offer, ad, etc.) to optimize marketing campaigns and personalization.
Causal feature importance: Understand which features actually causally affect outcomes, beyond correlation-based importance.
Fairness: Detect and mitigate causal discrimination in ML models, ensuring fair treatment across groups.
Counterfactual explanations: Provide explanations like "What would happen if we changed X?" for model predictions.
DoWhy: End-to-end causal reasoning library with identification, estimation, and refutation methods.
pip install dowhy
EconML: Microsoft's library for causal inference using machine learning methods.
pip install econml
CausalML: Uber's library for uplift modeling and causal inference with ML.
pip install causalml

Guidelines for causal inference
Selection bias: When treated and control groups differ systematically, standard methods give biased estimates. Use randomization or careful matching.
Post-treatment bias: Including variables affected by the treatment in your analysis can bias estimates. Only adjust for pre-treatment variables.
Weak instruments: Weak instruments can lead to large standard errors and biased estimates. Always test instrument strength, as in the sketch below.
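A quick check of instrument strength is the first-stage F-statistic (a common rule of thumb treats values below roughly 10 as weak). A minimal sketch with statsmodels, assuming Z holds the instrument(s) and T the treatment:

import statsmodels.api as sm
# First stage: regress the treatment on the instrument(s)
first_stage = sm.OLS(T, sm.add_constant(Z)).fit()
# Rule of thumb: an F-statistic below ~10 suggests a weak instrument
print(f"First-stage F-statistic: {first_stage.fvalue:.1f}")
print(first_stage.summary())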