Understanding Difference-in-Differences: A Practical Guide Using Stata

Exploring quasi-experimental methods through a health policy case study
stata
regression
difference-in-difference
Author

Nadhira A. Hendra

Published

January 15, 2026

Modified

January 19, 2026

Introduction

This analysis started as a class assignment in Economic Development at Columbia. The task was to think through the experimental design of a study evaluating the effect of mobile health clinics on prenatal care utilization in rural Rajasthan, India.

Unlike my previous post on World Bank indicators (which was purely correlational), this one dives into causal inference. Difference-in-differences (DiD) lets us estimate treatment effects from data observed both before and after an intervention starts.

DiD is one of the most widely used quasi-experimental methods in empirical economics. It is particularly valuable in contexts where a randomized controlled trial is infeasible. However, it is also often used in combination with randomization. The key insight of DiD is that the change over time in untreated units tells us what would have happened to treated units without treatment, which lets us adjust for any baseline difference between the control and treatment groups.

In this post I’ll walk through the design and the assumptions behind it, then show how to implement it in Stata. The post serves as a personal reference and, hopefully, as a useful resource for others learning causal inference methods.

The Problem

Imagine researchers want to evaluate a mobile health clinic program that began operating in treatment villages in January 2023. The intervention involves sending mobile clinics to randomly selected villages once per week, offering free prenatal checkups, basic medications, and health education to pregnant women.

The researchers were able to collect data on prenatal care visits in two periods:

  • Before: 2022 (pre-treatment)
  • After: 2023 (post-treatment)

Why Randomize at the Village Level?

A natural question: why not randomize at the individual level? Why assign entire villages to treatment or control?

Because this design includes randomization, one assumption we need is SUTVA (the Stable Unit Treatment Value Assumption), also called non-interference: one woman's outcome should not depend on whether another woman is treated. If we randomized at the individual level, so that a single village contained both women assigned to treatment and women assigned to control, we’d face several problems:

  1. Spillover effects: If a woman in the treatment group shares information about prenatal benefits with her neighbor in the control group, or if the control woman sees treatment women getting checkups and decides to seek care herself, we’d underestimate the true treatment effect.

  2. Correlated outcomes: Women within the same village are more alike than women across villages (similar cultural norms, access to facilities, socioeconomic conditions). This means observations aren’t truly independent.

  3. Understated standard errors: Conventional standard errors assume each observation contributes independent information. When outcomes are correlated within villages, we have less effective information than the sample size suggests, so unadjusted standard errors come out too small.

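By randomizing at the village level, we cleanly separate treatment and control groups and avoid the spillover problem. Outcomes within a village remain correlated, so we still need to cluster standard errors at the village level in the analysis.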
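
Village-level assignment can be sketched in a few lines of Stata. This is only an illustration, not the study's actual procedure: the 100 villages, the 50/50 split, and the helper variable `u` are all assumptions.

```stata
* Hypothetical sketch: assign 50 of 100 villages to treatment at random
clear
set seed 12345
set obs 100
generate village_id = _n

* Draw a random number per village and sort on it to shuffle the villages
generate u = runiform()
sort u

* The first 50 villages in the shuffled order receive the mobile clinic
generate treatment = (_n <= 50)

* Restore the original order and confirm the split
sort village_id
tabulate treatment
```

Every woman in a village inherits her village's assignment, so no village ever contains both treated and control women.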

The DiD Framework

The 2x2 Grid

What I like about DiD is its simplicity. To compute the estimate we need only four group means, which fit in a 2x2 grid:

| | Control Villages | Treatment Villages |
|---|---|---|
| Before (2022) | \(\bar{Y}_{00}\) | \(\bar{Y}_{10}\) |
| After (2023) | \(\bar{Y}_{01}\) | \(\bar{Y}_{11}\) |

Where:

  • \(\bar{Y}_{00}\) = average prenatal visits before treatment in control group
  • \(\bar{Y}_{01}\) = average prenatal visits after treatment in control group
  • \(\bar{Y}_{10}\) = average prenatal visits before treatment in treatment group
  • \(\bar{Y}_{11}\) = average prenatal visits after treatment in treatment group

The DiD estimator is:

\[\text{DiD} = (\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00})\]

In words: take the change in the treatment group, subtract the change in the control group.

The Regression Equation

We can express this in regression form:

\[Y_{it} = \alpha + \beta \cdot Treatment_i + \gamma \cdot After_t + \tau \cdot (Treatment_i \times After_t) + \epsilon_{it}\]

Where:

  • \(\alpha\): baseline mean for control group in the pre-period
  • \(\beta\): time-invariant difference between treatment and control (selection effect)
  • \(\gamma\): time trend common to both groups (what would have happened anyway)
  • \(\tau\) (the DiD estimate): the causal effect of mobile clinics on prenatal visits, provided the parallel trends assumption holds

The coefficient \(\tau\) is what we care about. It captures the additional change in the treatment group beyond what the control group experienced.
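
To see why \(\tau\) equals the DiD estimator from the 2x2 grid, it helps to write out the expected outcome in each cell. This short derivation assumes the error term has mean zero within each cell:

\[
\begin{aligned}
E[Y \mid Treatment = 0, After = 0] &= \alpha \\
E[Y \mid Treatment = 1, After = 0] &= \alpha + \beta \\
E[Y \mid Treatment = 0, After = 1] &= \alpha + \gamma \\
E[Y \mid Treatment = 1, After = 1] &= \alpha + \beta + \gamma + \tau
\end{aligned}
\]

Taking the change in the treatment group and subtracting the change in the control group:

\[\big[(\alpha + \beta + \gamma + \tau) - (\alpha + \beta)\big] - \big[(\alpha + \gamma) - \alpha\big] = (\gamma + \tau) - \gamma = \tau\]

Both \(\beta\) (the baseline gap) and \(\gamma\) (the common time trend) cancel, leaving only \(\tau\).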

Exploratory Analysis

Setup

For reference on setting up Stata in R, you can refer here.

Simulating the Data

For illustration, I’ll create a dataset that mimics what researchers might have collected.

Code
clear

* Set seed for reproducibility
set seed 12345

* Create village-level data
set obs 100
generate village_id = _n
* Treat the first 50 villages (IDs are arbitrary labels, so this stands in for random assignment)
generate treatment = (village_id <= 50)

* Expand to woman-level (10 women per village)
expand 10
bysort village_id: generate woman_id = _n

* Expand to panel (before and after)
expand 2
bysort village_id woman_id: generate period = _n - 1
label define period_lbl 0 "Before (2022)" 1 "After (2023)"
label values period period_lbl

* Generate outcome with DiD structure
* Base: 2 visits on average
* Treatment villages slightly lower at baseline: -0.2
* Time trend for everyone: +0.3
* Treatment effect: +0.8 visits

generate prenatal_visits = 2 + ///
    (-0.2) * treatment + ///
    (0.3) * period + ///
    (0.8) * treatment * period + ///
    rnormal(0, 0.8)

* Round to integers (can't have fractional visits)
replace prenatal_visits = round(prenatal_visits, 1)
replace prenatal_visits = 0 if prenatal_visits < 0

save "did_simulation.dta", replace

describe
Number of observations (_N) was 0, now 100.



(900 observations created)


(1,000 observations created)





(2,000 real changes made)

(1 real change made)

file did_simulation.dta saved


Contains data from did_simulation.dta
 Observations:         2,000                  
    Variables:             5                  19 Jan 2026 07:37
-------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------
village_id      float   %9.0g                 
treatment       float   %9.0g                 
woman_id        float   %9.0g                 
period          float   %13.0g     period_lbl
                                              
prenatal_visits float   %9.0g                 
-------------------------------------------------------------------------------------------
Sorted by: village_id  woman_id

Computing Group Means

Code
use "did_simulation.dta", clear

* Create the 2x2 table
table treatment period, statistic(mean prenatal_visits) statistic(sd prenatal_visits)
                       |                   period                
                       |  Before (2022)   After (2023)      Total
-----------------------+-----------------------------------------
treatment              |                                         
  0                    |                                         
    Mean               |           2.06          2.332      2.196
    Standard deviation |       .8589969       .8527781   .8662184
  1                    |                                         
    Mean               |           1.81          2.946      2.378
    Standard deviation |        .831434       .8789933   1.026728
  Total                |                                         
    Mean               |          1.935          2.639      2.287
    Standard deviation |       .8541104       .9184348   .9539843
-----------------------------------------------------------------
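
Plugging the four means from this table into the DiD formula gives the estimate by hand:

\[\text{DiD} = (2.946 - 1.81) - (2.332 - 2.06) = 1.136 - 0.272 = 0.864\]

Treatment villages gained 1.136 visits on average and control villages gained 0.272, so the difference of 0.864 is the DiD estimate. The same number should appear as the interaction coefficient in the regression form.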

Running the DiD Regression

Code
use "did_simulation.dta", clear

* Generate interaction term
generate treat_post = treatment * period

* DiD regression with clustered standard errors
regress prenatal_visits treatment period treat_post, cluster(village_id)
Linear regression                               Number of obs     =      2,000
                                                F(3, 99)          =     133.22
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1966
                                                Root MSE          =     .85572

                           (Std. err. adjusted for 100 clusters in village_id)
------------------------------------------------------------------------------
             |               Robust
prenatal_v~s | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   treatment |       -.25   .0534681    -4.68   0.000    -.3560923   -.1439077
      period |       .272   .0542785     5.01   0.000     .1642996    .3797004
  treat_post |       .864    .080151    10.78   0.000      .704963    1.023037
       _cons |       2.06    .038061    54.12   0.000     1.984479    2.135521
------------------------------------------------------------------------------

The coefficient on treat_post is our DiD estimate. In this simulated data, mobile clinics increased prenatal visits by about 0.86 visits per woman (95% confidence interval: 0.70 to 1.02). This is close to the true effect of 0.8 that we built into the simulation.
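
As an alternative to building the interaction by hand, Stata's factor-variable notation can generate it inside the regression. The sketch below assumes did_simulation.dta from the simulation step is in the working directory; the scalar names tau_manual and tau_factor are just illustrative.

```stata
* Assumes did_simulation.dta (saved in the simulation step) is in the working directory
use "did_simulation.dta", clear

* Hand-built interaction, as in the regression above
generate treat_post = treatment * period
regress prenatal_visits treatment period treat_post, vce(cluster village_id)
scalar tau_manual = _b[treat_post]

* Factor-variable notation: ## expands to both main effects plus the interaction
regress prenatal_visits i.treatment##i.period, vce(cluster village_id)
scalar tau_factor = _b[1.treatment#1.period]

* The two specifications are algebraically identical, so the estimates should match
display "manual: " tau_manual "   factor-variable: " tau_factor
assert reldif(tau_manual, tau_factor) < 1e-8
```

The factor-variable version avoids creating an extra variable and makes it harder to forget one of the main effects.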

Assumptions

Additional Assumptions

For this case study specifically, we also need some additional assumptions:

  1. No spillovers: treatment in one village does not affect outcomes in another. Cluster randomization at the village level handles spillovers within a village, but spillovers between villages still have to be assumed away.
  2. No anticipation effects: women didn’t change their behavior before the mobile clinics arrived.
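
Parallel trends can't be tested directly, but with an extra pre-treatment wave a placebo DiD gives a rough check. The sketch below is hypothetical: the case study has only 2022 and 2023 data, so the 2021 wave, the variable fake_post, and the simulated values are all assumptions made for illustration.

```stata
* Hypothetical sketch: placebo DiD on two pre-treatment years (2021 and 2022)
* The 2021 wave is an assumption; the case study itself has only 2022 and 2023
clear
set seed 2024
set obs 100
generate village_id = _n
generate treatment = (village_id <= 50)

* 10 women per village, each observed in 2021 and 2022
expand 10
bysort village_id: generate woman_id = _n
expand 2
bysort village_id woman_id: generate year = 2020 + _n
generate fake_post = (year == 2022)

* Baseline gap and common trend only: no clinics exist yet, so no treatment effect
generate prenatal_visits = 2 - 0.2 * treatment + 0.3 * fake_post + rnormal(0, 0.8)

* Placebo DiD: the interaction should be near zero if pre-trends are parallel
regress prenatal_visits i.treatment##i.fake_post, vce(cluster village_id)
lincom 1.treatment#1.fake_post
```

A placebo interaction that is large relative to its standard error would be a warning sign that the two groups were already diverging before the clinics arrived. A small one is reassuring but does not prove that trends would have stayed parallel afterwards.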

Conclusion

Difference-in-Differences is a method where we compare changes over time between groups that did and didn’t receive treatment. DiD removes both time-invariant differences between groups (existing differences between control and treatment before the intervention started) and time trends common to both groups.

The method requires careful attention to the parallel trends assumption, which can’t be tested directly but can be probed by analysing pre-treatment data.

Additionally, alongside cost and logistics, we need to choose the unit of randomization carefully. Here, randomizing at the village level helps avoid SUTVA violations.

On a practical note, implementing DiD in Stata is straightforward because in essence it’s just a regression with an interaction term. The harder part is thinking through the assumptions and whether they’re plausible for the case at hand.