The didnp package contains tools for computing average treatment effect parameters in a Difference-in-Differences setup without specifying a model.
You can install didnp from CRAN with:
install.packages("didnp", dependencies = TRUE)
or get the latest version from github with:
if ( !require("devtools") ) install.packages("devtools"); library(devtools)
devtools::install_github("OlegBadunenko/didnp")
The following is a simplified example …, which comes from that.
Data set is shared by Kuka et al, 2020. Here we showcase the functionality of the package by using a subset, which is available in the package and can be loaded by
library(didnp)
library(ggplot2)
data(DACAsub, package = "didnp")
head(DACAsub)
#R> inschool hs scol post elig fem race var.bpl state year age yrimmig ageimmig
#R> 1 0 1 1 1 0 1 3 4 1 2015 27 1988 0
#R> 2 0 1 1 0 0 1 1 0 1 2007 29 1978 0
#R> 3 1 0 0 1 0 1 4 4 1 2013 16 2000 3
#R> 4 0 0 0 1 1 0 3 3 1 2012 23 1990 1
#R> 5 0 1 1 1 0 1 3 3 1 2014 26 1987 -1
#R> 6 0 1 0 0 0 1 2 3 1 2006 21 1986 1
#R> a1418 a1922 a2330 htus perwt
#R> 1 NA NA 1 NA 32
#R> 2 NA NA 1 NA 39
#R> 3 1 NA NA NA 119
#R> 4 NA NA 1 NA 11
#R> 5 NA NA 1 NA 111
#R> 6 NA 1 NA NA 320
The description of the dataset can be found by typing
?DACAsub
Although this can be done on the fly, the subsample can be prepared beforehand:
# get the subsample
DACAsub$sub_a1418_all <- mysmpl <-
DACAsub$a1418==1 & !is.na(DACAsub$a1418)
table(DACAsub$sub_a1418_all)
#R>
#R> FALSE TRUE
#R> 215653 114453
# generate 'treatment_period'
DACAsub$treatment_period <- ifelse( DACAsub[,"year"] > 2011, 1, 0)
Define the formula that we will use:
form1 <- inschool ~ fem + race + var.bpl + state + age + yrimmig +
ageimmig | inschool | year | elig | treatment_period | perwt
To obtain standard errors and perform testing,
we will use a few number of bootstrap replicaitons here, but we advise to set
boot.num = 399
or larger.
B <- 99
To test if there is a violation of the bias stability condition use command didnptest
tym1test <- didnpbsctest(
form1,
data = DACAsub,
subset = mysmpl,
boot.num = 99,
print.level = 2,
cores = 16)
#R> Number of Observations is 114453
#R> Number of Unordered Categorical Regressors is 4
#R> Number of Ordered Categorical Regressors is 3
#R>
#R> Bandwidths are chosen via the plug-in method
#R>
#R> Calculating BSC
#R> BSC = 0.1345672413
#R> Calculating BSC completed in 8 seconds
#R>
#R> Bootstrapping the statistic (99 replications)
#R> Calculating residuals for the alternative model
#R> Calculating residuals for the alternative model completed in 36 seconds
#R> Calculating fitted values under the null hypothesis
#R> Calculating fitted values under the null hypothesis completed in 47 seconds
#R>
#R> The main Loop of the Bootstrapping started
#R> Bootstrapping the statistic completed in 6 minutes and 42 seconds
#R>
#R> BSC stat = 0.1346
#R> BSC sd = 0.003794
#R> p-value = 0.99
We don’t find evidence against the null hypothesis that the bias stability condition holds.
To estimate the average treatment effects, we use the didnpreg function. The didnpreg function allows using matrices. The manual explains how to use matrix syntax (type ?didnpreg
).
To speed up the estimation
on computers with multiple cores, use multiplrocessing by setting option
cores
.
Suppress output by setting print.level = 0
. The default value is 1.
# suppress output
tym1a <- didnpreg(
form1,
data = DACAsub,
subset = mysmpl,
bwmethod = "opt",
boot.num = B,
TTx = "TTa",
print.level = 2,
digits = 8,
cores = 16)
#R> Number of Observations is 114453
#R> Number of Unordered Categorical Regressors is 4
#R> Number of Ordered Categorical Regressors is 3
#R>
#R> Bandwidths are chosen via the plug-in method
#R>
#R> Regressor Type Bandwidth
#R> 1 fem factor 1.901226e-04
#R> 2 race factor 4.329600e-05
#R> 3 var.bpl factor 2.897670e-05
#R> 4 state factor 3.629340e-06
#R> 5 age ordered 4.737941e-05
#R> 6 yrimmig ordered 8.688889e-06
#R> 7 ageimmig ordered 1.697394e-05
#R>
#R> Calculating ATET: TTa
#R> TTa = 0.0076709834, N(TTa) = 5387
#R> Calculating ATET completed in 2 seconds
#R>
#R> Bootstrapping standard errors (99 replications)
#R> Calculating residuals completed
#R> Bootstrapping standard errors completed in 1 minute and 46 seconds
#R>
#R> TTa sd = 0.005820258
didnpreg returns a class didnp object. This object contains estimates of the average treatment effects and their standard errors. To see these, we can call the summary function.
# Print the summary of estimation
summary(tym1a)
#R> Number of Observations is 114453
#R> Number of Unordered Categorical Regressors is 4
#R> Number of Ordered Categorical Regressors is 3
#R>
#R> Bandwidths are chosen via the plug-in method
#R>
#R> Regressor Type Bandwidth
#R> 1 fem factor 1.901226e-04
#R> 2 race factor 4.329600e-05
#R> 3 var.bpl factor 2.897670e-05
#R> 4 state factor 3.629340e-06
#R> 5 age ordered 4.737941e-05
#R> 6 yrimmig ordered 8.688889e-06
#R> 7 ageimmig ordered 1.697394e-05
#R>
#R> Bootstrapping standard errors (99 replications) completed in 1 minute and 46 seconds
#R>
#R> Unconditional Treatment Effect on the Treated (ATET):
#R>
#R> TTa = 0.007671
#R> TTa sd = 0.00582
#R> N(TTa) = 5387
rm(tym1a)
Estimating TTb will take longer. The bandwidths is cross-validated.
# Show output as the estimation goes
tym1b <- didnpreg(
form1,
data = DACAsub,
subset = mysmpl,
bwmethod = "CV",
boot.num = B,
TTx = "TTb",
print.level = 2,
digits = 8,
cores = 16)
#R> Number of Observations is 114453
#R> Number of Unordered Categorical Regressors is 4
#R> Number of Ordered Categorical Regressors is 3
#R>
#R> Calculating cross-validated bandwidths
#R> Kernel Type for Unordered Categorical Regressors is Aitchison and Aitken
#R> Kernel Type for Ordered Categorical is Li and Racine
#R> Calculating cross-validated bandwidths completed in 15 seconds
#R>
#R> Regressor Type Bandwidth
#R> 1 fem factor 1.827072e-04
#R> 2 race factor 4.392238e-05
#R> 3 var.bpl factor 2.273964e-05
#R> 4 state factor 1.096748e-04
#R> 5 age ordered 4.768492e-05
#R> 6 yrimmig ordered 4.144541e-05
#R> 7 ageimmig ordered 4.216168e-05
#R>
#R> Calculating ATET: TTb
#R> TTb = 0.02163795, N(TTb) = 56959
#R> Calculating ATET completed in 22 seconds
#R>
#R> Bootstrapping standard errors (99 replications)
#R> Calculating residuals completed
#R> Bootstrapping standard errors completed in 18 minutes and 51 second
#R> TTb sd = 0.0072693185
To plot the heterogenous treatment effects, use the didnpplot command. Define three variables by and over which the treatment effects will be plotted:
DACAsub[tym1b$esample, "race"] -> race
DACAsub[tym1b$esample, "fem"] -> sex
as.numeric(DACAsub[tym1b$esample, "age"]) -> age
Here age
is quasi-continuous. It will have only 4 values (in fact fewer than the Race variable), but we use it to showcase the functionality.
First, use one categorical by
variable:
tym1b_gr_race <- didnpplot(
obj = tym1b,
level = 95,
by = race[tym1b$sample1],
xlab = "Race",
ylab = "ATET",
by.labels.values = data.frame(
old = c(1,2,3,4,5),
new = c("Hispanic", "White", "Black", "Asian", "Other")
))
#R> [1] "1.1 TTa and TTb"
#R> [1] "1.1.1.1 TTb"
tym1b_gr_race$data.a
#R> atet atet.sd count by byold
#R> 1 0.011993717 0.008270328 3602 Hispanic 1
#R> 2 0.011270311 0.004412343 499 White 2
#R> 3 0.022788871 0.007245476 344 Black 3
#R> 4 0.003634539 0.007069134 849 Asian 4
#R> 5 0.048865605 0.008149357 93 Other 5
tym1b_gr_race$data.b
#R> atet atet.sd count by byold
#R> 1 0.030191044 0.010486907 37659 Hispanic 1
#R> 2 0.009465986 0.003469228 6578 White 2
#R> 3 0.036553192 0.006863302 3409 Black 3
#R> 4 -0.011254826 0.007259213 8458 Asian 4
#R> 5 0.004476847 0.007394851 855 Other 5
tym1b_gr_race$plot.a
tym1b_gr_race$plot.b
Here objects data.a
and data.b
contain data that is used to produce plot.a
and plot.b
. The graphs are ggplot
objects and can be amended further.
Another example is the graph with treatment effects by sex. Note the sample object tym1b
is used:
tym1b_gr_sex <- didnpplot(
obj = tym1b,
level = 95,
by = sex[tym1b$sample1],
xlab = "Sex",
ylab = "ATET",
by.labels.values = data.frame(c(1,0), c("Female", "Male"))
)
#R> [1] "1.1 TTa and TTb"
#R> [1] "1.1.1.1 TTb"
tym1b_gr_sex$data.a
#R> atet atet.sd count by byold
#R> 1 0.008272147 0.009344462 2724 Male 0
#R> 2 0.015682133 0.006518702 2663 Female 1
tym1b_gr_sex$data.b
#R> atet atet.sd count by byold
#R> 1 0.037992268 0.011913971 29805 Male 0
#R> 2 0.003686988 0.009116154 27154 Female 1
tym1b_gr_sex$plot.a
tym1b_gr_sex$plot.b
The didnpplot command will recognize if by
is continuous variable, split it into given number of intervals n.intervals
and plot the treatment effects by the split variable.
tym1b_gr_age <- didnpplot(
obj = tym1b,
level = 95,
by = age[tym1b$sample1],
n.intervals = 10,
xlab = "Age"
)
#R> [1] "2. 'by' is a continuous"
#R> [1] "2.1 TTa and TTb"
#R> [1] "2.1.1 single 'by'"
#R> [1] "2.1.1.1 TTb"
#R> d1b:
#R> atet atet.sd count by
#R> 1 -0.011711563 0.008598465 11750 (1,1.4]
#R> 3 -0.008594248 0.005577546 11855 (1.8,2.2]
#R> 5 0.019034473 0.006023782 11479 (2.6,3]
#R> 8 0.033056671 0.009346651 11078 (3.8,4.2]
#R> 10 0.082177748 0.016793647 10797 (4.6,5]
tym1b_gr_age$data.a
#R> atet atet.sd count by
#R> 1 -0.009763996 0.011617326 1059 (1,1.4]
#R> 3 -0.034153071 0.007341389 1100 (1.8,2.2]
#R> 5 -0.002217522 0.007659071 1062 (2.6,3]
#R> 8 0.002805505 0.011864983 1087 (3.8,4.2]
#R> 10 0.103344508 0.017867100 1079 (4.6,5]
tym1b_gr_age$data.b
#R> atet atet.sd count by
#R> 1 -0.011711563 0.008598465 11750 (1,1.4]
#R> 3 -0.008594248 0.005577546 11855 (1.8,2.2]
#R> 5 0.019034473 0.006023782 11479 (2.6,3]
#R> 8 0.033056671 0.009346651 11078 (3.8,4.2]
#R> 10 0.082177748 0.016793647 10797 (4.6,5]
tym1b_gr_age$plot.a
tym1b_gr_age$plot.b
Ameding ggplot object is easy. For example adding a 0 horizontal line is
tym1b_gr_age$plot.a +
geom_hline(yintercept = 0)
Anternatively, one can use the
data.a
anddata.b
objects to plot from scratch.
The treatment effects can be visualized by and over. Variable by
can be both categorical and continuous, while over
must be categorical. For example, to plot treatment effects by age over race, specify the over
option:
tym1b_gr_age_race <- didnpplot(
obj = tym1b,
level = 90,
by = age[tym1b$sample1],
n.intervals = 7,
over = race[tym1b$sample1],
xlab = "Age",
ylab = "ATET",
point_size = 2,
over.labels.values = data.frame(
old = c(1,2,3,4,5),
new = c("Hispanic", "White", "Black", "Asian", "Other")
),
text_size = 15)
#R> [1] "2. 'by' is a continuous"
#R> [1] "2.1 TTa and TTb"
#R> [1] "2.1.2.1 TTb"
#R> [1] "2.1.2.2 TTa"
tym1b_gr_age_race$data.a
#R> atet atet.sd count by over overold
#R> 1 -0.007039513 0.0172339593 697 (1,1.6] Hispanic 1
#R> 2 -0.041935551 0.0107298586 713 (1.6,2.1] Hispanic 1
#R> 3 -0.001365237 0.0107714326 731 (2.7,3.3] Hispanic 1
#R> 4 -0.002540638 0.0171598570 746 (3.9,4.4] Hispanic 1
#R> 5 0.113148615 0.0252891431 715 (4.4,5] Hispanic 1
#R> 6 -0.032891343 0.0073197309 88 (1,1.6] White 2
#R> 7 -0.030346989 0.0119942518 108 (1.6,2.1] White 2
#R> 8 -0.028598696 0.0018182087 96 (2.7,3.3] White 2
#R> 9 -0.021056014 0.0005829334 93 (3.9,4.4] White 2
#R> 10 0.144732303 0.0142190691 114 (4.4,5] White 2
#R> 11 0.012774575 0.0058123900 71 (1,1.6] Black 3
#R> 12 -0.062639092 0.0208847478 87 (1.6,2.1] Black 3
#R> 13 0.021895735 0.0021465991 59 (2.7,3.3] Black 3
#R> 14 0.004154642 0.0088249859 60 (3.9,4.4] Black 3
#R> 15 0.161803746 0.0207738025 67 (4.4,5] Black 3
#R> 16 -0.016768859 0.0150240621 186 (1,1.6] Asian 4
#R> 17 0.001445857 0.0094433920 176 (1.6,2.1] Asian 4
#R> 18 0.004486462 0.0084206127 158 (2.7,3.3] Asian 4
#R> 19 0.022800879 0.0104912133 171 (3.9,4.4] Asian 4
#R> 20 0.008496514 0.0279074621 158 (4.4,5] Asian 4
#R> 21 -0.019239967 0.0095726548 17 (1,1.6] Other 5
#R> 22 0.050267154 0.0035945665 16 (1.6,2.1] Other 5
#R> 23 -0.034014171 0.0002677526 18 (2.7,3.3] Other 5
#R> 24 0.162051767 0.0105645363 17 (3.9,4.4] Other 5
#R> 25 0.076987253 0.0286543029 25 (4.4,5] Other 5
tym1b_gr_age_race$plot.a
tym1b_gr_age_race$data.b
#R> atet atet.sd count by over overold
#R> 1 -0.0178987726 0.013545149 7445 (1,1.6] Hispanic 1
#R> 2 -0.0124467647 0.008118619 7742 (1.6,2.1] Hispanic 1
#R> 3 0.0308134996 0.008716494 7630 (2.7,3.3] Hispanic 1
#R> 4 0.0426301730 0.013282137 7479 (3.9,4.4] Hispanic 1
#R> 5 0.1103688288 0.023482998 7363 (4.4,5] Hispanic 1
#R> 6 -0.0172003123 0.002950034 1441 (1,1.6] White 2
#R> 7 -0.0058337479 0.003972801 1381 (1.6,2.1] White 2
#R> 8 -0.0191948879 0.002797274 1304 (2.7,3.3] White 2
#R> 9 0.0170784015 0.004262117 1264 (3.9,4.4] White 2
#R> 10 0.0829565187 0.008679534 1188 (4.4,5] White 2
#R> 11 0.0135585036 0.007607076 792 (1,1.6] Black 3
#R> 12 0.0110111496 0.011379642 735 (1.6,2.1] Black 3
#R> 13 0.0425241053 0.004037403 667 (2.7,3.3] Black 3
#R> 14 0.0325462907 0.007108259 660 (3.9,4.4] Black 3
#R> 15 0.1007822917 0.015526943 555 (4.4,5] Black 3
#R> 16 0.0042777679 0.005679847 1882 (1,1.6] Asian 4
#R> 17 -0.0037632095 0.005535793 1841 (1.6,2.1] Asian 4
#R> 18 -0.0083594589 0.006229344 1713 (2.7,3.3] Asian 4
#R> 19 -0.0006320413 0.009655409 1528 (3.9,4.4] Asian 4
#R> 20 -0.0542372818 0.019582282 1494 (4.4,5] Asian 4
#R> 21 0.0086424091 0.002637867 190 (1,1.6] Other 5
#R> 22 0.0087777397 0.005509135 156 (1.6,2.1] Other 5
#R> 23 -0.0340846852 0.006744155 165 (2.7,3.3] Other 5
#R> 24 0.0358425120 0.009527052 147 (3.9,4.4] Other 5
#R> 25 0.0059464119 0.018826402 197 (4.4,5] Other 5
tym1b_gr_age_race$plot.b
Note that the graph shows the 90% confidence interval.
Alternatively use the data from the object
tym1b_gr_age_race
to produce another type of graph:
crit.value <- 2
pd <- position_dodge(0.1) # move them .05 to the left and right
d1 <- tym1b_gr_age_race$data.b
d1$Race <- d1$over
ggplot(d1, aes(x = by, y = atet, color = Race, group = Race)) +
geom_errorbar(aes(ymin = atet - crit.value*atet.sd, ymax = atet + crit.value*atet.sd), color = "black", width = .1, position = pd) +
geom_line(position = pd) +
geom_point(position = pd, size = 3, shape = 21, fill = "white") +
xlab("Age") +
ylab("ATET") +
theme_bw() +
theme(legend.position = "right", text = element_text(size = 17))
The next plot shows treatment effects by age over sex:
tym1b_gr_age_sex <- didnpplot(
obj = tym1b,
level = 90,
by = age[tym1b$sample1],
n.intervals = 7,
over = sex[tym1b$sample1],
xlab = "Age",
ylab = "ATET",
over.lab = "Sex",
point_size = 2,
over.labels.values = data.frame(c(1,0), c("Female", "Male")),
text_size = 15)
#R> [1] "2. 'by' is a continuous"
#R> [1] "2.1 TTa and TTb"
#R> [1] "2.1.2.1 TTb"
#R> [1] "2.1.2.2 TTa"
tym1b_gr_age_sex$data.a
#R> atet atet.sd count by over overold
#R> 1 0.0084251925 0.019625551 543 (1,1.6] Male 0
#R> 2 -0.0455180791 0.011082620 559 (1.6,2.1] Male 0
#R> 3 0.0002534017 0.011717491 539 (2.7,3.3] Male 0
#R> 4 -0.0281223862 0.013361511 560 (3.9,4.4] Male 0
#R> 5 0.1128394056 0.025844176 523 (4.4,5] Male 0
#R> 6 -0.0289049436 0.006804722 516 (1,1.6] Female 1
#R> 7 -0.0224099298 0.010199053 541 (1.6,2.1] Female 1
#R> 8 -0.0047640388 0.009112108 523 (2.7,3.3] Female 1
#R> 9 0.0356700572 0.017157482 527 (3.9,4.4] Female 1
#R> 10 0.0944131556 0.024071741 556 (4.4,5] Female 1
tym1b_gr_age_sex$plot.a
tym1b_gr_age_sex$data.b
#R> atet atet.sd count by over overold
#R> 1 0.007001133 0.015882541 6060 (1,1.6] Male 0
#R> 2 -0.003959683 0.009232123 6090 (1.6,2.1] Male 0
#R> 3 0.039319598 0.009712359 5997 (2.7,3.3] Male 0
#R> 4 0.040067645 0.013426449 5950 (3.9,4.4] Male 0
#R> 5 0.112096184 0.024897568 5708 (4.4,5] Male 0
#R> 6 -0.031641077 0.004780360 5690 (1,1.6] Female 1
#R> 7 -0.013490084 0.006143066 5765 (1.6,2.1] Female 1
#R> 8 -0.003156313 0.007524146 5482 (2.7,3.3] Female 1
#R> 9 0.024921864 0.012189276 5128 (3.9,4.4] Female 1
#R> 10 0.048620186 0.026697396 5089 (4.4,5] Female 1
tym1b_gr_age_sex$plot.b
Finally, both by
and over
are both categorical:
tym1b_gr_sex_race <- didnpplot(
obj = tym1b,
level = 95,
by = sex[tym1b$sample1],
over = race[tym1b$sample1],
xlab = "Sex",
ylab = "ATET",
over.lab = "Race",
point_size = 3,
by.labels.values = data.frame(c(1,0), c("Female", "Male")),
over.labels.values = data.frame(
old = c(1,2,3,4,5),
new = c("Hispanic", "White", "Black", "Asian", "Other")
),
text_size = 17)
#R> [1] "1.1 TTa and TTb"
#R> [1] "1.1.2 double 'by'"
tym1b_gr_sex_race$data.a
#R> atet atet.sd count by byold over overold
#R> 1 0.016041723 0.013442096 1827 Male 0 Hispanic 1
#R> 2 0.007827121 0.008996835 1775 Female 1 Hispanic 1
#R> 3 0.039770758 0.005462257 236 Female 1 White 2
#R> 4 -0.014304234 0.006392819 263 Male 0 White 2
#R> 5 0.065237651 0.009809736 168 Female 1 Black 3
#R> 6 -0.017730418 0.009387472 176 Male 0 Black 3
#R> 7 0.009721911 0.009078210 435 Female 1 Asian 4
#R> 8 -0.002761613 0.009593442 414 Male 0 Asian 4
#R> 9 0.028431309 0.003980957 44 Male 0 Other 5
#R> 10 0.067214769 0.014719898 49 Female 1 Other 5
tym1b_gr_sex_race$plot.a
tym1b_gr_sex_race$data.b
#R> atet atet.sd count by byold over overold
#R> 1 0.057086535 0.017358836 19796 Male 0 Hispanic 1
#R> 2 0.000385125 0.013520005 17863 Female 1 Hispanic 1
#R> 3 0.022905856 0.003249127 3153 Female 1 White 2
#R> 4 -0.002906543 0.006026023 3425 Male 0 White 2
#R> 5 0.047840628 0.009338561 1669 Female 1 Black 3
#R> 6 0.025726336 0.009448019 1740 Male 0 Black 3
#R> 7 -0.011550742 0.009671313 4059 Female 1 Asian 4
#R> 8 -0.010981782 0.008903791 4399 Male 0 Asian 4
#R> 9 0.035448152 0.005266577 445 Male 0 Other 5
#R> 10 -0.029138349 0.011503129 410 Female 1 Other 5
tym1b_gr_sex_race$plot.b