Simulate data with linear confounding and causal effect following a step-function

Simulates data from a confounded non-linear model in which the non-linear function is a random regression tree. The data-generating process is given by:

$$Y = f(X) + \delta^T H + \nu$$
$$X = \Gamma^T H + E$$

where $f(X)$ is a random regression tree with $m$ random splits of the data, resulting in a random step function with $m+1$ levels (the leaf levels):

$$f(x_i) = \sum_{k = 1}^K 1_{\{x_i \in R_k\}} c_k$$

Here $R_k$ are the $K = m+1$ leaf regions and $c_k$ the corresponding leaf levels. $E$ and $\nu$ are random error terms, $H \in \mathbb{R}^{n \times q}$ is a matrix of random confounding covariates, and $\Gamma \in \mathbb{R}^{q \times p}$ and $\delta \in \mathbb{R}^{q}$ are random coefficients.

For the simulation, all of the above parameters are drawn from a standard normal distribution, except for $\delta$, which is drawn from a normal distribution with standard deviation 10. For each split, a covariate is sampled uniformly at random and split at a random point drawn from a beta distribution (both shape parameters equal to 2) scaled to the support of the chosen covariate. The leaf levels $c_k$ are drawn from a uniform distribution between $cl$ and $cu$.
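
The data-generating process described above can be sketched in a few lines of base R. This is an illustrative re-implementation under stated assumptions, not the package's internal code: the variable names are hypothetical, the equations are written in matrix form ($X = H\Gamma + E$, $Y = f(X) + H\delta + \nu$), each split is assumed to be applied to a randomly chosen existing leaf, and the split point is scaled to the range of the chosen covariate within that leaf.

```r
# Illustrative sketch only; names and the leaf-selection rule are assumptions.
set.seed(1)
n <- 100; p <- 5; q <- 2; m <- 2
cl <- -50; cu <- 50

# Confounders H, their effect on X (Gamma) and on Y (delta, sd = 10)
H     <- matrix(rnorm(n * q), nrow = n)
Gamma <- matrix(rnorm(q * p), nrow = q)
delta <- rnorm(q, sd = 10)

# Covariates in matrix form: X = H Gamma + E
E <- matrix(rnorm(n * p), nrow = n)
X <- H %*% Gamma + E

# Random regression tree: start with one region, then split m times
leaf <- rep(1L, n)
for (k in seq_len(m)) {
  # assumption: pick a random existing leaf that can still be split
  candidates <- which(tabulate(leaf) >= 2)
  l   <- candidates[sample.int(length(candidates), 1)]
  idx <- which(leaf == l)

  # covariate sampled uniformly, split point from Beta(2, 2) on its range
  j   <- sample.int(p, 1)
  rng <- range(X[idx, j])
  s   <- rng[1] + rbeta(1, 2, 2) * diff(rng)

  # observations above the split point form the new leaf k + 1
  leaf[idx[X[idx, j] > s]] <- k + 1L
}

# Leaf levels c_k ~ Uniform(cl, cu) give the step function f(X)
c_k <- runif(m + 1, min = cl, max = cu)
f_X <- c_k[leaf]

# Response: Y = f(X) + H delta + nu
nu <- rnorm(n)
Y  <- f_X + as.vector(H %*% delta) + nu
```

With $m$ splits the sketch produces $m + 1$ leaves, so `f_X` takes at most $m + 1$ distinct values, all between `cl` and `cu`.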

simulate_data_step(q, p, n, m, make_tree = FALSE, cl = -50, cu = 50)

Arguments

q: number of confounding covariates in H
p: number of covariates in X
n: number of observations
m: number of splits done using a random covariate
make_tree: Whether the random regression tree should be returned.
cl: lower limit of the uniform distribution of the step levels
cu: upper limit of the uniform distribution of the step levels

Value

a list containing the simulated data:

X: a matrix of covariates
Y: a vector of responses
f_X: a vector of the true function f(X)
j: the indices of the causal covariates in X
tree: if make_tree = TRUE, the random regression tree, an object of class SDTree

Author

Markus Ulmer

Examples

set.seed(42)
# simulation of confounded data
sim_data <- simulate_data_step(q = 2, p = 15, n = 100, m = 2, make_tree = TRUE)
X <- sim_data$X
Y <- sim_data$Y

# the returned tree reproduces the true function f(X) on the simulated covariates
all(predict(sim_data$tree, data.frame(X)) == sim_data$f_X)
#> [1] TRUE
plot(regPath(sim_data$tree))