Simulates data from a confounded non-linear model in which the non-linear function is a random regression tree. The data generating process is given by: $$Y = f(X) + \delta^T H + \nu$$ $$X = \Gamma^T H + E$$ where \(f(X)\) is a random regression tree with \(m\) random splits of the data, resulting in a random step function with \(K = m + 1\) leaf levels: $$f(x_i) = \sum_{k = 1}^K 1_{\{x_i \in R_k\}} c_k$$ Here \(R_k\) denotes the region of the \(k\)-th leaf and \(c_k\) its level.

\(E\) and \(\nu\) are random error terms, and \(H \in \mathbb{R}^{n \times q}\) is a matrix of random confounding covariates. \(\Gamma \in \mathbb{R}^{q \times p}\) is a random coefficient matrix and \(\delta \in \mathbb{R}^{q}\) is a random coefficient vector.

For the simulation, all of the above parameters are drawn from a standard normal distribution, except for \(\delta\), which is drawn from a normal distribution with standard deviation 10. For each split, a covariate is sampled uniformly at random, and the split point is drawn from a beta distribution (with both shape parameters equal to 2) on the support of the chosen covariate. The leaf levels \(c_k\) are drawn from a uniform distribution between \(cl\) and \(cu\).
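The data generating process described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper `simulate_step_sketch()` is hypothetical, the model is written in matrix form (`X = H %*% Gamma + E`, one observation per row), and each split is assumed to pick a random existing leaf and draw the split point on the range of the chosen covariate within that leaf. The actual function may grow the tree differently.

```r
# Hypothetical helper, not part of the package: a simplified base-R sketch
# of the confounded step-function model.
simulate_step_sketch <- function(q, p, n, m, cl = -50, cu = 50) {
  stopifnot(n > m)
  H     <- matrix(rnorm(n * q), nrow = n, ncol = q)  # confounders
  Gamma <- matrix(rnorm(q * p), nrow = q, ncol = p)  # effect of H on X
  delta <- rnorm(q, sd = 10)                         # effect of H on Y
  E     <- matrix(rnorm(n * p), nrow = n, ncol = p)  # noise in X
  nu    <- rnorm(n)                                  # noise in Y

  X <- H %*% Gamma + E

  # Grow the random tree: every observation starts in leaf 1.
  leaf <- rep(1L, n)
  j <- integer(m)
  for (k in seq_len(m)) {
    # Assumption: split a randomly chosen leaf holding at least 2 observations.
    sizes <- table(leaf)
    candidates <- as.integer(names(sizes)[sizes >= 2])
    parent <- candidates[sample.int(length(candidates), 1)]
    idx <- which(leaf == parent)

    # Sample the splitting covariate uniformly, then a Beta(2, 2) split point
    # rescaled to the range of that covariate within the leaf.
    j[k] <- sample.int(p, 1)
    x <- X[idx, j[k]]
    s <- min(x) + rbeta(1, 2, 2) * (max(x) - min(x))
    leaf[idx[x > s]] <- k + 1L
  }

  # One uniformly drawn level per leaf gives the step function f(X).
  c_k <- runif(m + 1, min = cl, max = cu)
  f_X <- c_k[leaf]
  Y <- f_X + as.vector(H %*% delta) + nu

  list(X = X, Y = Y, f_X = f_X, j = unique(j))
}

set.seed(1)
sim <- simulate_step_sketch(q = 2, p = 5, n = 100, m = 3)
table(sim$f_X)  # one entry per leaf level
```

Because `delta` has standard deviation 10, the confounding term `H %*% delta` can contribute substantially to `Y` relative to the unit-variance noise `nu`, which is what makes naive regression of `Y` on `X` misleading in this setting.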

simulate_data_step(q, p, n, m, make_tree = FALSE, cl = -50, cu = 50)

Arguments

q

number of confounding covariates in H

p

number of covariates in X

n

number of observations

m

number of random splits in the regression tree, each made on a randomly chosen covariate

make_tree

Whether the random regression tree should be returned.

cl

lower limit of the uniform distribution of the step levels

cu

upper limit of the uniform distribution of the step levels

Value

a list containing the simulated data:

X

a matrix of covariates

Y

a vector of responses

f_X

a vector of the true function f(X)

j

the indices of the causal covariates in X

tree

if make_tree = TRUE, the random regression tree, an object of class SDTree

Author

Markus Ulmer

Examples

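library(SDModels)

set.seed(42)
# simulate confounded data and keep the generating tree
sim_data <- simulate_data_step(q = 2, p = 15, n = 100, m = 2, make_tree = TRUE)
X <- sim_data$X
Y <- sim_data$Y

# the returned tree reproduces the true function f(X) on the simulated covariates
all(predict(sim_data$tree, data.frame(X)) == sim_data$f_X)
#> [1] TRUE
# plot the regularization path of the generating tree
plot(regPath(sim_data$tree))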