Estimate high-dimensional additive models using spectral deconfounding (Scheidegger et al. 2025) . The covariates are expanded into B-spline basis functions. A spectral transformation is used to remove bias arising from hidden confounding and a group lasso objective is minimized to enforce component-wise sparsity. Optimal number of basis functions per component and sparsity penalty are chosen by cross validation.
SDAM(
formula = NULL,
data = NULL,
x = NULL,
y = NULL,
Q_type = "trim",
trim_quantile = 0.5,
q_hat = 0,
nfolds = 5,
cv_method = "1se",
n_K = 4,
n_lambda1 = 10,
n_lambda2 = 20,
Q_scale = TRUE,
ind_lin = NULL,
mc.cores = 1,
verbose = TRUE,
notRegularized = NULL
)
Object of class formula
or describing the model to fit
of the form y ~ x1 + x2 + ...
where y
is a numeric response and
x1, x2, ...
are vectors of covariates. Interactions are not supported.
Training data of class data.frame
containing the variables in the model.
Matrix of covariates, alternative to formula
and data
.
Vector of responses, alternative to formula
and data
.
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'.
'trim' corresponds to the Trim transform (Ćevid et al. 2020)
as implemented in the Doubly debiased lasso (Guo et al. 2022)
,
'pca' to the PCA transformation(Paul et al. 2008)
.
See get_Q
.
Quantile for Trim transform,
only needed for trim, see get_Q
.
Assumed confounding dimension, only needed for pca,
see get_Q
.
The number of folds for cross-validation. Default is 5.
The method for selecting the regularization parameter during cross-validation. One of "min" (minimum cv-loss) and "1se" (one-standard-error rule) Default is "1se".
The number of candidate values for the number of basis functions for B-splines. Default is 4.
The number of candidate values for the regularization parameter in the initial cross-validation step. Default is 10.
The number of candidate values for the regularization parameter in the second stage of cross-validation (once the optimal number of basis function K is decided, a second stage of cross-validation for the regularization parameter is performed on a finer grid). Default is 20.
Should data be scaled to estimate the spectral transformation?
Default is TRUE
to not reduce the signal of high variance covariates.
A vector of indices specifying which covariates to model linearly (i.e. not expanded into basis function). Default is `NULL`.
Number of cores to use for parallel processing, if mc.cores > 1
the cross validation is parallelized. Default is `1`. (only supported for unix)
If TRUE
fitting information is shown.
A vector of indices specifying which covariates not to regularize. Default is `NULL`.
An object of class `SDAM` containing the following elements:
The original design matrix.
The number of covariates in `X`.
Names of the covariates in the training data.
The intercept term of the fitted model.
A vector of the number of basis functions for each covariate, where 1 corresponds to a linear term. The entries of the vector will mostly by the same, but some entries might be lower if the corresponding component of X contains only few unique values.
A list of breakpoints used for the B-splines. Used to reconstruct the B-spline basis functions.
A list of coefficients for the B-spline basis functions for each component.
A vector of active covariates that contribute to the model.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152
.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578
.
Scheidegger C, Guo Z, Bühlmann P (2025).
“Spectral Deconfounding for High-Dimensional Sparse Additive Models.”
ACM / IMS J. Data Sci..
doi:10.1145/3711116
.
set.seed(1)
X <- matrix(rnorm(10 * 5), ncol = 5)
Y <- sin(X[, 1]) - X[, 2] + rnorm(10)
model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2, n_K = 1)
#> [1] "Initial cross-validation"
#> [1] "Second stage cross-validation"
# if we know that the first covariate one is relevant, we can also choose to not regularize it
model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2,
n_K = 1, notRegularized = c(1))
#> [1] "Initial cross-validation"
#> [1] "Second stage cross-validation"
# \donttest{
set.seed(22)
library(HDclassif)
#> Loading required package: MASS
data(wine)
names(wine) <- c("class", "alcohol", "malicAcid", "ash", "alcalinityAsh", "magnesium",
"totPhenols", "flavanoids", "nonFlavPhenols", "proanthocyanins",
"colIntens", "hue", "OD", "proline")
wine <- log(wine)
# estimate model
# do not use class in the model and restrict proline to be linear
model <- SDAM(alcohol ~ . - class, wine, ind_lin = "proline")
#> [1] "Initial cross-validation"
#> [1] "Second stage cross-validation"
# extract variable importance
varImp(model)
#> malicAcid ash alcalinityAsh magnesium totPhenols
#> 2.211114e-06 4.626919e-06 0.000000e+00 0.000000e+00 0.000000e+00
#> flavanoids nonFlavPhenols proanthocyanins colIntens hue
#> 1.000657e-06 3.530140e-06 0.000000e+00 3.331735e-04 0.000000e+00
#> OD proline
#> 0.000000e+00 0.000000e+00
# most important variable
mostImp <- names(which.max(varImp(model)))
mostImp
#> [1] "colIntens"
# predict for individual Xj
x <- seq(min(wine[, mostImp]), max(wine[, mostImp]), length.out = 100)
predJ <- predict_individual_fj(object = model, j = mostImp, x = x)
plot(x, predJ,
xlab = paste0("log ", mostImp), ylab = "log alcohol")
# partial dependece
plot(partDependence(model, mostImp))
# predict
predict(model, newdata = wine[42, ])
#> [1] 2.56437
## alternative function call
mod_none <- SDAM(x = as.matrix(wine[1:10, -c(1, 2)]), y = wine$alcohol[1:10],
Q_type = "no_deconfounding", nfolds = 2, n_K = 4,
n_lambda1 = 4, n_lambda2 = 8)
#> [1] "Initial cross-validation"
#> [1] "Second stage cross-validation"
# }