Type: | Package |
Title: | Entrywise Splitting Cross-Validation for Factor Models |
Version: | 1.0.1 |
Description: | Implements entrywise splitting cross-validation (ECV) and its penalized variant (pECV) for selecting the number of factors in generalized factor models. |
License: | GPL-3 |
Encoding: | UTF-8 |
Language: | en-US |
Depends: | R (≥ 3.5.0) |
Imports: | stats, Rcpp (≥ 1.0.0), irlba |
Suggests: | mirtjml, testthat (≥ 3.0.0) |
LinkingTo: | Rcpp, RcppArmadillo |
URL: | https://github.com/wangATsu/ECV |
BugReports: | https://github.com/wangATsu/ECV/issues |
RoxygenNote: | 7.3.2 |
Config/testthat/edition: | 3 |
ByteCompile: | true |
NeedsCompilation: | yes |
Packaged: | 2025-08-23 02:29:40 UTC; clswt-wangzhijing |
Author: | Zhijing Wang [aut, cre] |
Maintainer: | Zhijing Wang <wangzhijing@sjtu.edu.cn> |
Repository: | CRAN |
Date/Publication: | 2025-08-28 08:50:07 UTC |
Estimate constraint constant C for continuous data
Description
Data-driven estimation of the constraint constant C used by the alternating maximization algorithm, for continuous data, based on a truncated SVD. The function decomposes the data matrix and estimates C from the maximum row norm of the resulting factor matrices.
Usage
estimate_C(X, qmax = 8, safety = 1.2)
Arguments
X |
n x p continuous data matrix |
qmax |
Rank for truncated SVD (default 8) |
safety |
Safety parameter for conservative estimation (default 1.2) |
Details
The function performs the following steps:
1. Computes the truncated SVD of X with rank qmax.
2. Constructs the factor matrices A = U * sqrt(D) and B = V * sqrt(D).
3. Calculates the row 2-norms of A and B.
4. Takes the maximum norm and multiplies it by the safety parameter.
For count data, it is recommended to transform the data using log(X + 1) before applying this function.
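The steps listed above can be sketched in base R as follows. This is an illustrative reconstruction, not the package's code: it swaps in base `svd()` for whatever truncated SVD routine the package uses internally (irlba is listed under Imports, but its use here is an assumption), and the variable names simply mirror the documented return values.

```r
# Illustrative sketch of the documented steps; not the package implementation.
set.seed(123)
n <- 100; p <- 50; q <- 3
X <- matrix(runif(n * q), n, q) %*% t(matrix(runif(p * q), p, q)) +
  matrix(rnorm(n * p, sd = 0.5), n, p)

qmax <- 5; safety <- 1.2

# Step 1: SVD truncated to the leading qmax components (base svd() used here)
s <- svd(X, nu = qmax, nv = qmax)
d_sqrt <- sqrt(s$d[seq_len(qmax)])

# Step 2: A = U * sqrt(D), B = V * sqrt(D), i.e. scale each column by sqrt(d_k)
A <- sweep(s$u, 2, d_sqrt, "*")
B <- sweep(s$v, 2, d_sqrt, "*")

# Step 3: row 2-norms of A and B
a_norms <- sqrt(rowSums(A^2))
b_norms <- sqrt(rowSums(B^2))

# Step 4: largest row norm, inflated by the safety parameter
C_norm_hat <- max(a_norms, b_norms)
C_est <- safety * C_norm_hat
```

With this scaling, `A %*% t(B)` equals the rank-qmax approximation of X, so the row norms of A and B bound the size of the per-row and per-column parameters in that approximation.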
Value
A list containing:
qmax |
Truncation rank used |
safety |
Safety parameter applied |
C_norm_hat |
Maximum row norm before the safety parameter is applied |
C_est |
Final conservative estimate of C |
a_norms |
Row norms of factor matrix A |
b_norms |
Row norms of factor matrix B |
Examples
# Example 1: Continuous data
set.seed(123)
n <- 100; p <- 50; q <- 3
theta_true <- matrix(runif(n * q), n, q)
A_true <- matrix(runif(p * q), p, q)
X <- theta_true %*% t(A_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
# Estimate C
C_result <- estimate_C(X, qmax = 5)
print(C_result$C_est)
# Example 2: Count data (apply log transformation)
lambda <- exp(theta_true %*% t(A_true))
X_count <- matrix(rpois(n * p, lambda = as.vector(lambda)), n, p)
X_transformed <- log(X_count + 1)
C_count <- estimate_C(X_transformed, qmax = 5)
print(C_count$C_est)
Estimate constraint constant C for binary data
Description
Data-driven estimation of the constraint constant C for binary data using cross-window smoothing and empirical logit transformation.
Usage
estimate_C_binary(X, qmax = 8, safety = 1.5, eps = 1e-12, radius = 1)
Arguments
X |
n x p binary data matrix (0/1 values) |
qmax |
Rank for truncated SVD (default 8) |
safety |
Safety parameter for conservative estimation (default 1.5) |
eps |
Small constant to avoid logit divergence when an estimated probability equals 0 or 1 (default 1e-12) |
radius |
Radius for cross-window smoothing (default 1) |
Details
The function performs the following steps:
1. Applies cross-window smoothing to estimate probabilities.
2. Performs an empirical logit transformation with smoothing.
3. Computes the truncated SVD of the transformed matrix.
4. Constructs the matrices A and B and calculates their row norms.
5. Estimates C as the maximum norm times the safety parameter.
The cross-window smoothing helps stabilize probability estimates, especially for sparse binary data.
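The first two steps can be pictured with the base R sketch below. The exact window used by the package is not specified here, so the sketch assumes a plus-shaped ("cross") window: for entry (i, j) it pools the entries in column j within `radius` rows of i and the entries in row i within `radius` columns of j. The helper `cross_smooth()` is hypothetical and is not part of the package.

```r
# Hypothetical helper (an assumption about the window shape, not the package code):
# average a 0/1 matrix over a plus-shaped window centred on each entry.
cross_smooth <- function(X, radius = 1) {
  n <- nrow(X); p <- ncol(X)
  S <- matrix(0, n, p)  # sum of the values in each window
  N <- matrix(0, n, p)  # number of entries in each window
  for (i in seq_len(n)) {
    rows <- max(1, i - radius):min(n, i + radius)
    for (j in seq_len(p)) {
      cols <- max(1, j - radius):min(p, j + radius)
      # vertical arm plus horizontal arm; the centre entry is counted once
      S[i, j] <- sum(X[rows, j]) + sum(X[i, cols]) - X[i, j]
      N[i, j] <- length(rows) + length(cols) - 1
    }
  }
  list(P = S / N, N = N)
}

set.seed(1)
X <- matrix(rbinom(20 * 10, 1, 0.3), 20, 10)
sm <- cross_smooth(X, radius = 1)

# Empirical logit: clip probabilities away from 0 and 1 before taking the logit
eps <- 1e-12
P <- pmin(pmax(sm$P, eps), 1 - eps)
M <- log(P / (1 - P))
```

The clipping step is what keeps the logit finite for windows that contain only zeros or only ones, which is common in sparse binary data.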
Value
A list containing:
radius |
Cross-window radius used |
qmax |
Truncation rank used |
safety |
Safety parameter applied |
C0 |
Maximum row norm before the safety parameter is applied |
C_est |
Final conservative estimate of C |
a_norms |
Row norms of factor matrix A |
b_norms |
Row norms of factor matrix B |
Mhat |
Logit-transformed matrix |
P_smooth |
Smoothed probability matrix |
N_counts |
Count of values in each smoothing window |
Generate binary data example
Description
Generate simulated data from a binary (logistic) factor model.
Usage
generate_binary_data(n = 100, p = 50, q = 3)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
Value
A named list with components:
- resp
Binary matrix (n x p). Generated 0/1 responses.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix (n x q). True latent factor scores.
- A_true
Numeric matrix (p x q). True factor loadings.
- d_true
Numeric vector (length p). Item intercepts.
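For orientation, a logistic factor model with the documented component shapes can be simulated in base R roughly as follows. The sampling distributions of the scores, loadings, and intercepts below are assumptions chosen for illustration; the ones used by generate_binary_data are not stated here.

```r
# Illustrative simulation of a logistic factor model; distributions are assumed.
set.seed(1)
n <- 100; p <- 50; q <- 3

theta <- matrix(runif(n * q, -1, 1), n, q)  # latent factor scores (n x q)
A     <- matrix(runif(p * q, -1, 1), p, q)  # factor loadings (p x q)
d     <- runif(p, -0.5, 0.5)                # item intercepts (length p)

# Linear predictor: theta %*% t(A), with intercept d[j] added to column j
eta  <- sweep(theta %*% t(A), 2, d, "+")
prob <- plogis(eta)                         # logistic link

# One Bernoulli draw per entry
resp <- matrix(rbinom(n * p, 1, as.vector(prob)), n, p)
```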
Generate binary data with missing values
Description
Generate simulated data from a binary (logistic) factor model with missing values.
Usage
generate_binary_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
miss_prop |
Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
- resp
Binary matrix (n x p). Generated 0/1 responses with missing values (NA).
- resp_complete
Binary matrix (n x p). Complete data before missingness.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix. True latent factor scores.
- A_true
Numeric matrix. True factor loadings.
- d_true
Numeric vector (length p). Item intercepts.
- miss_prop
Numeric. Proportion of entries set to missing.
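The relationship between resp_complete, resp, and miss_prop can be illustrated with a short base R sketch. It assumes entries are removed completely at random and that the number removed is the rounded product of miss_prop and the matrix size; how the package actually selects the missing entries is not documented here.

```r
# Illustrative masking step; the missingness mechanism is an assumption.
set.seed(1)
resp_complete <- matrix(rbinom(100 * 50, 1, 0.5), 100, 50)
miss_prop <- 0.05

# Choose which entries to hide, uniformly at random over all n * p positions
n_miss   <- round(miss_prop * length(resp_complete))
miss_idx <- sample(length(resp_complete), n_miss)

resp <- resp_complete
resp[miss_idx] <- NA

mean(is.na(resp))  # realized proportion of missing entries
```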
Generate continuous data example
Description
Generate simulated data from a Gaussian factor model.
Usage
generate_continuous_data(n = 100, p = 50, q = 3, noise_sd = 1)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
noise_sd |
Numeric. Standard deviation of Gaussian noise. |
Value
A named list with components:
- resp
Numeric matrix (n x p). Generated observed data.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- A_true
Numeric matrix (p x (q+1)). True factor loadings.
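The (q+1)-column shapes above are consistent with an intercept absorbed into the factor matrix. A minimal base R sketch, assuming the intercept is carried as a leading column of ones in the scores (as in the pECV example) and assuming uniform draws for scores and loadings:

```r
# Illustrative Gaussian factor model with an intercept column; distributions assumed.
set.seed(1)
n <- 100; p <- 50; q <- 3; noise_sd <- 1

theta <- cbind(1, matrix(runif(n * q, -2, 2), n, q))  # n x (q+1), first column = 1
A     <- matrix(runif(p * (q + 1), -2, 2), p, q + 1)  # p x (q+1), first column = intercepts

# Signal plus independent Gaussian noise with standard deviation noise_sd
resp <- theta %*% t(A) + matrix(rnorm(n * p, sd = noise_sd), n, p)
```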
Generate continuous data with missing values
Description
Generate simulated data from a Gaussian factor model with missing values.
Usage
generate_continuous_data_miss(
n = 100,
p = 50,
q = 3,
noise_sd = 1,
miss_prop = 0.05
)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
noise_sd |
Numeric. Standard deviation of Gaussian noise. |
miss_prop |
Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
- resp
Numeric matrix (n x p). Generated data with missing values (NA).
- resp_complete
Numeric matrix (n x p). Complete data before missingness.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- A_true
Numeric matrix (p x (q+1)). True factor loadings.
- miss_prop
Numeric. Proportion of entries set to missing.
Generate count data example
Description
Generate simulated data from a Poisson factor model.
Usage
generate_count_data(n = 100, p = 50, q = 3)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
Value
A named list with components:
- resp
Integer matrix (n x p). Generated Poisson observations.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- A_true
Numeric matrix (p x (q+1)). True factor loadings.
Generate count data with missing values
Description
Generate simulated data from a Poisson factor model with missing values.
Usage
generate_count_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
n |
Integer. Number of observations. |
p |
Integer. Number of variables. |
q |
Integer. True number of latent factors. |
miss_prop |
Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
- resp
Integer matrix (n x p). Generated data with missing values (NA).
- resp_complete
Integer matrix (n x p). Complete data before missingness.
- true_q
Integer. True number of factors used in simulation.
- theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
- A_true
Numeric matrix (p x (q+1)). True factor loadings.
- miss_prop
Numeric. Proportion of entries set to missing.
Entrywise Splitting Cross-Validation for Factor Models
Description
Uses (Penalized) Entrywise Splitting Cross-Validation (ECV / pECV) to estimate the number of latent factors in generalized factor models.
Usage
pECV(
resp,
C = 5,
qmax = 8,
fold = 5,
tol_val = 0.01,
theta0 = NULL,
A0 = NULL,
seed = 1,
data_type = NULL
)
Arguments
resp |
Observation data matrix (n x p); can be continuous, count, or binary. |
C |
Constraint constant, default is 5. |
qmax |
Maximum number of factors to consider, default is 8. |
fold |
Number of folds in cross-validation, default is 5. |
tol_val |
Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
theta0 |
Optional initial matrix for factors; sampled from a uniform distribution if not provided. |
A0 |
Optional initial matrix for loadings; sampled from a uniform distribution if not provided. |
seed |
Random seed, default is 1. |
data_type |
Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
Value
A named list with components:
- ECV
Integer. Number of factors selected by standard ECV.
- p1ECV
Integer. Number of factors selected by ECV with penalty 1.
- p2ECV
Integer. Number of factors selected by ECV with penalty 2.
- p3ECV
Integer. Number of factors selected by ECV with penalty 3.
- p4ECV
Integer. Number of factors selected by ECV with penalty 4.
- ECV_loss
Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- data_type
Character. The detected/used data type: "continuous", "count", or "binary".
The return value has base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
rpois(length(lambda), lambda = as.vector(lambda)),
nrow = nrow(lambda), ncol = ncol(lambda)
)
result <- pECV(resp, C = 4, qmax = 4, fold = 5)
print(result)
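To make the idea of entrywise splitting concrete, the base R sketch below assigns individual matrix entries, rather than whole rows or columns, to folds. It assumes a balanced random assignment and shows only the splitting step; the fold construction, model fitting, and penalties inside pECV are not reproduced and may differ.

```r
# Illustrative entrywise fold assignment; not the package's internal code.
set.seed(1)
n <- 50; p <- 50; fold <- 5
resp <- matrix(rpois(n * p, lambda = 3), n, p)

# Each of the n * p entries gets a fold label 1..fold (balanced, shuffled)
fold_id <- matrix(sample(rep_len(seq_len(fold), n * p)), n, p)

# For fold k, held-out entries are hidden from fitting and used for evaluation
k <- 1
train <- resp
train[fold_id == k] <- NA
test_idx <- which(fold_id == k)
```

Because every row and column keeps most of its entries in each training split, factor scores and loadings remain estimable for all rows and columns while the held-out entries provide the validation loss.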
Entrywise Splitting Cross-Validation with Missing Data
Description
Uses (Penalized) Entrywise Splitting Cross-Validation to estimate the number of latent factors in generalized factor models when the data contain missing values.
Usage
pECV.miss(
resp,
C = 5,
qmax = 8,
fold = 5,
tol_val = 0.01,
theta0 = NULL,
A0 = NULL,
seed = 1,
data_type = NULL
)
Arguments
resp |
Observation data matrix (n x p) with missing values as NA |
C |
Constraint constant, default is 5. |
qmax |
Maximum number of factors to consider, default is 8. |
fold |
Number of folds in cross-validation, default is 5. |
tol_val |
Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
theta0 |
Optional initial matrix for factors; sampled from a uniform distribution if not provided. |
A0 |
Optional initial matrix for loadings; sampled from a uniform distribution if not provided. |
seed |
Random seed, default is 1. |
data_type |
Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
Value
A named list with components:
- ECV
Integer. Number of factors selected by standard ECV.
- p1ECV
Integer. Number of factors selected by ECV with penalty 1.
- p2ECV
Integer. Number of factors selected by ECV with penalty 2.
- p3ECV
Integer. Number of factors selected by ECV with penalty 3.
- p4ECV
Integer. Number of factors selected by ECV with penalty 4.
- ECV_loss
Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
- data_type
Character. The detected/used data type: "continuous", "count", or "binary".
- miss_percent
Numeric scalar. Percentage of missing entries in resp.
The return value uses base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data with missing values
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
rpois(length(lambda), lambda = as.vector(lambda)),
nrow = nrow(lambda), ncol = ncol(lambda)
)
# Introduce 5% missing values
miss_idx <- sample(1:(n * p), size = 0.05 * n * p)
resp[miss_idx] <- NA
result <- pECV.miss(resp, C = 4, qmax = 4, fold = 5)
print(result)
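The miss_percent component can be cross-checked directly from the input matrix. The base R sketch below also shows one plausible way to restrict entrywise folds to observed entries; that restriction is an assumption about how missing data would be handled, not a description of pECV.miss internals.

```r
# Illustrative check of the missing percentage and an observed-only fold assignment.
set.seed(123)
n <- 50; p <- 50; fold <- 5
resp <- matrix(rpois(n * p, lambda = 3), n, p)
resp[sample(n * p, size = round(0.05 * n * p))] <- NA

# Percentage of missing entries in resp
miss_percent <- 100 * mean(is.na(resp))

# Assumption: only observed entries are assigned to folds; NA entries stay unassigned
obs_idx <- which(!is.na(resp))
fold_id <- matrix(NA_integer_, n, p)
fold_id[obs_idx] <- sample(rep_len(seq_len(fold), length(obs_idx)))
```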