Help for package pECV

Type:

Package

Title:

Entrywise Splitting Cross-Validation for Factor Models

Version:

1.0.1

Description:

Implements entrywise splitting cross-validation (ECV) and its penalized variant (pECV) for selecting the number of factors in generalized factor models.

License:

GPL-3

Encoding:

UTF-8

Language:

en-US

Depends:

R (≥ 3.5.0)

Imports:

stats, Rcpp (≥ 1.0.0), irlba

Suggests:

mirtjml, testthat (≥ 3.0.0)

LinkingTo:

Rcpp, RcppArmadillo

URL:

https://github.com/wangATsu/ECV

BugReports:

https://github.com/wangATsu/ECV/issues

RoxygenNote:

7.3.2

Config/testthat/edition:

ByteCompile:

true

NeedsCompilation:

yes

Packaged:

2025-08-23 02:29:40 UTC; clswt-wangzhijing

Author:

Zhijing Wang [aut, cre]

Maintainer:

Zhijing Wang <wangzhijing@sjtu.edu.cn>

Repository:

CRAN

Date/Publication:

2025-08-28 08:50:07 UTC

Estimate constraint constant C for continuous data

Description

Data-driven estimation of the constraint constant C in alternating maximization algorithm for continuous data using truncated SVD approach. This function decomposes the data matrix and estimates C based on the maximum row norms.

Usage

estimate_C(X, qmax = 8, safety = 1.2)

Arguments

X

n x p continuous data matrix

qmax

Rank for truncated SVD (default 8)

safety

Safety parameter for conservative estimation (default 1.2)

Details

The function performs the following steps: 1. Computes truncated SVD of X with rank qmax 2. Constructs factor matrices A = U * sqrt(D) and B = V * sqrt(D) 3. Calculates row 2-norms for matrices A and B 4. Takes the maximum norm and multiplies by safety parameter

For count data, it is recommended to transform the data using log(X + 1) before applying this function.

Value

A list containing:

qmax

Truncation rank used

safety

Safety parameter applied

C_norm_hat

Original maximum row norm

C_est

Final conservative estimate of C

a_norms

Row norms of factor matrix A

b_norms

Row norms of factor matrix B

Examples

# Example 1: Continuous data
set.seed(123)
n <- 100; p <- 50; q <- 3
theta_true <- matrix(runif(n * q), n, q)
A_true <- matrix(runif(p * q), p, q)
X <- theta_true %*% t(A_true) + matrix(rnorm(n * p, sd = 0.5), n, p)

# Estimate C
C_result <- estimate_C(X, qmax = 5)
print(C_result$C_est)

# Example 2: Count data (apply log transformation)
lambda <- exp(theta_true %*% t(A_true))
X_count <- matrix(rpois(n * p, lambda = as.vector(lambda)), n, p)
X_transformed <- log(X_count + 1)
C_count <- estimate_C(X_transformed, qmax = 5)
print(C_count$C_est)

Estimate constraint constant C for binary data

Description

Data-driven estimation of the constraint constant C for binary data using cross-window smoothing and empirical logit transformation.

Usage

estimate_C_binary(X, qmax = 8, safety = 1.5, eps = 1e-12, radius = 1)

Arguments

X

n x p binary data matrix (0/1 values)

qmax

Rank for truncated SVD (default 8)

safety

Safety parameter for conservative estimation (default 1.5)

eps

Small constant to avoid logit divergence when p=0 or p=1 (default 1e-12)

radius

Radius for cross-window smoothing (default 1)

Details

The function performs the following steps: 1. Applies cross-window smoothing to estimate probabilities 2. Performs empirical logit transformation with smoothing 3. Computes truncated SVD of the transformed matrix 4. Constructs matrices A and B and calculates row norms 5. Estimates C as the maximum norm times safety parameter

The cross-window smoothing helps stabilize probability estimates, especially for sparse binary data.

Value

A list containing:

radius

Cross-window radius used

qmax

Truncation rank used

safety

Safety parameter applied

C0

Original maximum row norm

C_est

Final conservative estimate of C

a_norms

Row norms of factor matrix A

b_norms

Row norms of factor matrix B

Mhat

Logit-transformed matrix

P_smooth

Smoothed probability matrix

N_counts

Count of values in each smoothing window

Generate binary data example

Description

Generate simulated data from a binary (logistic) factor model.

Usage

generate_binary_data(n = 100, p = 50, q = 3)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

Value

A named list with components:

resp: Binary matrix (n x p). Generated 0/1 responses.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix (n x q). True latent factor scores.
A_true: Numeric matrix (p x q). True factor loadings.
d_true: Numeric vector (length p). Item intercepts.

Generate binary data with missing values

Description

Generate simulated data from a binary (logistic) factor model with missing values.

Usage

generate_binary_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

miss_prop

Numeric in (0,1). Proportion of missing values (default 0.05).

Value

A named list with components:

resp: Binary matrix (n x p). Generated 0/1 responses with missing values (NA).
resp_complete: Binary matrix (n x p). Complete data before missingness.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix. True latent factor scores.
A_true: Numeric matrix. True factor loadings.
d_true: Numeric vector (length p). Item intercepts.
miss_prop: Numeric. Proportion of entries set to missing.

Generate continuous data example

Description

Generate simulated data from a Gaussian factor model.

Usage

generate_continuous_data(n = 100, p = 50, q = 3, noise_sd = 1)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

noise_sd

Numeric. Standard deviation of Gaussian noise.

Value

A named list with components:

resp: Numeric matrix (n x p). Generated observed data.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true: Numeric matrix (p x (q+1)). True factor loadings.

Generate continuous data with missing values

Description

Generate simulated data from a Gaussian factor model with missing values.

Usage

generate_continuous_data_miss(
  n = 100,
  p = 50,
  q = 3,
  noise_sd = 1,
  miss_prop = 0.05
)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

noise_sd

Numeric. Standard deviation of Gaussian noise.

miss_prop

Numeric in (0,1). Proportion of missing values (default 0.05).

Value

A named list with components:

resp: Numeric matrix (n x p). Generated data with missing values (NA).
resp_complete: Numeric matrix (n x p). Complete data before missingness.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true: Numeric matrix (p x (q+1)). True factor loadings.
miss_prop: Numeric. Proportion of entries set to missing.

Generate count data example

Description

Generate simulated data from a Poisson factor model.

Usage

generate_count_data(n = 100, p = 50, q = 3)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

Value

A named list with components:

resp: Integer matrix (n x p). Generated Poisson observations.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true: Numeric matrix (p x (q+1)). True factor loadings.

Generate count data with missing values

Description

Generate simulated data from a Poisson factor model with missing values.

Usage

generate_count_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)

Arguments

n

Integer. Number of observations.

p

Integer. Number of variables.

q

Integer. True number of latent factors.

miss_prop

Numeric in (0,1). Proportion of missing values (default 0.05).

Value

A named list with components:

resp: Integer matrix (n x p). Generated data with missing values (NA).
resp_complete: Integer matrix (n x p). Complete data before missingness.
true_q: Integer. True number of factors used in simulation.
theta_true: Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true: Numeric matrix (p x (q+1)). True factor loadings.
miss_prop: Numeric. Proportion of entries set to missing.

Entrywise Splitting Cross-Validation for Factor Models

Description

Uses (Penalized) Entrywise Splitting Cross-Validation (ECV / pECV) to estimate the number of latent factors in generalized factor models.

Usage

pECV(
  resp,
  C = 5,
  qmax = 8,
  fold = 5,
  tol_val = 0.01,
  theta0 = NULL,
  A0 = NULL,
  seed = 1,
  data_type = NULL
)

Arguments

resp

Observation data matrix (n x p); can be continuous, count, or binary.

C

Constraint constant, default is 5.

qmax

Maximum number of factors to consider, default is 8.

fold

Number of folds in cross-validation, default is 5.

tol_val

Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements).

theta0

Optional initial matrix for factors; sampled from Uniform if not provided.

A0

Optional initial matrix for loadings; sampled from Uniform if not provided.

seed

Random seed, default is 1.

data_type

Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected.

Details

The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.

Value

A named list with components:

ECV: Integer. Number of factors selected by standard ECV.
p1ECV: Integer. Number of factors selected by ECV with penalty 1.
p2ECV: Integer. Number of factors selected by ECV with penalty 2.
p3ECV: Integer. Number of factors selected by ECV with penalty 3.
p4ECV: Integer. Number of factors selected by ECV with penalty 4.
ECV_loss: Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
data_type: Character. The detected/used data type: "continuous", "count", or "binary".

The return value has base R types (no special S3/S4 class).

Examples


set.seed(123)
# Generate count data
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
  rpois(length(lambda), lambda = as.vector(lambda)),
  nrow = nrow(lambda), ncol = ncol(lambda)
)
result <- pECV(resp, C = 4, qmax = 4, fold = 5)
print(result)

Entrywise Splitting Cross-Validation with Missing Data

Description

Uses (Penalized) Entrywise Splitting Cross-Validation to estimate the number of latent factors in generalized factor models when the data contain missing values.

Usage

pECV.miss(
  resp,
  C = 5,
  qmax = 8,
  fold = 5,
  tol_val = 0.01,
  theta0 = NULL,
  A0 = NULL,
  seed = 1,
  data_type = NULL
)

Arguments

resp

Observation data matrix (n x p) with missing values as NA; can be continuous, count, or binary.

C

Constraint constant, default is 5.

qmax

Maximum number of factors to consider, default is 8.

fold

Number of folds in cross-validation, default is 5.

tol_val

Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements).

theta0

Optional initial matrix for factors; sampled from Uniform if not provided.

A0

Optional initial matrix for loadings; sampled from Uniform if not provided.

seed

Random seed, default is 1.

data_type

Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected.

Details

The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.

Value

A named list with components:

ECV: Integer. Number of factors selected by standard ECV.
p1ECV: Integer. Number of factors selected by ECV with penalty 1.
p2ECV: Integer. Number of factors selected by ECV with penalty 2.
p3ECV: Integer. Number of factors selected by ECV with penalty 3.
p4ECV: Integer. Number of factors selected by ECV with penalty 4.
ECV_loss: Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
data_type: Character. The detected/used data type: "continuous", "count", or "binary".
miss_percent: Numeric scalar. Percentage of missing entries in resp.

The return value uses base R types (no special S3/S4 class).

Examples


set.seed(123)
# Generate count data with missing values
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
  rpois(length(lambda), lambda = as.vector(lambda)),
  nrow = nrow(lambda), ncol = ncol(lambda)
)
# Introduce 5% missing values
miss_idx <- sample(1:(n * p), size = 0.05 * n * p)
resp[miss_idx] <- NA
result <- pECV.miss(resp, C = 4, qmax = 4, fold = 5)
print(result)