| Title: | Stepwise Clustered Ensemble | 
| Version: | 1.1.2 | 
| Description: | Implementation of Stepwise Clustered Ensemble (SCE) and Stepwise Cluster Analysis (SCA) for multivariate data analysis. The package provides comprehensive tools for feature selection, model training, prediction, and evaluation in hydrological and environmental modeling applications. Key functionalities include recursive feature elimination (RFE), Wilks feature importance analysis, model validation through out-of-bag (OOB) validation, and ensemble prediction capabilities. The package supports both single and multivariate response variables, making it suitable for complex environmental modeling scenarios. For more details see Li et al. (2021) <doi:10.5194/hess-25-4947-2021>. | 
| URL: | https://doi.org/10.5194/hess-25-4947-2021 | 
| License: | GPL-3 | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.2.3 | 
| Depends: | R (≥ 3.5.0) | 
| Imports: | stats (≥ 3.5.0), utils (≥ 3.5.0) | 
| Suggests: | testthat (≥ 3.0.0), knitr, rmarkdown | 
| NeedsCompilation: | no | 
| Packaged: | 2025-10-04 21:49:51 UTC; lkl98 | 
| Author: | Kailong Li [aut, cre] | 
| Maintainer: | Kailong Li <lkl98509509@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-04 22:10:02 UTC | 
Air Quality Dataset
Description
These datasets contain air quality measurements for training and testing purposes. They include various air pollutant concentrations and meteorological variables measured at different locations and times.
Usage
data("Air_quality_training")
data("Air_quality_testing")
Format
Both datasets are data frames with 8760 rows and 12 variables:
- Date
- Date and time of measurement (POSIXct format) 
- PM2.5
- Particulate matter with diameter less than 2.5 micrometers (μg/m^3) 
- PM10
- Particulate matter with diameter less than 10 micrometers (μg/m^3) 
- SO2
- Sulfur dioxide concentration (μg/m^3) 
- NO2
- Nitrogen dioxide concentration (μg/m^3) 
- CO
- Carbon monoxide concentration (μg/m^3) 
- O3
- Ozone concentration (μg/m^3) 
- TEMP
- Temperature (°C) 
- PRES
- Atmospheric pressure (hPa) 
- DEWP
- Dew point temperature (°C) 
- RAIN
- Precipitation amount (mm) 
- WSPM
- Wind speed (m/s) 
Details
Dataset Differences:
-  Air_quality_training: Used for training SCA and SCE models
-  Air_quality_testing: Used for testing trained models
Variable Descriptions:
-  PM2.5, PM10: Particulate matter concentrations, important indicators of air quality 
-  SO2, NO2, CO, O3: Major air pollutants regulated by environmental agencies 
-  TEMP, PRES, DEWP: Meteorological variables affecting air quality 
-  RAIN, WSPM: Weather conditions that influence pollutant dispersion 
Source
Air quality monitoring stations
Plot Recursive Feature Elimination Results
Description
Plots the out-of-bag (OOB) validation R2 and the testing R2 against the number of predictors retained during recursive feature elimination.
Usage
Plot_RFE(rfe_result, 
         main = "OOB Validation and Testing R2 vs Number of Predictors", 
         col_validation = "blue", 
         col_testing = "red", 
         pch = 16, 
         lwd = 2, 
         cex = 1.2, 
         legend_pos = "bottomleft", 
         ...)
Arguments
| rfe_result | Result object from RFE_SCE function | 
| main | Plot title | 
| col_validation | Color for validation line | 
| col_testing | Color for testing line | 
| pch | Point character | 
| lwd | Line width | 
| cex | Point size | 
| legend_pos | Legend position | 
| ... | Additional arguments | 
Value
Plot showing validation and testing R2 vs number of predictors.
See Also
Recursive Feature Elimination for SCE Models
Description
Performs recursive feature elimination (RFE) for SCE models to identify the most important predictors. Each iteration removes step predictors.
Usage
RFE_SCE(Training_data, Testing_data, Predictors, Predictant, Nmin, Ntree, 
        alpha = 0.05, resolution = 1000, step = 1, verbose = TRUE, 
        parallel = TRUE)
Arguments
| Training_data | Training dataset | 
| Testing_data | Testing dataset | 
| Predictors | Character vector of predictor names | 
| Predictant | Character vector of predictant names | 
| Nmin | Minimum samples per node | 
| Ntree | Number of trees | 
| alpha | Significance level (default: 0.05) | 
| resolution | Resolution for splitting (default: 1000) | 
| step | Number of predictors to remove per iteration (default: 1) | 
| verbose | Print progress (default: TRUE) | 
| parallel | Use parallel processing (default: TRUE) | 
Value
RFE results with performance metrics and importance scores.
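The general idea of recursive feature elimination can be sketched in base R. This is a minimal illustration, not the RFE_SCE implementation: it assumes an ordinary linear model (lm) as a stand-in learner, absolute t statistics as a stand-in importance score, and a held-out validation set in place of OOB validation. The function name rfe_sketch and the toy data are hypothetical.

```r
# Generic recursive feature elimination: fit, score, drop the weakest, repeat.
rfe_sketch <- function(train, valid, predictors, response, step = 1) {
  r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  current <- predictors
  history <- list()
  repeat {
    fit <- lm(reformulate(current, response = response), data = train)
    history[[length(history) + 1]] <- data.frame(
      n_predictors  = length(current),
      validation_r2 = r2(valid[[response]], predict(fit, newdata = valid))
    )
    if (length(current) <= step) break
    # Stand-in importance score: absolute t statistic of each slope
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
    t_abs <- abs(coefs[, "t value"])
    weakest <- names(sort(t_abs))[seq_len(step)]
    current <- setdiff(current, weakest)
  }
  do.call(rbind, history)
}

# Toy data: y depends on x1 and x2 only; x3 and x4 are pure noise
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 3 * toy$x1 + 1.5 * toy$x2 + rnorm(n, sd = 0.5)
train <- toy[1:150, ]
valid <- toy[151:200, ]

result <- rfe_sketch(train, valid, c("x1", "x2", "x3", "x4"), "y", step = 1)
print(result)
```

The returned data frame has one row per iteration, which is the kind of table a plot of R2 against the number of predictors would be drawn from.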
See Also
Stepwise Cluster Analysis (SCA)
Description
Builds a single Stepwise Cluster Analysis (SCA) tree model that recursively partitions the data space based on Wilks' Lambda statistic.
Usage
SCA(Training_data, X, Y, Nmin, alpha = 0.05, resolution = 1000, verbose = FALSE)
Arguments
| Training_data | A data.frame containing the training data | 
| X | Character vector of predictor variable names | 
| Y | Character vector of predictant variable names | 
| Nmin | Minimum number of samples in a leaf node | 
| alpha | Significance level for clustering (default: 0.05) | 
| resolution | Resolution for splitting (default: 1000) | 
| verbose | Print progress information (default: FALSE) | 
Value
An S3 object of class "SCA" containing the tree model.
See Also
SCE, predict, importance, evaluate
Examples
  # Load example data
  data(Streamflow_training_10var)
  data(Streamflow_testing_10var)
  
  # Define variables
  Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
  Predictants <- c("Flow")
  
  # Build SCA model
  sca_model <- SCA(
    Training_data = Streamflow_training_10var,
    X = Predictors,
    Y = Predictants,
    Nmin = 5,
    alpha = 0.05,
    resolution = 1000
  )
  
  # Use S3 methods
  print(sca_model)
  summary(sca_model)
  sca_predictions <- predict(sca_model, Streamflow_testing_10var)
  sca_importance <- importance(sca_model)
  sca_evaluation <- evaluate(sca_model, Streamflow_testing_10var, Streamflow_training_10var)
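The Wilks' Lambda statistic mentioned in the description can be illustrated with base R. The sketch below uses the textbook definition, Lambda = det(W) / det(T), where W is the pooled within-group sum of squares and cross-products matrix and T is the total one, together with the exact F test for two groups. Whether SCA uses exactly this test internally is an assumption; the helper names wilks_lambda and wilks_f_test are hypothetical.

```r
# Wilks' Lambda for splitting the rows of a response matrix into two groups.
# Values near 0 mean the two groups differ strongly; values near 1 mean they do not.
wilks_lambda <- function(Y, in_left) {
  Y <- as.matrix(Y)
  # Sums of squares and cross-products about the column means
  sscp <- function(M) crossprod(sweep(M, 2, colMeans(M)))
  W <- sscp(Y[in_left, , drop = FALSE]) + sscp(Y[!in_left, , drop = FALSE])
  det(W) / det(sscp(Y))
}

# Exact F test for the two-group case, with d response variables and n rows
wilks_f_test <- function(Y, in_left) {
  Y <- as.matrix(Y)
  n <- nrow(Y)
  d <- ncol(Y)
  lambda <- wilks_lambda(Y, in_left)
  f_stat <- (1 - lambda) / lambda * (n - d - 1) / d
  list(lambda = lambda,
       F = f_stat,
       p_value = pf(f_stat, d, n - d - 1, lower.tail = FALSE))
}

# Toy data: the response jumps when the predictor x crosses 0.5
set.seed(1)
x <- runif(60)
y <- cbind(Flow = ifelse(x > 0.5, 10, 2) + rnorm(60))

res <- wilks_f_test(y, in_left = x <= 0.5)
res$p_value < 0.05   # a cut at x = 0.5 would be accepted at alpha = 0.05
```

For a single response variable, Lambda reduces to the ratio of within-group to total sum of squares, so the F statistic matches a one-way analysis of variance.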
Stepwise Clustered Ensemble (SCE)
Description
Builds a Stepwise Clustered Ensemble (SCE) model, which is an ensemble of SCA trees built using bootstrap samples and random feature selection, providing improved prediction accuracy and robustness.
Usage
SCE(Training_data, X, Y, mfeature, Nmin, Ntree, alpha = 0.05, 
    resolution = 1000, verbose = FALSE, parallel = TRUE)
Arguments
| Training_data | A data.frame containing the training data | 
| X | Character vector of predictor variable names | 
| Y | Character vector of predictant variable names | 
| mfeature | Number of features to randomly select for each tree | 
| Nmin | Minimum number of samples in a leaf node | 
| Ntree | Number of trees in the ensemble | 
| alpha | Significance level for clustering (default: 0.05) | 
| resolution | Resolution for splitting (default: 1000) | 
| verbose | Print progress information (default: FALSE) | 
| parallel | Use parallel processing (default: TRUE) | 
Value
An S3 object of class "SCE" containing the ensemble model.
See Also
SCA, predict, importance, evaluate
Examples
  # Load example data
  data(Streamflow_training_10var)
  data(Streamflow_testing_10var)
  
  # Define variables
  Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
  Predictants <- c("Flow")
  
  # Build SCE model
  sce_model <- SCE(
    Training_data = Streamflow_training_10var,
    X = Predictors,
    Y = Predictants,
    mfeature = round(0.5 * length(Predictors)),
    Nmin = 5,
    Ntree = 48,
    alpha = 0.05,
    resolution = 1000,
    parallel = FALSE
  )
  
  # Use S3 methods
  print(sce_model)
  summary(sce_model)
  sce_predictions <- predict(sce_model, Streamflow_testing_10var)
  sce_importance <- importance(sce_model)
  sce_evaluation <- evaluate(sce_model, Streamflow_testing_10var, Streamflow_training_10var)
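The two sources of randomness named in the description, bootstrap samples and random feature selection, can be sketched in base R. This is an illustration under stated assumptions and not the package's internal code: the helper draw_tree_sample and the row count n are hypothetical, and the rows left out of each bootstrap sample are shown because they are what out-of-bag (OOB) validation relies on.

```r
set.seed(42)
n <- 100   # number of training rows (toy value)
predictors <- c("Prcp", "SRad", "Tmax", "Tmin", "VP",
                "smlt", "swvl1", "swvl2", "swvl3", "swvl4")
mfeature <- round(0.5 * length(predictors))
Ntree <- 5

# Draw the rows and predictors that one tree of the ensemble would see
draw_tree_sample <- function(n, predictors, mfeature) {
  in_bag <- sample.int(n, size = n, replace = TRUE)   # bootstrap rows
  list(
    in_bag   = in_bag,
    oob      = setdiff(seq_len(n), in_bag),           # rows never drawn
    features = sample(predictors, mfeature)           # random predictor subset
  )
}

samples <- lapply(seq_len(Ntree),
                  function(i) draw_tree_sample(n, predictors, mfeature))

# Share of rows left out of each bootstrap sample
oob_fraction <- vapply(samples, function(s) length(s$oob) / n, numeric(1))
print(oob_fraction)
```

On average a bootstrap sample leaves out a little over a third of the rows, so each tree has its own validation subset that played no part in building it.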
Streamflow Dataset
Description
These datasets contain streamflow and related environmental variables for training and testing purposes. They are used in examples to demonstrate the SCE package functionality with different levels of complexity.
Usage
data("Streamflow_training_10var")
data("Streamflow_training_22var")
data("Streamflow_testing_10var")
data("Streamflow_testing_22var")
Format
Streamflow_training_10var: Basic environmental variables (12 columns):
- Date
- Date and time of measurement 
- Prcp
- Monthly mean daily precipitation (mm) 
- SRad
- Monthly mean daily solar radiation (W/m^2) 
- Tmax
- Monthly mean daily maximum temperature (°C) 
- Tmin
- Monthly mean daily minimum temperature (°C) 
- VP
- Monthly mean daily vapor pressure (Pa) 
- smlt
- Monthly snowmelt (m) 
- swvl1
- Soil water content layer 1 (m^3/m^3) 
- swvl2
- Soil water content layer 2 (m^3/m^3) 
- swvl3
- Soil water content layer 3 (m^3/m^3) 
- swvl4
- Soil water content layer 4 (m^3/m^3) 
- Flow
- Monthly mean daily streamflow (cfs) 
Streamflow_training_22var: Extended variables with climate indices (24 columns):
- Flow
- Streamflow measurements 
- IPO
- Interdecadal Pacific Oscillation 
- IPO_lag1
- IPO with 1-month lag 
- IPO_lag2
- IPO with 2-month lag 
- Nino3.4
- Nino 3.4 index 
- Nino3.4_lag1
- Nino 3.4 with 1-month lag 
- Nino3.4_lag2
- Nino 3.4 with 2-month lag 
- PDO
- Pacific Decadal Oscillation 
- PDO_lag1
- PDO with 1-month lag 
- PDO_lag2
- PDO with 2-month lag 
- PNA
- Pacific North American pattern 
- PNA_lag1
- PNA with 1-month lag 
- PNA_lag2
- PNA with 2-month lag 
- Precipitation
- Monthly precipitation 
- Precipitation_2Mon
- 2-month precipitation 
- Radiation
- Solar radiation 
- Radiation_2Mon
- 2-month solar radiation 
- Tmax
- Maximum temperature 
- Tmax_2Mon
- 2-month maximum temperature 
- Tmin
- Minimum temperature 
- Tmin_2Mon
- 2-month minimum temperature 
- VP
- Vapor pressure 
- VP_2Mon
- 2-month vapor pressure 
Testing datasets: Same structure as corresponding training datasets.
Details
Dataset Structure:
-  10var datasets: Basic environmental variables (12 columns) 
-  22var datasets: Extended variables with climate indices (24 columns) 
-  Training datasets: Used for model building 
-  Testing datasets: Used for model evaluation 
Climate Indices: IPO (Interdecadal Pacific Oscillation), Nino3.4 (El Niño), PDO (Pacific Decadal Oscillation), PNA (Pacific North American pattern)
Data Sources: ERA5 Land, Daymet, USGS, and climate indices databases
Source
Environmental monitoring stations, climate indices databases, ERA5 Land, Daymet, and USGS
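Lagged columns such as PDO_lag1 and PDO_lag2 can be derived from a monthly series with a short base R helper. The sketch assumes that a 1-month lag means the value of the previous month; the helper lag_by and the index values are made up for illustration and are not taken from the datasets.

```r
# Shift a series down by k positions, padding the start with NA
lag_by <- function(x, k) c(rep(NA, k), head(x, -k))

# Made-up monthly index values, for illustration only
idx <- data.frame(month = 1:6,
                  PDO = c(0.1, -0.2, 0.4, 0.3, -0.5, 0.0))

idx$PDO_lag1 <- lag_by(idx$PDO, 1)   # value from the previous month
idx$PDO_lag2 <- lag_by(idx$PDO, 2)   # value from two months earlier
print(idx)
```

The first k rows of each lagged column are NA, so they would need to be dropped before model training.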
Evaluate SCE and SCA Model Performance
Description
Evaluate model performance for SCE or SCA models.
Usage
## S3 method for class 'SCE'
evaluate(object, Testing_data, Training_data, digits = 3, ...)
## S3 method for class 'SCA'
evaluate(object, Testing_data, Training_data, digits = 3, ...)
Arguments
| object | An SCE or SCA model object | 
| Testing_data | Testing dataset | 
| Training_data | Training dataset | 
| digits | Number of decimal places (default: 3) | 
| ... | Additional arguments | 
Value
Model performance metrics.
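This manual does not list which metrics evaluate() reports. As a hedged illustration, the sketch below computes four metrics that are common in hydrological model evaluation (RMSE, MAE, Nash-Sutcliffe efficiency, and squared correlation) in base R. The function performance_metrics is hypothetical and is not part of the package.

```r
# Common goodness-of-fit metrics for observed versus simulated values
performance_metrics <- function(obs, sim, digits = 3) {
  stopifnot(length(obs) == length(sim))
  rmse <- sqrt(mean((sim - obs)^2))
  mae  <- mean(abs(sim - obs))
  nse  <- 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
  r2   <- cor(obs, sim)^2
  round(c(RMSE = rmse, MAE = mae, NSE = nse, R2 = r2), digits)
}

# Toy example: every simulated value is one unit too high
obs <- c(2, 4, 6, 8, 10)
sim <- c(3, 5, 7, 9, 11)
metrics <- performance_metrics(obs, sim)
print(metrics)
```

The toy case shows why more than one metric is useful: the squared correlation is 1 despite the constant bias, while RMSE, MAE, and NSE all register it.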
See Also
Variable Importance for SCE and SCA Models
Description
Calculate variable importance for SCE or SCA models.
Usage
## S3 method for class 'SCE'
importance(object, OOB_weight = TRUE, digits = 2, ...)
## S3 method for class 'SCA'
importance(object, digits = 2, ...)
Arguments
| object | An SCE or SCA model object | 
| OOB_weight | Use out-of-bag weights for importance calculation (SCE only, default: TRUE) | 
| digits | Number of decimal places to round the returned relative importance values (default: 2) | 
| ... | Additional arguments | 
Value
Variable importance rankings. For convenience, relative importance values are rounded to digits decimal places.
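The effect of weighting and of the digits argument can be illustrated with a small base R sketch. It assumes that per-tree importance scores are combined as a weighted average and then normalized to sum to 1; the manual does not describe the actual calculation, so this is only one plausible reading. The function relative_importance, the scores, and the weights are all hypothetical.

```r
# Hypothetical raw importance scores: one row per tree, one column per predictor
raw <- matrix(c(4, 1, 0,
                2, 2, 1,
                6, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("tree1", "tree2", "tree3"),
                              c("Prcp", "Tmax", "VP")))
tree_weights <- c(0.5, 0.3, 0.2)   # hypothetical OOB-based tree weights

relative_importance <- function(raw, weights = NULL, digits = 2) {
  if (is.null(weights)) weights <- rep(1, nrow(raw))
  weights <- weights / sum(weights)
  combined <- colSums(raw * weights)   # weights[i] scales row i (tree i)
  sort(round(combined / sum(combined), digits), decreasing = TRUE)
}

weighted   <- relative_importance(raw, tree_weights)
unweighted <- relative_importance(raw)
print(weighted)
print(unweighted)
```

In this toy case the ranking of Tmax and VP differs between the weighted and unweighted results. Because the values are rounded, they need not sum to exactly 1.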
See Also
Predict Using SCE and SCA Models
Description
Make predictions on new data using SCE or SCA models.
Usage
## S3 method for class 'SCE'
predict(object, newdata, ...)
## S3 method for class 'SCA'
predict(object, newdata, ...)
Arguments
| object | An SCE or SCA model object | 
| newdata | New data for prediction | 
| ... | Additional arguments | 
Value
Predictions for the new data.
See Also
Print SCE and SCA Model Objects
Description
Print information about SCE or SCA model objects.
Usage
## S3 method for class 'SCE'
print(x, ...)
## S3 method for class 'SCA'
print(x, ...)
Arguments
| x | An SCE or SCA model object | 
| ... | Additional arguments (not used) | 
Details
For SCE objects, prints ensemble information including number of trees, parameters, predictors, predictants, and OOB performance metrics.
For SCA objects, prints tree structure information including total nodes, leaf nodes, cutting/merging actions, and variable names.
Value
Prints model information and returns the object invisibly.
See Also
Summary Methods for SCE and SCA Models
Description
Provide concise summaries of model structure and performance for SCE and SCA objects.
Usage
## S3 method for class 'SCE'
summary(object, ...)
## S3 method for class 'SCA'
summary(object, ...)
Arguments
| object | An SCE or SCA model object | 
| ... | Additional arguments passed to or ignored by methods | 
Details
For summary.SCE, the method prints ensemble configuration, out-of-bag (OOB) performance statistics, tree structure information, and tree weight distribution.
For summary.SCA, the method prints tree structure information and variable summaries for the single SCA tree.
Value
Invisibly returns the input object after printing the summary.
See Also
SCE, SCA, print, importance, evaluate
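Returning the object invisibly, as the print and summary methods do, is a standard S3 convention. The sketch below shows the pattern on a hypothetical class named toy_model; it is not the package's code.

```r
# An S3 print method that prints a message and returns its input invisibly
print.toy_model <- function(x, ...) {
  cat("Toy model with", length(x$predictors), "predictors\n")
  invisible(x)
}

model <- structure(list(predictors = c("Prcp", "Tmax")), class = "toy_model")

returned <- print(model)               # prints once; the value can be captured
visibility <- withVisible(print(model))
visibility$visible                     # FALSE: no second auto-print at top level
```

Because the object comes back unchanged, a call such as print(model) can be placed in the middle of a larger expression without affecting the result.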