Introduction to filtro

⚠️ work-in-progress

This document demonstrates some basic uses of filtro. We’ll need to load a few packages:

library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)

A scoring example

The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable, Sale_Price (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.

ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data

To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use its fit() method with the standard formula interface to compute the scores.

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)

The data frame of results can be accessed via object@results.

ames_aov_pval_res@results
#> # A tibble: 73 × 4
#>    name      score outcome    predictor   
#>    <chr>     <dbl> <chr>      <chr>       
#>  1 aov_pval 237.   Sale_Price MS_SubClass 
#>  2 aov_pval 130.   Sale_Price MS_Zoning   
#>  3 aov_pval  NA    Sale_Price Lot_Frontage
#>  4 aov_pval  NA    Sale_Price Lot_Area    
#>  5 aov_pval   5.75 Sale_Price Street      
#>  6 aov_pval  19.2  Sale_Price Alley       
#>  7 aov_pval  71.3  Sale_Price Lot_Shape   
#>  8 aov_pval  21.4  Sale_Price Land_Contour
#>  9 aov_pval   1.38 Sale_Price Utilities   
#> 10 aov_pval  12.0  Sale_Price Lot_Config  
#> # ℹ 63 more rows

A couple of notes here:

Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both cases: a factor predictor with a numeric outcome, and a numeric predictor with a factor outcome.

Because the outcome here is numeric, any predictor that is not a factor will produce an NA. When an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.
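
For example, we can inspect that value on the fitted object (output not shown here):

# Safe value substituted for predictors whose score is NA
ames_aov_pval_res@fallback_value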

By default, this filter computes -log10(p_value), so that larger values indicate more important predictors. If you prefer raw p-values, the helper function dont_log_pvalues() is available (demonstrated in the scores plural section below).

For this specific filter, i.e., score_aov_*, case weights are supported. For other filters, check the property object@case_weights to see whether case weights can be used.
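
For example:

# Does this scoring method support case weights?
ames_aov_pval_res@case_weights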

Filtering and ranking

There are two main ways to rank and select a top proportion or number of features:

To filter or rank on a single score, we can use the built-in show_best_score_*() methods.

For multi-parameter optimization, we can use API calls adapted from {desirability2}: the show_best_desirability_*() methods.

A filtering example for score singular

The show_best_score_prop() function ranks predictors on a single score and returns the top-scoring ones. The prop_terms argument lets us control the proportion of predictors to keep.

# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#>    name     score outcome    predictor     
#>    <chr>    <dbl> <chr>      <chr>         
#>  1 aov_pval  Inf  Sale_Price Neighborhood  
#>  2 aov_pval  288. Sale_Price Garage_Finish 
#>  3 aov_pval  243. Sale_Price Garage_Type   
#>  4 aov_pval  242. Sale_Price Foundation    
#>  5 aov_pval  237. Sale_Price MS_SubClass   
#>  6 aov_pval  183. Sale_Price Heating_QC    
#>  7 aov_pval  173. Sale_Price BsmtFin_Type_1
#>  8 aov_pval  132. Sale_Price Mas_Vnr_Type  
#>  9 aov_pval  130. Sale_Price Overall_Cond  
#> 10 aov_pval  130. Sale_Price MS_Zoning     
#> 11 aov_pval  127. Sale_Price Exterior_1st  
#> 12 aov_pval  116. Sale_Price Exterior_2nd  
#> 13 aov_pval  116. Sale_Price Bsmt_Exposure 
#> 14 aov_pval  100. Sale_Price Garage_Cond
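
To keep a fixed number of predictors instead, the package also provides show_best_score_num() (listed at the end of this document). A minimal sketch, assuming its num_terms argument mirrors prop_terms:

# Show best scores, based on a fixed number of predictors
# (num_terms is an assumed argument name, mirroring prop_terms)
ames_aov_pval_res |> show_best_score_num(num_terms = 5)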

A filtering example for scores plural

To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula to compute the scores.

# ANOVA raw p-value 
natural_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)

Next, we create a list to collect these score class objects, including their associated metadata and scores.

# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res, 
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)

Next, we fill in the safe value specific to each method and remove the outcome column.

# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
#> # A tibble: 73 × 5
#>    predictor     aov_pval cor_pearson     imp_rf infogain
#>    <chr>            <dbl>       <dbl>      <dbl>    <dbl>
#>  1 MS_SubClass  1.68e-237       1     0.000450    0.266  
#>  2 MS_Zoning    2.75e-130       1     0.000434    0.113  
#>  3 Lot_Frontage 1.11e- 16       0.165 0.000192    0.146  
#>  4 Lot_Area     1.11e- 16       0.255 0.000690    0.140  
#>  5 Street       1.77e-  6       1     0.00000300  0.00365
#>  6 Alley        6.06e- 20       1     0.0000111   0.0254 
#>  7 Lot_Shape    5.17e- 72       1     0.0000806   0.0675 
#>  8 Land_Contour 3.79e- 22       1     0.0000582   0.0212 
#>  9 Utilities    4.16e-  2       1     0           0.00165
#> 10 Lot_Config   1.04e- 12       1     0.0000112   0.0133 
#> # ℹ 63 more rows

Analogous to show_best_desirability(), the show_best_desirability_prop() function allows joint optimization of multiple metrics using desirability functions.

A desirability function maps values of a metric to a [0, 1] range where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain.
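
As a minimal sketch of the computation (assuming a linear scaling exponent, and that .d_overall is the geometric mean of the individual desirabilities, which is consistent with the outputs below):

# Hedged sketch: a linear maximize() ramp, clipped to [0, 1]
d_max <- function(x, low, high) pmin(pmax((x - low) / (high - low), 0), 1)

d_cor <- d_max(0.696, low = 0, high = 1) # Gr_Liv_Area's Pearson correlation
d_rf  <- 1                               # its rescaled forest importance
sqrt(d_cor * d_rf)                       # ~0.834, its .d_overall in the two-metric example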

For example:

# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |> 
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_max_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Street                        1          1
#>  4 Alley                         1          1
#>  5 Lot_Shape                     1          1
#>  6 Land_Contour                  1          1
#>  7 Utilities                     1          1
#>  8 Lot_Config                    1          1
#>  9 Land_Slope                    1          1
#> 10 Neighborhood                  1          1
#> # ℹ 63 more rows

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_overall
#>    <chr>                       <dbl>         <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1          0.834
#>  2 Year_Built                  0.615         0.869      0.731
#>  3 Total_Bsmt_SF               0.626         0.554      0.588
#>  4 Garage_Cars                 0.675         0.453      0.553
#>  5 Garage_Type                 1             0.298      0.546
#>  6 First_Flr_SF                0.603         0.479      0.537
#>  7 Year_Remod_Add              0.586         0.452      0.515
#>  8 Garage_Area                 0.651         0.395      0.507
#>  9 Foundation                  1             0.184      0.428
#> 10 Full_Bath                   0.577         0.272      0.396
#> # ℹ 63 more rows

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1               0.832      0.833
#>  2 Year_Built                  0.615         0.869           0.709      0.724
#>  3 Total_Bsmt_SF               0.626         0.554           0.625      0.600
#>  4 Garage_Cars                 0.675         0.453           0.708      0.600
#>  5 Garage_Area                 0.651         0.395           0.684      0.560
#>  6 First_Flr_SF                0.603         0.479           0.551      0.542
#>  7 Year_Remod_Add              0.586         0.452           0.514      0.515
#>  8 Garage_Type                 1             0.298           0.453      0.513
#>  9 Neighborhood                1             0.119           1          0.491
#> 10 Foundation                  1             0.184           0.454      0.437
#> # ℹ 63 more rows

In show_best_desirability_prop(), there is an argument, prop_terms, that lets us control the proportion of predictors to keep.

# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696        1                0.832      0.833
#>  2 Year_Built                  0.615        0.869            0.709      0.724
#>  3 Total_Bsmt_SF               0.626        0.554            0.625      0.600
#>  4 Garage_Cars                 0.675        0.453            0.708      0.600
#>  5 Garage_Area                 0.651        0.395            0.684      0.560
#>  6 First_Flr_SF                0.603        0.479            0.551      0.542
#>  7 Year_Remod_Add              0.586        0.452            0.514      0.515
#>  8 Garage_Type                 1            0.298            0.453      0.513
#>  9 Neighborhood                1            0.119            1          0.491
#> 10 Foundation                  1            0.184            0.454      0.437
#> 11 Full_Bath                   0.577        0.272            0.527      0.435
#> 12 MS_SubClass                 1            0.105            0.576      0.392
#> 13 Garage_Finish               1            0.0862           0.501      0.351
#> 14 Fireplaces                  0.489        0.217            0.331      0.328

Besides maximize(), three additional verbs are available, each suited to a different situation: minimize(), for metrics where smaller values are better (e.g., raw p-values); target(), for metrics whose best value lies at a specific point between low and high; and constrain(), for metrics where any value inside a range is equally acceptable.

For example:

ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_min_aov_pval .d_overall
#>    <chr>                  <dbl>      <dbl>
#>  1 MS_SubClass                1          1
#>  2 MS_Zoning                  1          1
#>  3 Alley                      1          1
#>  4 Lot_Shape                  1          1
#>  5 Land_Contour               1          1
#>  6 Neighborhood               1          1
#>  7 Condition_1                1          1
#>  8 Bldg_Type                  1          1
#>  9 House_Style                1          1
#> 10 Overall_Cond               1          1
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor      .d_target_cor_pearson .d_overall
#>    <chr>                          <dbl>      <dbl>
#>  1 Lot_Area                       1.00       1.00 
#>  2 Second_Flr_SF                  0.969      0.969
#>  3 Bsmt_Full_Bath                 0.969      0.969
#>  4 Latitude                       0.952      0.952
#>  5 Half_Bath                      0.921      0.921
#>  6 Open_Porch_SF                  0.899      0.899
#>  7 Wood_Deck_SF                   0.879      0.879
#>  8 Mas_Vnr_Area                   0.709      0.709
#>  9 Fireplaces                     0.637      0.637
#> 10 TotRms_AbvGrd                  0.632      0.632
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_box_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Lot_Area                      1          1
#>  4 Street                        1          1
#>  5 Alley                         1          1
#>  6 Lot_Shape                     1          1
#>  7 Land_Contour                  1          1
#>  8 Utilities                     1          1
#>  9 Lot_Config                    1          1
#> 10 Land_Slope                    1          1
#> # ℹ 63 more rows

Available score objects and filter methods

The list of score class objects included in the package:
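
One way to reproduce this list (a sketch, assuming all score objects are exported with a score_ prefix):

# List exported objects whose names start with "score_"
ls("package:filtro", pattern = "^score_")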

#>  [1] "score_aov_fstat"          "score_aov_pval"          
#>  [3] "score_cor_pearson"        "score_cor_spearman"      
#>  [5] "score_gain_ratio"         "score_imp_rf"            
#>  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"    
#>  [9] "score_info_gain"          "score_roc_auc"           
#> [11] "score_sym_uncert"         "score_xtab_pval_chisq"   
#> [13] "score_xtab_pval_fisher"

The list of filter methods for score singular:

#> [1] "show_best_score_cutoff" "show_best_score_dual"   "show_best_score_num"   
#> [4] "show_best_score_prop"

The list of filter methods for scores plural:

#> [1] "show_best_desirability_num"  "show_best_desirability_prop"