⚠️ work-in-progress
This document demonstrates some basic uses of filtro. We’ll need to load a few packages:
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable Sale_Price (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.
ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data
To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the fit() method with the standard formula to compute the scores.
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)
The data frame of results can be accessed via object@results.

ames_aov_pval_res@results
#> # A tibble: 73 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval 237. Sale_Price MS_SubClass
#> 2 aov_pval 130. Sale_Price MS_Zoning
#> 3 aov_pval NA Sale_Price Lot_Frontage
#> 4 aov_pval NA Sale_Price Lot_Area
#> 5 aov_pval 5.75 Sale_Price Street
#> 6 aov_pval 19.2 Sale_Price Alley
#> 7 aov_pval 71.3 Sale_Price Lot_Shape
#> 8 aov_pval 21.4 Sale_Price Land_Contour
#> 9 aov_pval 1.38 Sale_Price Utilities
#> 10 aov_pval 12.0 Sale_Price Lot_Config
#> # ℹ 63 more rows
A couple of notes here:
Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both of the following cases:
The predictors are numeric and the outcome is categorical, or
The predictors are categorical and the outcome is numeric.
Because the outcome is numeric, any predictor that is not a factor will result in an NA. In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.
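For instance, the safe value for this filter can be inspected directly (output not shown here):

# Inspect the safe value used when a score is NA
ames_aov_pval_res@fallback_value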
By default, this filter computes -log10(p_value), so that larger values indicate more important predictors. If users prefer raw p-values, a helper function dont_log_pvalues() is available (used in the multi-score example below).
For this specific filter, i.e., score_aov_*, case weights are supported. For other filters, you can check the property object@case_weights to see if they can use case weights.
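For example (output not shown here):

# TRUE if the filter can use case weights
ames_aov_pval_res@case_weights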
There are two main ways to rank and select a top proportion or number of features.
To filter or rank a single score, we can use built-in methods:
show_best_score_*()
rank_best_score_*()
For multi-parameter optimization, we can use API calls adapted from {desirability2}:
show_best_desirability_*()
The show_best_score_prop() function returns the best score for a single metric. The prop_terms argument lets us control the proportion of predictors to keep.
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval Inf Sale_Price Neighborhood
#> 2 aov_pval 288. Sale_Price Garage_Finish
#> 3 aov_pval 243. Sale_Price Garage_Type
#> 4 aov_pval 242. Sale_Price Foundation
#> 5 aov_pval 237. Sale_Price MS_SubClass
#> 6 aov_pval 183. Sale_Price Heating_QC
#> 7 aov_pval 173. Sale_Price BsmtFin_Type_1
#> 8 aov_pval 132. Sale_Price Mas_Vnr_Type
#> 9 aov_pval 130. Sale_Price Overall_Cond
#> 10 aov_pval 130. Sale_Price MS_Zoning
#> 11 aov_pval 127. Sale_Price Exterior_1st
#> 12 aov_pval 116. Sale_Price Exterior_2nd
#> 13 aov_pval 116. Sale_Price Bsmt_Exposure
#> 14 aov_pval 100. Sale_Price Garage_Cond
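With 73 predictors, prop_terms = 0.2 keeps 14 of them (0.2 × 73 = 14.6, rounded down). To request an exact count instead, the related method show_best_score_num() (listed at the end of this document) can be used. A hypothetical call, assuming a num_terms argument analogous to prop_terms:

# Hypothetical: keep the 10 best-scoring predictors (argument name assumed)
ames_aov_pval_res |> show_best_score_num(num_terms = 10)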
To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula to compute the scores.
# ANOVA raw p-value
natural_units <- score_aov_pval |> dont_log_pvalues()

ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  # `seed` makes the random forest results reproducible
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)
Next, we create a list to collect these score class objects, including their associated metadata and scores.
# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
Then, we fill in the safe value specific to each method and remove the outcome column.
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)

ames_scores_results
#> # A tibble: 73 × 5
#> predictor aov_pval cor_pearson imp_rf infogain
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 MS_SubClass 1.68e-237 1 0.000450 0.266
#> 2 MS_Zoning 2.75e-130 1 0.000434 0.113
#> 3 Lot_Frontage 1.11e- 16 0.165 0.000192 0.146
#> 4 Lot_Area 1.11e- 16 0.255 0.000690 0.140
#> 5 Street 1.77e- 6 1 0.00000300 0.00365
#> 6 Alley 6.06e- 20 1 0.0000111 0.0254
#> 7 Lot_Shape 5.17e- 72 1 0.0000806 0.0675
#> 8 Land_Contour 3.79e- 22 1 0.0000582 0.0212
#> 9 Utilities 4.16e- 2 1 0 0.00165
#> 10 Lot_Config 1.04e- 12 1 0.0000112 0.0133
#> # ℹ 63 more rows
Analogous to show_best_desirability(), the show_best_desirability_prop() function allows joint optimization of multiple metrics using desirability functions.

A desirability function maps values of a metric to a [0, 1] range, where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain.
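To make the mapping concrete, here is a minimal sketch using desirability2's standalone helper d_max(), which implements a linear increasing desirability (this assumes the default linear scale; maximize() may differ in details):

library(desirability2)

# 0 at `low`, 1 at `high`, linear in between
d_max(c(0, 0.5, 0.696, 1), low = 0, high = 1)
# ~ 0, 0.5, 0.696, 1 -- e.g., a correlation of 0.696 maps to a
# desirability of 0.696, as for Gr_Liv_Area in the examples below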
For example:
# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |>
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_max_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Street 1 1
#> 4 Alley 1 1
#> 5 Lot_Shape 1 1
#> 6 Land_Contour 1 1
#> 7 Utilities 1 1
#> 8 Lot_Config 1 1
#> 9 Land_Slope 1 1
#> 10 Neighborhood 1 1
#> # ℹ 63 more rows
# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_overall
#> <chr> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.834
#> 2 Year_Built 0.615 0.869 0.731
#> 3 Total_Bsmt_SF 0.626 0.554 0.588
#> 4 Garage_Cars 0.675 0.453 0.553
#> 5 Garage_Type 1 0.298 0.546
#> 6 First_Flr_SF 0.603 0.479 0.537
#> 7 Year_Remod_Add 0.586 0.452 0.515
#> 8 Garage_Area 0.651 0.395 0.507
#> 9 Foundation 1 0.184 0.428
#> 10 Full_Bath 0.577 0.272 0.396
#> # ℹ 63 more rows
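The .d_overall column is consistent with the geometric mean of the individual desirability columns; for Gr_Liv_Area in the first row above:

# Geometric mean of Gr_Liv_Area's two desirabilities
exp(mean(log(c(0.696, 1))))  # sqrt(0.696) is about 0.834, matching .d_overall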
# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.869 0.709 0.724
#> 3 Total_Bsmt_SF 0.626 0.554 0.625 0.600
#> 4 Garage_Cars 0.675 0.453 0.708 0.600
#> 5 Garage_Area 0.651 0.395 0.684 0.560
#> 6 First_Flr_SF 0.603 0.479 0.551 0.542
#> 7 Year_Remod_Add 0.586 0.452 0.514 0.515
#> 8 Garage_Type 1 0.298 0.453 0.513
#> 9 Neighborhood 1 0.119 1 0.491
#> 10 Foundation 1 0.184 0.454 0.437
#> # ℹ 63 more rows
In show_best_desirability_prop(), there is an argument called prop_terms that lets us control the proportion of predictors to keep.
# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.869 0.709 0.724
#> 3 Total_Bsmt_SF 0.626 0.554 0.625 0.600
#> 4 Garage_Cars 0.675 0.453 0.708 0.600
#> 5 Garage_Area 0.651 0.395 0.684 0.560
#> 6 First_Flr_SF 0.603 0.479 0.551 0.542
#> 7 Year_Remod_Add 0.586 0.452 0.514 0.515
#> 8 Garage_Type 1 0.298 0.453 0.513
#> 9 Neighborhood 1 0.119 1 0.491
#> 10 Foundation 1 0.184 0.454 0.437
#> 11 Full_Bath 0.577 0.272 0.527 0.435
#> 12 MS_SubClass 1 0.105 0.576 0.392
#> 13 Garage_Finish 1 0.0862 0.501 0.351
#> 14 Fireplaces 0.489 0.217 0.331 0.328
Besides maximize(), additional verbs are available: minimize(), target(), and constrain(). They are used in different situations:

maximize() when larger values are better.
minimize() when smaller values are better.
target() when a specific value of the metric is important.
constrain() when a range of values is equally desirable.
For example:
ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_min_aov_pval .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Alley 1 1
#> 4 Lot_Shape 1 1
#> 5 Land_Contour 1 1
#> 6 Neighborhood 1 1
#> 7 Condition_1 1 1
#> 8 Bldg_Type 1 1
#> 9 House_Style 1 1
#> 10 Overall_Cond 1 1
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_target_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 Lot_Area 1.00 1.00
#> 2 Second_Flr_SF 0.969 0.969
#> 3 Bsmt_Full_Bath 0.969 0.969
#> 4 Latitude 0.952 0.952
#> 5 Half_Bath 0.921 0.921
#> 6 Open_Porch_SF 0.899 0.899
#> 7 Wood_Deck_SF 0.879 0.879
#> 8 Mas_Vnr_Area 0.709 0.709
#> 9 Fireplaces 0.637 0.637
#> 10 TotRms_AbvGrd 0.632 0.632
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_box_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Lot_Area 1 1
#> 4 Street 1 1
#> 5 Alley 1 1
#> 6 Lot_Shape 1 1
#> 7 Land_Contour 1 1
#> 8 Utilities 1 1
#> 9 Lot_Config 1 1
#> 10 Land_Slope 1 1
#> # ℹ 63 more rows
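Judging by the .d_max_, .d_min_, .d_target_, and .d_box_ column prefixes in the outputs above, the four verbs line up with desirability2's standalone helpers. A minimal sketch of those helpers at a single point (default arguments assumed):

library(desirability2)

x <- 0.5
d_max(x, low = 0, high = 1)   # maximize(): larger is better  -> 0.5
d_min(x, low = 0, high = 1)   # minimize(): smaller is better -> 0.5
d_target(x, low = 0.2, target = 0.255, high = 0.9)  # target(): peaks at 0.255, ~0.62 here
d_box(x, low = 0.2, high = 1) # constrain(): 1 inside [0.2, 1], 0 outside -> 1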
The full list of score class objects included in the package:
#> [1] "score_aov_fstat" "score_aov_pval"
#> [3] "score_cor_pearson" "score_cor_spearman"
#> [5] "score_gain_ratio" "score_imp_rf"
#> [7] "score_imp_rf_conditional" "score_imp_rf_oblique"
#> [9] "score_info_gain" "score_roc_auc"
#> [11] "score_sym_uncert" "score_xtab_pval_chisq"
#> [13] "score_xtab_pval_fisher"
The list of filter methods for a single score:
#> [1] "show_best_score_cutoff" "show_best_score_dual" "show_best_score_num"
#> [4] "show_best_score_prop"
The list of filter methods for multiple scores:
#> [1] "show_best_desirability_num" "show_best_desirability_prop"