The supportR package is an amalgam of distinct functions
I’ve written to accomplish small data wrangling, quality control, or
visualization tasks. These functions tend to be short and
narrowly-defined. Because each was written to meet a specific need,
they also tend not to be inter-related or united by a
common theme. To help account for that, I’ve divided this vignette into
groups of functions that do have similar purposes.
If any of these functions don’t work for you, please open an issue so that I can fix it as soon as possible.
In order to demonstrate some of the data wrangling functions of
supportR, we’ll use some example data from Dr. Allison Horst’s palmerpenguins
R package.
# Check the structure of the penguins dataset
str(penguins)
#> 'data.frame':    344 obs. of  8 variables:
#>  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
#>  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
#>  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
#>  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
#>  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
#>  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
With that data loaded, we can use the summary_table
function to quickly get group-wise summaries and retrieve generally
useful summary statistics.
The groups argument accepts a vector of the
column names to group by, while response must be a single
numeric column. The drop_na argument allows group
combinations that result in an NA to be automatically dropped (e.g., a
penguin with no island listed would be dropped). The mean,
standard deviation (SD), sample size, and standard error (SE) are all
returned to facilitate easy figure creation. There is also a
round_digits argument that lets you specify how many digits
you’d like to retain for the mean, SD, and SE.
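To make the four returned statistics concrete, here is a minimal base R sketch that computes them by hand for a small toy data frame. This is not the internals of summary_table; the toy data, the object names, and the use of SD divided by the square root of the sample size for the SE are assumptions made for illustration.

```r
# Toy data (hypothetical): one grouping column and one numeric response
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b", "b"),
                  value = c(1, 2, 3, 2, 4, 6, 8))

# Compute the statistics separately for each group, then stack the results
stats <- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  n <- nrow(d)      # sample size
  s <- sd(d$value)  # standard deviation
  data.frame(group = d$group[1],
             mean = round(mean(d$value), digits = 2),
             std_dev = round(s, digits = 2),
             sample_size = n,
             std_error = round(s / sqrt(n), digits = 2))
}))

# One row per group, with the same kinds of columns described above
stats
```

The rounding step plays the role that round_digits is described as playing; the toy data contain no missing values, so how NAs are counted is not addressed here.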
# Summarize the data
supportR::summary_table(data = penguins, groups = c("species", "island"),
                        response = "bill_length_mm", drop_na = TRUE)
#>     species    island  mean std_dev sample_size std_error
#> 1    Adelie    Biscoe 38.98    2.48          44      0.37
#> 2    Adelie     Dream 38.50    2.47          56      0.33
#> 3    Adelie Torgersen 38.95    3.03          52      0.42
#> 4 Chinstrap     Dream 48.83    3.34          68      0.41
#> 5    Gentoo    Biscoe 47.50    3.08         124      0.28
The safe_rename function allows (perhaps predictably) “safe” renaming
of column names in a given data object. The
‘bad’ column names and corresponding ‘good’ names must be specified. The
order of entries in the two vectors must match (i.e., the first bad name
will be replaced with the first good name), but that order need not
match the order in which they occur in the data!
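The order-matching rule described above can be sketched in base R as follows. This is only an illustration of the idea, not the code inside safe_rename; the helper name rename_by_match and its checks are hypothetical.

```r
# Hypothetical helper: rename columns by pairing 'bad' and 'good' names by position
rename_by_match <- function(data, bad_names, good_names) {
  # The two vectors must be the same length so they can be paired up
  stopifnot(length(bad_names) == length(good_names))
  # Every 'bad' name must actually exist in the data
  stopifnot(all(bad_names %in% names(data)))
  # Find where each 'bad' name sits, wherever that is in the data
  positions <- match(bad_names, names(data))
  names(data)[positions] <- good_names
  data
}

# Make a dataframe to demonstrate
df <- data.frame(first = 1:3, middle = 4:6, second = 7:9)

# The 'bad' names are given in a different order than they occur in the data
renamed <- rename_by_match(data = df,
                           bad_names = c("second", "middle"),
                           good_names = c("third", "second"))
names(renamed)
```

Looking up each old name by position, rather than assuming a column order, is what lets the two vectors be supplied in any order as long as they agree with each other.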
When loading text data into R you may have encountered pernicious
‘special’ characters that can affect the meaning of character data.
These characters are often non-ASCII (see here for more context on
“ASCII” characters) and supportR offers the
replace_non_ascii function to find and replace these
characters with visually similar equivalents in a given vector.
Note that letters with accents above or below them (e.g., umlaut,
accent circumflex, etc.) are technically non-ASCII characters. Because
they are often a part of proper nouns rather than being categorically
incorrect I’ve split off whether they should be replaced into an
include_letters argument that defaults to
FALSE.
Also, non-ASCII characters trigger errors in the R package submission
process, so the example below uses “u-escape” codes for the non-ASCII
characters to still demonstrate the function without causing errors when
I submit version updates of supportR to CRAN.
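As a rough base R illustration of the find-and-replace idea (not a call to replace_non_ascii and not its implementation), the sketch below flags strings that contain non-ASCII characters and swaps three of them for ASCII look-alikes. The test strings, the object names, and the particular replacements chosen are assumptions.

```r
# Strings written with u-escape codes so this file itself stays ASCII-only
x <- c("\u201Ccurly quotes\u201D", "en\u2013dash", "plain text")

# Flag elements holding any character with a code point above 127 (non-ASCII)
has_non_ascii <- vapply(x, function(s) any(utf8ToInt(s) > 127L),
                        FUN.VALUE = logical(1), USE.NAMES = FALSE)

# Pair each non-ASCII character with a visually similar ASCII replacement
bad <- c("\u201C", "\u201D", "\u2013")
good <- c("\"", "\"", "-")

# Replace each one in turn, matching the characters literally
cleaned <- x
for (i in seq_along(bad)) {
  cleaned <- gsub(pattern = bad[i], replacement = good[i],
                  x = cleaned, fixed = TRUE)
}
cleaned
```

A real lookup table would need many more entries than these three; the point is only that each problem character is mapped to a single ASCII stand-in.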
crop_tri allows dropping one “triangle” of a symmetric
dataframe / matrix. It also includes a drop_diag argument
that accepts a logical for whether to drop the diagonal of the data
object. This is primarily useful (I find) in allowing piping through
this function as opposed to using the base R notation for removing a
triangle of a symmetric data object.
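For comparison, the base R notation referred to above looks roughly like the following, using lower.tri() to index the cells to blank out. The object names are placeholders chosen for this sketch.

```r
# Define a simple matrix with symmetric dimensions
mat <- matrix(data = c(1:2, 2:1), nrow = 2, ncol = 2)

# Blank out the lower triangle but keep the diagonal
upper_only <- mat
upper_only[lower.tri(upper_only, diag = FALSE)] <- NA

# Blank out the lower triangle and the diagonal
upper_no_diag <- mat
upper_no_diag[lower.tri(upper_no_diag, diag = TRUE)] <- NA

upper_only
upper_no_diag
```

Because this form relies on assignment into a copy of the object, it does not slot into a pipeline as neatly as a single function call does.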
# Define a simple matrix with symmetric dimensions
mat <- matrix(data = c(1:2, 2:1), nrow = 2, ncol = 2)
# Crop off its lower triangle
supportR::crop_tri(data = mat, drop_tri = "lower", drop_diag = FALSE)
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]   NA    1
# Drop the diagonal as well
supportR::crop_tri(data = mat, drop_tri = "lower", drop_diag = TRUE)
#>      [,1] [,2]
#> [1,]   NA    2
#> [2,]   NA   NA
array_melt allows users to ‘melt’ an array of dimensions
X, Y, and Z into a dataframe containing columns “x”, “y”, “z”, and
“value” where “value” is whatever was stored at those coordinates in the
array.
# Make data to fill the array
vec1 <- c(5, 9, 3)
vec2 <- c(10:15)
# Create dimension names (x = col, y = row, z = which matrix)
x_vals <- c("Col_1","Col_2","Col_3")
y_vals <- c("Row_1","Row_2","Row_3")
z_vals <- c("Mat_1","Mat_2")
# Make an array from these components
g <- array(data = c(vec1, vec2), dim = c(3, 3, 2),
           dimnames = list(x_vals, y_vals, z_vals))
# "Melt" the array into a dataframe
melted <- supportR::array_melt(array = g)
# Look at the top of the result
head(melted)
#>       z     y     x value
#> 1 Mat_1 Col_1 Row_1     5
#> 2 Mat_1 Col_1 Row_2    10
#> 3 Mat_1 Col_1 Row_3    13
#> 4 Mat_1 Col_2 Row_1     9
#> 5 Mat_1 Col_2 Row_2    11
#> 6 Mat_1 Col_2 Row_3    14
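For readers curious how such a melt relates to base R, a similar long-format result can be sketched with as.data.frame.table(). This is an analogue rather than what array_melt does internally: naming the dimnames list "x", "y", and "z" is an assumption made here so the columns get those names, and the row and column order may differ from the output above.

```r
# Rebuild the same array, this time with named dimensions
vec1 <- c(5, 9, 3)
vec2 <- c(10:15)
g_named <- array(data = c(vec1, vec2), dim = c(3, 3, 2),
                 dimnames = list(x = c("Col_1", "Col_2", "Col_3"),
                                 y = c("Row_1", "Row_2", "Row_3"),
                                 z = c("Mat_1", "Mat_2")))

# One row per cell of the array, with the cell contents in "value"
melted_base <- as.data.frame.table(g_named, responseName = "value")

# Look at the top of the result
head(melted_base)
```

Here the first dimension varies fastest, so consecutive rows step through "x" before "y" and "z".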