Match Regular Expressions with a Nicer ‘API’
A small wrapper around the regular expression matching functions
regexpr and gregexpr that returns the results in
tidy data frames.
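For context, base R's regexpr reports capture groups through attributes on its return value (capture.start, capture.length, capture.names) when called with perl = TRUE. The sketch below shows roughly what collecting those into one data frame involves. The helper match_to_df is hypothetical and written only for illustration; it is not part of rematch2 and is not how the package is implemented.

```r
# Hypothetical helper (not part of rematch2): gather the first match and its
# capture groups from base R's regexpr() into a plain data frame.
match_to_df <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  matched <- as.integer(m) != -1L

  # Whole match; NA where the pattern did not match.
  whole <- rep(NA_character_, length(text))
  whole[matched] <- regmatches(text, m)

  # capture.start and capture.length are matrices with one row per
  # string and one column per capture group.
  starts <- attr(m, "capture.start")
  ends   <- starts + attr(m, "capture.length") - 1L
  groups <- matrix(substring(text, starts, ends), nrow = length(text))
  groups[!matched, ] <- NA_character_
  colnames(groups) <- attr(m, "capture.names")

  data.frame(groups, .text = text, .match = whole, stringsAsFactors = FALSE)
}

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
res <- match_to_df(dates, isodaten)
res
```

The sketch assumes the pattern contains named capture groups; without any groups, regexpr does not set the capture attributes at all.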
install.packages("rematch2")

Note that rematch2 is not compatible with the original
rematch package. There are at least three major changes:

* The order of the arguments for the functions is different. In
  rematch2 the text vector is first, and pattern is second.
* In the result, .match is the last column instead of the first.
* rematch2 returns tibble data frames. See
  https://github.com/hadley/tibble.
library(rematch2)

With capture groups:
dates <- c("2016-04-20", "1977-08-08", "not a date", "2016",
  "76-03-02", "2012-06-30", "2015-01-21 19:58")
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 x 5
#>      ``    ``    ``            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21

Named capture groups:
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 x 5
#>    year month   day            .text     .match
#>   <chr> <chr> <chr>            <chr>      <chr>
#> 1  2016    04    20       2016-04-20 2016-04-20
#> 2  1977    08    08       1977-08-08 1977-08-08
#> 3  <NA>  <NA>  <NA>       not a date       <NA>
#> 4  <NA>  <NA>  <NA>             2016       <NA>
#> 5  <NA>  <NA>  <NA>         76-03-02       <NA>
#> 6  2012    06    30       2012-06-30 2012-06-30
#> 7  2015    01    21 2015-01-21 19:58 2015-01-21

A slightly more complex example:
github_repos <- c(
    "metacran/crandb",
    "jeroenooms/curl@v0.9.3",
    "jimhester/covr#47",
    "hadley/dplyr@*release",
    "r-lib/remotes@550a3c7d3f9e1493a2ba",
    "/$&@R64&3"
)
owner_rx   <- "(?:(?<owner>[^/]+)/)?"
repo_rx    <- "(?<repo>[^/@#]+)"
subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx     <- "(?:@(?<ref>[^*].*))"
pull_rx    <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx  <- sprintf(
    "^(?:%s%s%s%s|(?<catchall>.*))$",
    owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 x 9
#>        owner    repo subdir                  ref  pull  release  catchall
#>        <chr>   <chr>  <chr>                <chr> <chr>    <chr>     <chr>
#> 1   metacran  crandb                                                     
#> 2 jeroenooms    curl                      v0.9.3                         
#> 3  jimhester    covr                                47                   
#> 4     hadley   dplyr                                   *release          
#> 5      r-lib remotes        550a3c7d3f9e1493a2ba                         
#> 6                                                               /$&@R64&3
#> # ... with 2 more variables: .text <chr>, .match <chr>

Extract all names, and also first names and last names:
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 x 4
#>       first      last                              .text    .match
#>      <list>    <list>                              <chr>    <list>
#> 1 <chr [2]> <chr [2]>   Ben Franklin and Jefferson Davis <chr [2]>
#> 2 <chr [1]> <chr [1]>               "\tMillard Fillmore" <chr [1]>
not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
not$last
#> [[1]]
#> [1] "Franklin" "Davis"   
#> 
#> [[2]]
#> [1] "Fillmore"
not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#> 
#> [[2]]
#> [1] "Millard Fillmore"

re_exec and re_exec_all are similar to
re_match and re_match_all, but they also
return match positions. These functions return match records. A match
record has three components: match, start,
end, and each component can be a vector. It is similar to a
data frame in this respect.
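To make the three components concrete, the following sketch rebuilds a match record for the first capture group by hand, using only base R's regexpr with perl = TRUE. The list first_record is a hypothetical stand-in written for illustration; it is a plain list, not the class that rematch2 actually returns.

```r
# Illustration only: a hand-built "match record" (match, start, end)
# for the first capture group, using base R rather than rematch2.
name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)

m <- regexpr(name_rex, notables, perl = TRUE)

# Column index of the capture group named "first".
g <- which(attr(m, "capture.names") == "first")

# Positions are 1-based; end = start + length - 1.
start <- as.vector(attr(m, "capture.start")[, g])
end   <- start + as.vector(attr(m, "capture.length")[, g]) - 1L

first_record <- list(
  match = substring(notables, start, end),
  start = start,
  end   = end
)
first_record
```

Each component is a vector with one element per input string, which is why the record behaves much like a small data frame.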
pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#> *     <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>

Unfortunately R does not allow hierarchical data frames (i.e. a
column of a data frame cannot be another data frame), but
rematch2 defines some special classes and a $
operator to make it easier to extract parts of re_exec and
re_exec_all matches. You simply query the
match, start or end part of a
column:
pos$first$match
#> [1] "Ben"     "Millard"
pos$first$start
#> [1] 3 2
pos$first$end
#> [1] 5 8

re_exec_all is very similar, but these queries return
lists, with an arbitrary number of matches:
allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 x 4
#>        first       last                              .text     .match
#>       <list>     <list>                              <chr>     <list>
#> 1 <list [3]> <list [3]>   Ben Franklin and Jefferson Davis <list [3]>
#> 2 <list [3]> <list [3]>               "\tMillard Fillmore" <list [3]>
allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#> 
#> [[2]]
#> [1] "Millard"
allpos$first$start
#> [[1]]
#> [1]  3 20
#> 
#> [[2]]
#> [1] 2
allpos$first$end
#> [[1]]
#> [1]  5 28
#> 
#> [[2]]
#> [1] 8

MIT © Mango Solutions, Gábor Csárdi