
adaR is a wrapper for ada-url, a fast, WHATWG-compliant URL parser written in modern C++.
It implements several auxiliary functions to work with URLs:
utils::URLdecode (~40x speedup)
More general information on URL parsing can be found in the
introductory vignette via vignette("adaR").
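As a rough illustration of the decoding helper mentioned above, the following sketch compares base R with adaR. It assumes the adaR counterpart of utils::URLdecode is exported as url_decode2(); the example URL is made up for illustration.

```r
library(adaR)

# Hypothetical percent-encoded URL used only for illustration
encoded <- "https://example.org/path%20with%20spaces?q=a%2Bb"

# Base R decoder
utils::URLdecode(encoded)

# adaR counterpart (assumed here to be url_decode2())
url_decode2(encoded)
```

For plain ASCII input like this, both calls are expected to return the same decoded string; the difference is in speed on large vectors.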
adaR is part of a series of R packages to analyse webtracking data.
You can install the development version of adaR from GitHub with:
# install.packages("devtools")
devtools::install_github("gesistsa/adaR")
The version on CRAN can be installed with:
install.packages("adaR")
This is a basic example which shows all the returned components of a URL.
library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
#>                                                      href protocol username
#> 1 https://user_1:password_1@example.org:8080/api?q=1#frag   https:   user_1
#>     password             host    hostname port pathname search  hash
#> 1 password_1 example.org:8080 example.org 8080     /api   ?q=1 #frag
  /*
   * https://user:pass@example.com:1234/foo/bar?baz#quux
   *       |     |    |          | ^^^^|       |   |
   *       |     |    |          | |   |       |   `----- hash_start
   *       |     |    |          | |   |       `--------- search_start
   *       |     |    |          | |   `----------------- pathname_start
   *       |     |    |          | `--------------------- port
   *       |     |    |          `----------------------- host_end
   *       |     |    `---------------------------------- host_start
   *       |     `--------------------------------------- username_end
   *       `--------------------------------------------- protocol_end
   */
It solves some problems of urltools with more complex URLs.
urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.
   7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>   scheme                            domain port
#> 1  https 40.7519848,-74.0015045,14.\n   7z <NA>
#>                                                                                 path
#> 1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   parameter fragment
#> 1      <NA>     <NA>
ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m
   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
#>                                                                                                                                                                         href
#> 1 https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   protocol username password           host       hostname port
#> 1   https:                   www.google.com www.google.com     
#>                                                                                                                                               pathname
#> 1 /maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m   5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
#>   search hash
#> 1
A “raw” URL parse using ada is extremely fast (see ada-url.com), but carrying
this speed over to R is tricky. The performance is still comparable to
urltools::url_parse, with the noted advantage in accuracy in
some practical circumstances.
bench::mark(
  ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ada           158µs    165µs     5913.        0B     45.3
#> 2 urltools      104µs    108µs     8488.        0B     42.6
For further benchmark results, see benchmark.md in data_raw.
There are four more groups of functions available to work with URL parsing:
ada_get_*() get a specific component
ada_has_*() check if a specific component is present
ada_set_*() set a specific component of URLs
ada_clear_*() remove a specific component from URLs
public_suffix() extracts the top-level domain of URLs using the
public suffix list, excluding private domains.
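Before the public_suffix() example, here is a minimal sketch of one function from each of the four ada_*() groups listed above. The specific functions and the new port value are chosen only for illustration; see the package reference for the full set in each group.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# ada_get_*(): extract a single component
ada_get_hostname(url)

# ada_has_*(): check whether a component is present
ada_has_port(url)

# ada_set_*(): replace a component (the port value here is arbitrary)
ada_set_port(url, "1010")

# ada_clear_*(): remove a component
ada_clear_hash(url)
```

All of these accept a character vector of URLs, so they can be applied to many URLs at once. public_suffix(), in turn, is used as follows: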
urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
#> [1] "co.uk"                            "gov.uk"                          
#> [3] "butthisispartoftheps.kawasaki.jp"
If you are wondering about the last URL: the list also contains
wildcard suffixes such as *.kawasaki.jp, which need to be
matched.
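The effect of a wildcard rule can be illustrated with a toy sketch in base R. This is not how adaR implements public_suffix(); the rule set toy_rules and the helper toy_public_suffix() are hypothetical names introduced here, the rule set is a tiny excerpt, and exception rules of the real list are ignored.

```r
# Tiny, hypothetical excerpt of rules; the real public suffix list is far larger
toy_rules <- c("uk", "co.uk", "gov.uk", "jp", "kawasaki.jp", "*.kawasaki.jp")

# Hypothetical helper: return the longest suffix of `hostname` that matches a rule
toy_public_suffix <- function(hostname, rules) {
  labels <- strsplit(hostname, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Walk from the longest candidate suffix to the shortest
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    # The same candidate with its leftmost label replaced by a wildcard
    wildcard <- paste(c("*", labels[-seq_len(i)]), collapse = ".")
    if (candidate %in% rules || (i < n && wildcard %in% rules)) {
      return(candidate)
    }
  }
  NA_character_
}

hosts <- c(
  "subsub.sub.domain.co.uk",
  "thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
vapply(hosts, toy_public_suffix, character(1), rules = toy_rules, USE.NAMES = FALSE)
# returns "co.uk" and "butthisispartoftheps.kawasaki.jp"
```

In the second host, "butthisispartoftheps" is not listed literally; it matches because the rule *.kawasaki.jp accepts any single label in front of kawasaki.jp.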
The logo is created from this portrait of Ada Lovelace, a very early pioneer in Computer Science.