The goal of rcprd is to simplify the process of extracting and processing CPRD Aurum data into an ‘analysis-ready’ dataset which can be used for statistical analyses. This process is difficult in R because the raw data are very large and are provided in a large number of .txt files, which cannot all be read into the R workspace at once. rcprd uses RSQLite to create SQLite databases which are stored on the hard disk. These are then queried to extract the required information for a cohort of interest. The process closely follows that of the rEHR package.
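The on-disk approach described above can be sketched with DBI and RSQLite directly. This is a minimal illustration under stated assumptions, not rcprd's own implementation: the table name, column names and toy data are invented for the example.

```r
library(DBI)

### Toy stand-ins for raw files (assumption: real extracts would be large
### .txt files, read in one at a time)
chunks <- list(
  data.frame(patid = c("1", "2"), medcodeid = c("A", "B"), stringsAsFactors = FALSE),
  data.frame(patid = c("3", "1"), medcodeid = c("A", "C"), stringsAsFactors = FALSE)
)

### Create an SQLite database on disk (a temporary file here)
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

### Append each chunk to the same table, so only one chunk is held in memory
for (chunk in chunks) {
  dbWriteTable(con, "observation", chunk, append = TRUE)
}

### Pull back only the rows matching a code of interest
res <- dbGetQuery(con,
                  "SELECT patid, medcodeid FROM observation WHERE medcodeid = ?",
                  params = list("A"))

dbDisconnect(con)
```

Because the full table lives in the SQLite file rather than in the R workspace, only the query result (`res`) needs to fit in memory.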
For a detailed guide on how to use rcprd please see the user-guide vignette.
The package can be installed from CRAN as follows:
# install.packages("rcprd")
You can install the development version of rcprd from GitHub with:
# install.packages("devtools")
# devtools::install_github("alexpate30/rcprd")
This is a basic example which shows you how to create an analysis-ready dataset containing a ‘history of’ variable. All data provided with the package and used in this example are simulated.
Load rcprd:
library(rcprd)
#> Loading required package: data.table
Create a cohort based on the patient files:
pat <- extract_cohort(filepath = system.file("aurum_data", package = "rcprd"))
str(pat)
#> 'data.frame':    12 obs. of  12 variables:
#>  $ patid         : chr  "1" "2" "3" "4" ...
#>  $ pracid        : int  49 79 98 53 62 54 49 79 98 53 ...
#>  $ usualgpstaffid: chr  "6" "11" "43" "72" ...
#>  $ gender        : int  2 1 1 2 2 1 2 1 1 2 ...
#>  $ yob           : int  1984 1932 1930 1915 1916 1914 1984 1932 1930 1915 ...
#>  $ mob           : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ emis_ddate    : Date, format: "1976-11-21" "1979-02-14" ...
#>  $ regstartdate  : Date, format: "1940-07-24" "1929-02-23" ...
#>  $ patienttypeid : int  58 21 81 10 45 85 58 21 81 10 ...
#>  $ regenddate    : Date, format: "1996-08-25" "1945-03-19" ...
#>  $ acceptable    : int  1 0 1 0 0 1 1 0 1 0 ...
#>  $ cprd_ddate    : Date, format: "1935-03-17" "1932-02-05" ...
Connect to an SQLite database (in this example, we create a temporary file):
aurum_extract <- connect_database(file.path(tempdir(), "temp.sqlite"))
Read in medical data (from the observation files) and add to the SQLite database:
cprd_extract(db = aurum_extract, 
             filepath = system.file("aurum_data", package = "rcprd"), 
             filetype = "observation")
#>   |                                                                              |                                                                      |   0%
#> Adding C:/Program Files/R/R-4.4.2/library/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_001.txt 2024-11-14 15:20:22.632475
#>   |                                                                              |=======================                                               |  33%
#> Adding C:/Program Files/R/R-4.4.2/library/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_002.txt 2024-11-14 15:20:22.746196
#>   |                                                                              |===============================================                       |  67%
#> Adding C:/Program Files/R/R-4.4.2/library/rcprd/aurum_data/aurum_allpatid_set1_extract_observation_003.txt 2024-11-14 15:20:22.83656
#>   |                                                                              |======================================================================| 100%
Query the database for specific codes and store in an R object using the db_query function:
### Create codelist
codelist <- "187341000000114"
### Query for observations with this code
db_query(db_open = aurum_extract,
         tab = "observation",
         codelist_vector = codelist)
#>     patid consid pracid  obsid obsdate enterdate staffid parentobsid
#>    <char> <char>  <int> <char>   <num>     <num>  <char>      <char>
#> 1:      1     42      1     81   -5373      4302      85          35
#> 2:      2     56      1     77   -5769    -13828      24           4
#> 3:      6     40      1     41  -14727     -6929      98          80
#>          medcodeid value numunitid obstypeid numrangelow numrangehigh probobsid
#>             <char> <num>     <int>     <int>       <num>        <num>    <char>
#> 1: 187341000000114    84        79        67          24           22         5
#> 2: 187341000000114    46        92        81          56           30        18
#> 3: 187341000000114    28        20         5          41           97        92
Add an index date to the patient file, relative to which variables will be extracted:
pat$fup_start <- as.Date("01/01/2020", format = "%d/%m/%Y")
Extract a ‘history of’ type variable, which is equal to 1 if an individual has a record with the specified medcodeid prior to the index date, and equal to 0 otherwise:
ho <- extract_ho(pat, 
                 codelist_vector = codelist, 
                 indexdt = "fup_start", 
                 db_open = aurum_extract, 
                 tab = "observation",
                 return_output = TRUE)
str(ho)
#> 'data.frame':    12 obs. of  2 variables:
#>  $ patid: chr  "1" "2" "3" "4" ...
#>  $ ho   : int  1 1 0 0 0 1 0 0 0 0 ...
Merge the patient file with the ‘history of’ variable to create an analysis-ready dataset:
### Merge on patid, keeping every patient in the cohort
analysis.ready.pat <- merge(pat[,c("patid", "fup_start", "gender")], ho, by.x = "patid", by.y = "patid", all.x = TRUE) 
analysis.ready.pat
#>    patid  fup_start gender ho
#> 1      1 2020-01-01      2  1
#> 2     10 2020-01-01      2  0
#> 3     11 2020-01-01      2  0
#> 4     12 2020-01-01      1  0
#> 5      2 2020-01-01      1  1
#> 6      3 2020-01-01      1  0
#> 7      4 2020-01-01      2  0
#> 8      5 2020-01-01      2  0
#> 9      6 2020-01-01      1  1
#> 10     7 2020-01-01      2  0
#> 11     8 2020-01-01      1  0
#> 12     9 2020-01-01      1  0
rcprd currently provides functionality to extract medical data from
the observation file (including specific functions for extracting test
data) and medication data from the drugissue file. Low-level functions
allow the user to query the SQLite database and write their own
functions to define variables of interest. Mid-level functions allow
users to extract variables of certain types (‘history of’, ‘time to
event’, and ‘most recent test result’). High-level functions allow
users to extract specific variables, such as body mass index, systolic
blood pressure, smoking status and diabetes status. The high-level
functions embed decisions about how each variable is defined, so be
sure to check the code to make sure it matches your definition. For
example, extract_diabetes will return a categorical variable with
three categories: Absent, type1 and type2. If an individual has a
record for both type 1 and type 2 diabetes (according to the user’s
code lists), extract_diabetes will assign the individual to the group
type1.
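The precedence rule described above for diabetes status can be illustrated in base R. This is a sketch of the rule only, not the code inside extract_diabetes: the data frame and its has_type1 and has_type2 columns are hypothetical.

```r
### Hypothetical flags: does each patient have any type 1 / type 2 code recorded?
flags <- data.frame(
  patid     = c("1", "2", "3", "4"),
  has_type1 = c(FALSE, TRUE, FALSE, TRUE),
  has_type2 = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

### Type 1 takes precedence when records for both types are present
flags$diabetes <- ifelse(flags$has_type1, "type1",
                         ifelse(flags$has_type2, "type2", "Absent"))
flags$diabetes <- factor(flags$diabetes, levels = c("Absent", "type1", "type2"))

flags[, c("patid", "diabetes")]
```

Patient 4 has records for both types and is assigned to type1. If your study definition would resolve this conflict differently, that is the kind of decision worth checking in the package code.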
The parts of this package which create the SQLite database are
somewhat dependent on the structure of the raw CPRD Aurum data. For
example, the functions to read in the raw text files
(e.g. extract_txt_obs) are hard coded to format variables
with specific names in a certain way (e.g. convert obsdate
from a character variable to a date variable). Over time, the structure
of the CPRD Aurum data may change, which could impact the utility of
this package. We will endeavour to keep rcprd updated with new releases
of CPRD Aurum. Where possible, we have also tried to protect against
such changes by giving the user flexible options as well as the
defaults. For example, add_to_database defaults to using
extract_txt_obs to read in the raw text data when
filetype = "observation" is specified. However, there is
also an option extract_txt_func, which allows users to
specify their own function to read in the text data, and which
overrides the use of extract_txt_obs.
Despite this, there may be breaking points we have not anticipated. If you find one, please let us know.
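As an illustration of the kind of function a user might supply in place of the default reader, the sketch below reads a tab-delimited file and converts obsdate from character to Date. The function name my_extract_txt_obs, the assumed signature (a file path in, a data frame out), the reduced set of columns and the dd/mm/yyyy date format are all assumptions made for this example. Check the documentation for add_to_database for what extract_txt_func actually expects.

```r
### Hypothetical custom reader: takes a file path, returns a data frame
my_extract_txt_obs <- function(filepath) {
  ### Read every column as character so identifiers keep their exact form
  dat <- read.delim(filepath, colClasses = "character", stringsAsFactors = FALSE)
  ### Convert selected columns to the types needed downstream
  dat$obsdate <- as.Date(dat$obsdate, format = "%d/%m/%Y")
  dat$value   <- as.numeric(dat$value)
  dat
}

### Toy file with a reduced set of columns (invented for the example)
tmp <- tempfile(fileext = ".txt")
writeLines(c("patid\tmedcodeid\tobsdate\tvalue",
             "1\t187341000000114\t21/11/1976\t84",
             "2\t187341000000114\t14/02/1979\t46"),
           tmp)

obs <- my_extract_txt_obs(tmp)
str(obs)
```

Reading identifiers such as patid and medcodeid as character avoids loss of precision for long numeric codes.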
If you encounter a bug, please file an issue with a minimal reproducible example on GitHub.