* `EXPLAIN (TYPE validate)` (noctua #225). Thanks to @tyner for raising the issue.
* `dbSendQuery` (noctua #223). Thanks to @tyner for raising the issue.
* `RAthena_options` (noctua #226). Thanks to @tyner for raising the issue.
* `SelectedEngineVersion` in `update_work_group` (noctua #224). Thanks to @tyner for raising the issue.
* `dbExistsTable` to catch updated AWS error message.
* `dbplyr` 2.3.3.9000+.
* `AWS_ROLE_ARN`: this caused confusion when connecting through web identity (#177).
* `dbplyr::in_catalog` when working with `dplyr::tbl` (#178); see the sketch below.
* Data scanned reported as a negative value, e.g. `INFO: (Data scanned: -43839744 Bytes)`.
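A minimal sketch of the `dbplyr::in_catalog` usage (catalog, schema and table names are illustrative):

```r
library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

# refer to an Athena table through its catalog/schema/table identifier
tbl(con, dbplyr::in_catalog("awsdatacatalog", "default", "iris"))
```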
* `clear_s3_resource` parameter added to `RAthena_options`, to prevent the AWS Athena output AWS S3 resource being cleared up by `dbClearResult` (#168). Thanks to @juhoautio for the request.
* `boto3.session.Session` class and client method (#169).
* `endpoint_override` parameter allows default endpoints for each service to be overridden accordingly (#169). Thanks to @aoyh for the request and for checking the package in development.
* `test_data` to use the `size` parameter explicitly.
* `RAthena_options` can change one parameter at a time without affecting other pre-configured settings.
* `retry_quiet` parameter in the `RAthena_options` function.
* `dbplyr` 2.0.0 backend API.
* `dplyr` to benefit from AWS Athena unload methods (noctua #174).
* `dbGetQuery`, `dbExecute`, `dbSendQuery`, `dbSendStatement` work on older versions of R (noctua #170). Thanks to @tyner for identifying the issue.
* AWS Athena `UNLOAD` (noctua #160), to take advantage of the read/write speed `parquet` has to offer. For example, setting up a table with `awswrangler` and then querying it with RAthena:

```python
import awswrangler as wr
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

# Read 10 files from the 1890 decade (~1GB)
df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
)

wr.catalog.table(database="awswrangler_test", table="noaa")
```

```r
library(DBI)
con <- dbConnect(RAthena::athena())
# Query ran using CSV output
system.time({
df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa")
})
# Info: (Data scanned: 80.88 MB)
# user system elapsed
# 57.004 8.430 160.567
RAthena::RAthena_options(cache_size = 1)
# Query ran using UNLOAD Parquet output
system.time({
df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
# user system elapsed
# 21.622 2.350 39.232
# Query ran using cache
system.time({
df = dbGetQuery(con, "SELECT * FROM awswrangler_test.noaa", unload = T)
})
# Info: (Data scanned: 80.88 MB)
# user system elapsed
#  13.738   1.886  11.029
```

* `sql_translate_env` correctly translates the R functions `quantile` and `median` to AWS Athena equivalents (noctua #153). Thanks to @ellmanj for spotting the issue.
* AWS Athena `timestamp with time zone` data type.
* `list` when converting data to AWS Athena SQL format:

```r
library(data.table)
library(DBI)
x = 5
dt = data.table(
var1 = sample(LETTERS, size = x, T),
var2 = rep(list(list("var3"= 1:3, "var4" = list("var5"= letters[1:5]))), x)
)
con <- dbConnect(RAthena::athena())
#> Version: 2.2.0
sqlData(con, dt)
# Registered S3 method overwritten by 'jsonify':
# method from
# print.json jsonlite
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
# var1 var2
# 1: 1 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 2: 2 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 3: 3 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 4: 4 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
# 5: 5 {"var3":[1,2,3],"var4":{"var5":["a","b","c","d","e"]}}
#> Version: 2.1.0
sqlData(con, dt)
# Info: Special characters "\t" has been converted to " " to help with Athena reading file format tsv
# var1 var2
# 1: 1 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 2: 2 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 3: 3 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 4: 4 1:3|list(var5 = c("a", "b", "c", "d", "e"))
# 5: 5 1:3|list(var5 = c("a", "b", "c", "d", "e"))
```

RAthena v2.2.0 now converts lists into JSON lines format so that AWS Athena can parse them with SQL array/map/JSON functions. A small downside is that an S3 method conflict occurs when `jsonify` is called to convert the lists. `jsonify` was chosen over `jsonlite` for its performance improvements (noctua #156).
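Once uploaded, such JSON lines columns can be parsed with Athena's Presto JSON functions; a minimal sketch (table and column names are illustrative):

```r
library(DBI)

con <- dbConnect(RAthena::athena())

# extract a field from the JSON lines column with Presto's json_extract
dbGetQuery(con, "select json_extract(var2, '$.var3') as var3 from my_table")
```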
* `dbIsValid` wrongly stated a connection was valid for the result class when the connection class was disconnected.
* `sql_translate_env.paste` broke with the latest version of `dbplyr`. The new method is compatible with `dbplyr >= 1.4.3` (noctua #149).
* `sql_translate_env`: added support for `stringr`/`lubridate` style functions, similar to the Postgres backend.
* `dbConnect` gains a `timezone` parameter so that the time zone between R and AWS Athena is consistent (noctua #149).
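A minimal sketch of the two additions above (the time zone value and column name are illustrative):

```r
library(DBI)
library(dbplyr)

# keep the time zone between R and AWS Athena consistent
con <- dbConnect(RAthena::athena(), timezone = "Europe/London")

# check how a stringr-style function is translated to Athena SQL
translate_sql(str_detect(col1, "pattern"), con = con)
```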
* `AthenaConnection` class: `ptr` and `info` slots changed from `list` to `environment` within the `AthenaConnection` class. This allows the class to be updated by reference and simplifies notation when viewing the class from the RStudio environment tab.
* `AthenaResult` class: `info` slot changed from `list` to `environment`, allowing the class to be updated by reference.

By utilising environments for `AthenaConnection` and `AthenaResult`, all `AthenaResult` classes created from an `AthenaConnection` will point to the same `ptr` and `info` environments for its connection. Previously `ptr` and `info` would make a copy, meaning that a modification would not affect the child or parent class, for example:
```r
# Old Method
library(DBI)
con <- dbConnect(RAthena::athena(),
                 rstudio_conn_tab = F)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# nothing happened
res@connection@info$made_up
# > NULL

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# nothing happened
con@info$made_up
# > "helloworld"

# New Method
library(DBI)
con <- dbConnect(RAthena::athena(),
                 rstudio_conn_tab = F)
res <- dbExecute(con, "select 'helloworld'")

# modifying parent class to influence child
con@info$made_up <- "helloworld"

# picked up change
res@connection@info$made_up
# > "helloworld"

# modifying child class to influence parent
res@connection@info$made_up <- "oh no!"

# picked up change
con@info$made_up
# > "oh no!"
```

* Support for AWS Athena data types `[array, row, map, json, binary, ipaddress]` (noctua #135). Conversion types can be changed through `dbConnect` and `RAthena_options`:

```r
library(DBI)
library(RAthena)
# default conversion methods
con <- dbConnect(RAthena::athena())
# change json conversion method
RAthena_options(json = "character")
RAthena:::athena_option_env$json
# [1] "character"
# change json conversion to custom method
RAthena_options(json = jsonify::from_json)
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
# change bigint conversion without affecting custom json conversion methods
RAthena_options(bigint = "numeric")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
RAthena:::athena_option_env$bigint
# [1] "numeric"
# change binary conversion without affecting bigint or json methods
RAthena_options(binary = "character")
RAthena:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024)
# {
# json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
# <environment: namespace:jsonify>
RAthena:::athena_option_env$bigint
# [1] "numeric"
RAthena:::athena_option_env$binary
# [1] "character"
# no conversion for json objects
con2 <- dbConnect(RAthena::athena(), json = "character")
# use custom json parser
con <- dbConnect(RAthena::athena(), json = jsonify::from_json)
```

* `rstudio_conn_tab` parameter within `dbConnect`.
* AWS Athena uses the `float` data type for DDL only; RAthena was wrongly parsing the `float` data type back to R. AWS Athena instead uses the data type `real` in SQL functions like `select cast` (https://docs.aws.amazon.com/athena/latest/ug/data-types.html). RAthena now correctly parses `real` to R's data type `double` (noctua #133).
* Paginate AWS returns to get all results from the AWS Glue catalogue (noctua #137).
* `.format` parameter added to `dbGetPartition`. This simply tidies up the default AWS Athena partition format:

```r
library(DBI)
library(RAthena)
con <- dbConnect(athena())
dbGetPartition(con, "test_df2", .format = T)
# Info: (Data scanned: 0 Bytes)
# year month day
# 1: 2020 11 17
dbGetPartition(con, "test_df2")
# Info: (Data scanned: 0 Bytes)
# partition
# 1: year=2020/month=11/day=17
```

* `bigint` handling, to align with other DBI interfaces, i.e. `RPostgres`. `bigint` can now be returned in the following formats: ["integer64", "integer", "numeric", "character"]:

```r
library(DBI)
con <- dbConnect(RAthena::athena(), bigint = "numeric")
```

When switching between the different file parsers, `bigint` will be represented according to the file parser, i.e. data.table: "integer64" -> vroom: "I".
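A minimal sketch of switching parser with a `bigint` mapping in play (parameter values are illustrative):

```r
library(DBI)
library(RAthena)

# bigint representation is chosen at connection time ...
con <- dbConnect(RAthena::athena(), bigint = "integer64")

# ... and is mapped to the equivalent vroom type ("I") after switching parser
RAthena_options("vroom")
```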
* `dbRemoveTable`: check if the key has "." or ends with "/" before adding "/" to the end (noctua #125).
* `Error: write_parquet requires the arrow package, please install it first and try again`
* `sql_escape_date` added into the `dplyr_integration.R` backend (#121). Thanks to @OssiLehtinen for developing the Athena date translation.
* RAthena can append to a static AWS S3 location using `uuid`.
* `use_deprecated_int96_timestamps` set to `TRUE`. This puts the `POSIXct` data type into a `java.sql.Timestamp` compatible format, such as `yyyy-MM-dd HH:mm:ss[.f...]`. Thanks to Christian N Wolz for highlighting this issue.
* `s3_upload_location` simplified how the S3 location is built. Now the `s3.location` parameter isn't altered; instead only additional components are appended, e.g. name, schema and partition.
* The `dbplyr` v2.0.0 function `in_schema` now wraps strings in quotes, which broke `db_query_fields.AthenaConnection`. Now `db_query_fields.AthenaConnection` removes any quotation from the string so that it can search AWS Glue for table metadata (noctua #117).
* To control whether AWS Athena keeps running after R has been interrupted, a new parameter has been added to `dbConnect`: `keyboard_interrupt`. Example:

```r
# Stop AWS Athena when R has been interrupted:
con <- dbConnect(RAthena::athena())
# Let AWS Athena keep running when R has been interrupted:
con <- dbConnect(RAthena::athena(),
                 keyboard_interrupt = F)
```

* RAthena would return a `data.frame` for utility SQL queries regardless of the backend file parser. This is because AWS Athena outputs SQL UTILITY queries as a text file that has to be read in line by line. Now RAthena returns the correct data format based on the file parser set in `RAthena_options`; for example, `RAthena_options("vroom")` will return tibbles.
* `dbClearResult` when the user doesn't have permission to delete AWS S3 objects (noctua #96).
* `RAthena_options` contains two new parameters to control how RAthena handles retries.
* `dbFetch` is able to return data from AWS Athena in chunks. This has been achieved by passing `NextToken` to the `AthenaResult` S4 class. This method won't be as fast as `n = -1`, as each chunk has to be processed into data frame format:

```r
library(DBI)
con <- dbConnect(RAthena::athena())
res <- dbExecute(con, "select * from some_big_table limit 10000")
dbFetch(res, 5000)
```

* `dbWriteTable` opts to use `ALTER TABLE` instead of the standard `MSCK REPAIR TABLE`. This improves performance when appending to tables with a high number of existing partitions.
* `dbWriteTable` now allows `json` to be appended to `json` DDLs created with the Openx-JsonSerDe library.
* `dbConvertTable` brings `dplyr::compute` functionality to the base package, allowing RAthena to use the power of AWS Athena to convert tables and queries to more efficient file formats in AWS S3 (#37). See the sketch after the next code block.
* `dplyr::compute` to give the same functionality as `dbConvertTable`.
* The error message when `boto3` is not detected has been updated, as several users were not sure how to get RAthena set up:

```r
stop("Boto3 is not detected please install boto3 using either: `pip install boto3 numpy` in terminal or `install_boto()`.",
     "\nIf this doesn't work please set the python you are using with `reticulate::use_python()` or `reticulate::use_condaenv()`",
     call. = FALSE)
```
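A minimal sketch of `dbConvertTable` (table names and the `file.type` value are illustrative):

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# convert an existing table into a parquet-backed table
dbConvertTable(con,
               obj = "iris_csv",
               name = "iris_parquet",
               file.type = "parquet")
```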
* `region_name` check before making a connection to AWS Athena (#110).
* `dbWriteTable` would throw a throttling error every now and again; `retry_api_call` has been built to handle the passing of data between R and AWS S3.
* `dbWriteTable` did not clear down all metadata when uploading to AWS Athena.
* `dbWriteTable` added support for DDL structures for users who have created DDLs outside of RAthena.
* RAthena retry functionality examples wrapped in `\dontrun` (#108).
* Inspired by `pyathena`, `RAthena_options` now has a new parameter `cache_size`. This implements local caching in R environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing S3's Athena output when caching isn't disabled.
* `RAthena_options` now has a `clear_cache` parameter to clear down all cached data.
* `dbRemoveTable` now utilises AWS Glue to remove tables from the AWS Glue catalogue. This brings a performance enhancement:

```r
library(DBI)
con = dbConnect(RAthena::athena())
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Athena method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
# 0.131 0.037 2.404
# upload iris dataframe for removal test
dbWriteTable(con, "iris2", iris)
# Glue method
system.time(dbRemoveTable(con, "iris2", confirm = T))
# user system elapsed
#  0.065   0.009   1.303
```

* `dbWriteTable` now supports uploading JSON lines (http://jsonlines.org/) format up to AWS Athena (#88):

```r
library(DBI)
con = dbConnect(RAthena::athena())
dbWriteTable(con, "iris2", iris, file.type = "json")
dbGetQuery(con, "select * from iris2")
```

* `dbWriteTable` appending to an existing table: the compressed file type was incorrectly returned.
* `install_boto` added `numpy` to the RAthena environment install, as `reticulate` appears to favour environments with `numpy` (https://github.com/rstudio/reticulate/issues/216).
* The RStudio connection tab runs into an issue when a Glue table isn't stored correctly (#92).
* `AWS_REGION` into `dbConnect`.
* `fwrite` (>= 1.12.4): https://github.com/Rdatatable/data.table/blob/master/NEWS.md
* `sql_translate_env` (#44):

```r
# Before
dbplyr::translate_sql("2019-01-01", con = con)
# '2019-01-01'
# Now
dbplyr::translate_sql("2019-01-01", con = con)
# DATE '2019-01-01'
```

* `paste`/`paste0` would use the default `dplyr` sql-translate-env (`concat_ws`). `paste0` now uses Presto's `concat` function and `paste` now uses pipes to get extra flexibility for custom separating values:

```r
# R code:
paste("hi", "bye", sep = "-")

# SQL translation:
('hi'||'-'||'bye')
```

* If `append` is set to `TRUE` then the existing `s3.location` will be utilised (#73).
* `db_compute` returned the table name, however an error would be raised when a user wished to write the table to another location (#74):

```
Error: SYNTAX_ERROR: line 2:6: Table awsdatacatalog.default.temp.iris does not exist
```

This has now been fixed, with `db_compute` returning `dbplyr::in_schema`:

```r
library(DBI)
library(dplyr)
con <- dbConnect(RAthena::athena())
tbl(con, "iris") %>%
  compute(name = "temp.iris")
```

* `dbListFields` didn't display partitioned columns. This has now been fixed, with the call to AWS Glue altered to include more metadata, allowing column names and partitions to be returned.
* `dbListFields`
* `RAthena_options`
* `vroom` has been restricted to >= 1.2.0 due to `integer64` support and changes to the vroom API.
* `dbStatistics` is a wrapper around boto3 `get_query_execution`, to return statistics for `RAthena::dbSendQuery` results (#67).
* `dbGetQuery` has a new parameter `statistics` to print out `dbStatistics` before returning Athena results (#67); see the sketch further below.
* `s3.location` now follows the new syntax `s3://bucket/{schema}/{table}/{partition}/{table_file}`, to align with `pyathena` and to allow tables with the same name but in different schemas to be uploaded to S3 (#73).
* `dplyr::tbl` when calling Athena using the `ident` method (noctua #64):

```r
library(DBI)
library(dplyr)
con <- dbConnect(RAthena::athena())
# ident method:
t1 <- system.time(tbl(con, "iris"))
# sub query method:
t2 <- system.time(tbl(con, sql("select * from iris")))
# ident method
# user system elapsed
# 0.082 0.012 0.288
# sub query method
# user system elapsed
#  0.993   0.138   3.660
```

* `dplyr` `sql_translate_env`: expected results have now been updated to take into account the bug fix with date fields.
* File parser can be switched from `data.table` to `vroom`. From now on it is possible to change the file parser using `RAthena_options`, for example:

```r
library(RAthena)
RAthena_options("vroom")
```

* `dbGetTables`, which returns the Athena hierarchy as a `data.frame`.
* `vroom`
* Updated R documentation to `roxygen2` 7.0.2.
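A minimal sketch of the `dbStatistics`/`statistics` features noted above (the `iris` table is assumed to exist in Athena):

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# statistics for a dbSendQuery result
res <- dbSendQuery(con, "select * from iris")
dbStatistics(res)

# or print statistics as part of dbGetQuery
df <- dbGetQuery(con, "select * from iris", statistics = TRUE)
```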
* `dbWriteTable` `append` parameter checks and uses the existing AWS Athena DDL file type. If `file.type` doesn't match the Athena DDL file type then the user will receive a warning message:

```r
warning('Appended `file.type` is not compatible with the existing Athena DDL file type and has been converted to "', File.Type, '".', call. = FALSE)
```

* `tolower` conversion removed due to request #41.
* `dbRemoveTable` can now remove S3 files for the AWS Athena table being removed.
* `as.character` was getting wrongly translated (#45).
* `INTEGER` being incorrectly translated in `sql_translate_env.R`.
* data-transfer
* `dbRemoveTable` new parameters are added in unit test.
* `sql_translate_env` unit test to cater for the bug fix.
* `dbWriteTable` will now split gzip compressed files to improve AWS Athena performance. By default gzip compressed files will be split into 20. Performance results:

```r
library(DBI)
X <- 1e8
df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))
con <- dbConnect(RAthena::athena())

# upload dataframe with different splits
dbWriteTable(con, "test_split1", df, compress = T, max.batch = nrow(df), overwrite = T)        # no splits
dbWriteTable(con, "test_split2", df, compress = T, max.batch = 0.05 * nrow(df), overwrite = T) # 20 splits
dbWriteTable(con, "test_split3", df, compress = T, max.batch = 0.1 * nrow(df), overwrite = T)  # 10 splits
```

AWS Athena performance results from the AWS console (query executed: `select count(*) from ....`):

```r
library(DBI)
X <- 1e8
df <- data.frame(w = runif(X),
                 x = 1:X,
                 y = sample(letters, X, replace = T),
                 z = sample(c(TRUE, FALSE), X, replace = T))
con <- dbConnect(RAthena::athena())

# default will now split the compressed file into 20 equal-size files
dbWriteTable(con, "test_split1", df, compress = T, overwrite = T)
```

* Added an information message to inform the user about what files have been added to the S3 location if the user is overwriting an Athena table.
* `copy_to` method now supports `compress` and `max_batch`, to align with `dbWriteTable`.
* `dbWriteTable`
* `POSIXct` to Athena: this class was converted incorrectly and AWS Athena would return `NA` instead. RAthena will now correctly convert `POSIXct` to `timestamp` and will also correctly read `timestamp` back into `POSIXct`.
* `NA` in string format: before, RAthena would return `NA` in the string class as `""`; this has now been fixed.
* RAthena would translate output into a vector with the current method `dbFetch` `n = 0`.
* `sql_translate_env`: previously RAthena would take the default `dplyr::sql_translate_env`; now RAthena has a custom method that uses data types from https://docs.aws.amazon.com/athena/latest/ug/data-types.html and window functions from https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html.
* `POSIXct` class has now been added to the data transfer unit test.
* `dplyr` `sql_translate_env` tests if R functions are correctly translated into Athena SQL syntax.
* Bug when `dbWriteTable` is called: the bug is due to the function `sqlCreateTable`, which `dbWriteTable` calls. Parameters `table` and `fields` were set to `NULL`. This has now been fixed.
* `s3.location` parameter in `dbWriteTable` can now be made nullable.
* `sqlCreateTable` info message will now only inform the user if column names have changed, and display the column names that have changed.
* `upload_data` has been rebuilt: the old "horrible" `if` statement with `paste` has been removed, and the function now relies on `sprintf` to construct the S3 location path. This method is a lot clearer in how the S3 location is created, plus it enables `dbWriteTable` to be simplified. `dbWriteTable` can now upload data to the default `s3_staging` directory created in `dbConnect`, which simplifies `dbWriteTable` to:

```r
library(DBI)
con <- dbConnect(RAthena::athena())
dbWriteTable(con, "iris", iris)
```

* `dbWriteTable`
* Data transfer test now tests `compress` and default
`s3.location` when transferring data.
* `data.table::fread`: this enables data types to be read in correctly, without requiring a second stage to convert data types once the data has been read into R.
* `data.table::fread` and `data.table::fwrite` have been disabled.
* Util functions from namespace: `write.table`, `read.csv`.
* `data.table` to namespace.
* `bigint` are converted into R `bit64::integer64` and vice versa.
* `bigint` to `integer64` in the data transfer unit test.
* `dbConnect` method.
* `dbFetch` with chunk sizes between 0-999: fixed an error where the for loop would return an error instead of breaking.
* `py_error` function: set the `call.` parameter to `FALSE`.
* `AthenaQuery` S4 class changed to `AthenaResult`.
* `dbFetch` added data type collection.
* `dbFetch` replaced S3 search for the query key with the output location from Athena.
* `dbClearResult` changed error to return the Python error as a warning, to warn that the user doesn't have permission to delete the S3 resource.
* `dbClearResult` replaced S3 search for the query key with the output location from Athena.
* `dbListTables` now returns a vector of tables from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `dbListFields` now returns column names from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `dbExistsTable` now returns a boolean from AWS Glue instead of using an AWS Athena query. This increases the speed of the call.
* `create_work_group`: creates a workgroup with the specified name (see the sketch at the end of this list).
* `delete_work_group`: deletes the workgroup with the specified name.
* `list_work_group`: lists available workgroups for the account.
* `get_work_group`: returns information about the workgroup with the specified name.
* `update_work_group`: updates the workgroup with the specified name. The workgroup's name cannot be changed.
* `get_session_token` to create temporary session credentials.
* `assume_role` to assume an AWS ARN role.
* `dbConnect`
* `set_aws_env` to set AWS tokens as environment variables.
* `get_aws_env` to return expected results from system variables.
* `tag_options` to create tag options for `create_work_group`.
* `work_group_config` and `work_group_config_update` to create the config of a workgroup.
* `AthenaConnection`
* `dbColumnInfo` method: returns a `data.frame` containing `field_name` and `type`.
* `time_check` to check how long is left on the Athena connection; if less than 15 minutes, a warning message is output to notify the user.
* `db_collect` for better integration with `dplyr`.
* `db_save_query` for better integration with `dplyr`.
* `db_copy_to` for better integration with `dplyr`.
* `dbFetch` Athena data type misalignment.
* `AthenaConnection`: build Athena query request.
* `db_desc`
* `dbConnect`
* `stop_query_execution` added to `dbClearResult` if the query is still running (`dbWriteTable`).
* `waiter` to poll, to align with Python's polling.
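A minimal sketch of the workgroup helpers listed above (workgroup name and description are illustrative):

```r
library(DBI)
library(RAthena)

con <- dbConnect(athena())

# create, inspect, list and remove a workgroup
create_work_group(con, work_group = "rathena_demo", description = "demo workgroup")
get_work_group(con, work_group = "rathena_demo")
list_work_group(con)
delete_work_group(con, work_group = "rathena_demo")
```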