The cloudml package provides an R interface to Google Cloud Machine Learning Engine, a managed service that enables:
Scalable training of models built with the keras, tfestimators, and tensorflow R packages.
On-demand access to training on GPUs, including the new Tesla P100 GPUs from NVIDIA®.
Hyperparameter tuning to optmize key attributes of model architectures in order to maximize predictive accuracy.
Deployment of trained models to the Google global prediction platform that can support thousands of users and TBs of data.
CloudML is a managed service where you pay only for the hardware resources that you use. Prices vary depending on configuration (e.g. CPU vs. GPU vs. multiple GPUs). See https://cloud.google.com/vertex-ai/pricing for additional details.
Before you can begin training models with CloudML you need to have a Google Cloud Account. If you don’t already have an account you can create one at https://console.cloud.google.com.
If you are a new customer of Google Cloud you will receive a 12-month, $300 credit that can be applied to your use of CloudML. In addition, Google is providing a $200 credit for users of the R interface to CloudML (this credit applies to both new and existing customers). Use this link to apply for the $200 credit.
The account creation process will lead you through creating a new project. To enable the Machine Learning API for this project navigate to the “ML Engine” menu on the left. Doing this for the first time will enable the ML API and allow you to submit ML jobs.
Start by installing the cloudml R package from CRAN as follows:
Then, install the Google Cloud SDK, a set of utilties that
enable you to interact with your Google Cloud account from within R. You
can install the SDK using the gcloud_install()
function.
Note that in order to ensure that the cloudml
package can find your installation of the SDK you should accept the
default installation location (~/) suggested within the
installer.
As part of the installation you are asked to specify a default
account, project, and compute region for Google Cloud. These settings
are then used automatically for all CloudML jobs. To change the default
account, project, or region you can use the gcloud_init()
function:
Note that you don’t need to execute gcloud_init() now as
this was done automatically as part of
gcloud_install().
Once you’ve completed these steps you are ready to train models with CloudML!
To train a model on CloudML, first work the training script locally (perhaps with a smaller sample of your dataset). The script can contain arbitrary R code which trains and/or evaluates a model. Once you’ve confirmed that things work as expected, you can submit a CloudML job to perform training in the cloud.
To submit a job, call the cloudml_train() function,
specifying the R script to execute for training:
All of the files within the current working directory will be bundled up and sent along with the script to CloudML.
Note that the very first time you submit a job to CloudML the various packages required to run your script will be compiled from source. This will make the execution time of the job considerably longer that you might expect. It’s only the first job that incurs this overhead though (since the package installations are cached), and subsequent jobs will run more quickly.
If you are using RStudio v1.1 or higher, then the CloudML training job is monitored (and it’s results collected) using a background terminal:
When the job is complete, training results can be collected back to your local system (this is done automatically when monitoring the job using a background terminal in RStudio). A run report is displayed after the job is collected:
You can list all previous runs as a data frame using the
ls_runs() function:
Data frame: 6 x 37
                            run_dir eval_loss eval_acc metric_loss metric_acc metric_val_loss metric_val_acc
6 runs/cloudml_2018_01_26_135812740    0.1049   0.9789      0.0852     0.9760          0.1093         0.9770
2 runs/cloudml_2018_01_26_140015601    0.1402   0.9664      0.1708     0.9517          0.1379         0.9687
5 runs/cloudml_2018_01_26_135848817    0.1159   0.9793      0.0378     0.9887          0.1130         0.9792
3 runs/cloudml_2018_01_26_135936130    0.0963   0.9780      0.0701     0.9792          0.0969         0.9790
1 runs/cloudml_2018_01_26_140045584    0.1486   0.9682      0.1860     0.9504          0.1453         0.9693
4 runs/cloudml_2018_01_26_135912819    0.1141   0.9759      0.1272     0.9655          0.1087         0.9762
# ... with 30 more columns:
#   flag_dense_units1, flag_dropout1, flag_dense_units2, flag_dropout2, samples, validation_samples,
#   batch_size, epochs, epochs_completed, metrics, model, loss_function, optimizer, learning_rate,
#   script, start, end, completed, output, source_code, context, type, cloudml_console_url,
#   cloudml_created, cloudml_end, cloudml_job, cloudml_log_url, cloudml_ml_units, cloudml_start,
#   cloudml_stateYou can view run reports using the view_run()
function:
# view the latest run
view_run()
# view a specific run
view_run("runs/cloudml_2017_12_15_182614794")There are many tools available to list, filter, and compare training runs. For additional information see the documentation for the tfruns package.
By default, CloudML utilizes “standard” CPU-based instances suitable
for training simple models with small to moderate datasets. You can
request the use of other machine types, including ones with GPUs, using
the master_type parameter of
cloudml_train().
For example, the following would train the same model as above but with a Tesla K80 GPU:
To train using a Tesla P100 GPU you
would specify "standard_p100":
To train on a machine with 4 Tesla P100 GPU’s you would specify
"complex_model_m_p100":
See the CloudML website for documentation on available machine types. Also note that GPU instances can be considerably more expensive that CPU ones! See the documentation on CloudML Pricing for details. a ## Learning More
To learn more about using CloudML with R, see the following articles:
Training with CloudML goes into additional depth on managing training jobs and their output.
Hyperparameter Tuning explores how you can improve the performance of your models by running many trials with distinct hyperparameters (e.g. number and size of layers) to determine their optimal values.
Google Cloud Storage provides information on copying data between your local machine and Google Storage and also describes how to use data within Google Storage during training.
Deploying Models describes how to deploy trained models and generate predictions from them.