googleCloudStorageR

R library for interacting with the Google Cloud Storage JSON API (API docs).

Home page: http://code.markedmondson.me/googleCloudStorageR/

Setup

Google Cloud Storage charges you for storage (prices here).

You can use your own Google Project, with a credit card added, to create buckets where the charges will apply. This can be done in the Google API Console.

Configuring your own Google Project

The instructions below are for when you visit the Google API console (https://console.developers.google.com/apis/)

For local use

  1. Click 'Create a new Client ID', and choose "Installed Application".

  2. Download the client ID JSON.

  3. Set the client ID via googleAuthR::gar_set_client():

     googleAuthR::gar_set_client("your-json-file.json")
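
You can then authenticate interactively; a minimal sketch, assuming the default OAuth2 flow (gcs_auth() opens a browser window for consent):

## set the client, then authenticate against Google Cloud Storage
library(googleCloudStorageR)
googleAuthR::gar_set_client("your-json-file.json")
gcs_auth()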
    

For Shiny use

  1. Click 'Create a new Client ID', and choose "Web Application".

  2. Download the client ID JSON.

  3. Add the URL where your Shiny app will run, with no port number, e.g. https://mark.shinyapps.io/searchConsoleRDemo/

  4. Additionally (or instead), add localhost or 127.0.0.1 with a port number for local testing, e.g. http://127.0.0.1:1221. Remember the port number, as you will need it later to launch the app.

  5. Set the web client ID via googleAuthR::gar_set_client():

     googleAuthR::gar_set_client(web_json = "your-json-file.json")
    
  6. To run the app locally, specify the port number you used in step 4, e.g. shiny::runApp(port = 1221), or set a Shiny option to default to it, options(shiny.port = 1221), and launch via the Run App button in RStudio (see the sketch after this list).

  7. Running on your Shiny Server will work only for the URL from step 3.
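
For example, to launch the app locally on the registered port (a sketch; "myapp/" and port 1221 are placeholders for your own app directory and the port you registered in step 4):

## default the session to the registered port, then launch
options(shiny.port = 1221)
shiny::runApp("myapp")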

Activate API

  1. Click on "APIs"

  2. Select and activate the Cloud Storage JSON API

  3. After loading the package via library(googleCloudStorageR), it will check whether "https://www.googleapis.com/auth/devstorage.full_control" is present in getOption("googleAuthR.scopes.selected"), and add it to the existing scopes if it is not.

  4. Alternatively, set the googleAuthR option for the Google Cloud Storage scope after the library has been loaded but before authentication.

     options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/devstorage.full_control")
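
You can verify the scope is registered before authenticating:

## should include the devstorage.full_control scope
getOption("googleAuthR.scopes.selected")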
    

Setting environment variables

By default, the package looks for configuration details in environment variables. You can use these to specify a default bucket, and to auto-authenticate upon attaching the library. For example:

Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket",
           "GCS_AUTH_FILE" = "/fullpath/to/service-auth.json")

These can alternatively be set on the command line or via an Renviron.site or .Renviron file (see here for instructions).
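
For example, a .Renviron file in your home directory could contain lines like these (the bucket name and file path are placeholders):

GCS_DEFAULT_BUCKET=my-default-bucket
GCS_AUTH_FILE=/fullpath/to/service-auth.json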

Auto-authentication

To authenticate, you specify the location of a service account JSON file taken from your Google Project:

    Sys.setenv("GCS_AUTH_FILE" = "/fullpath/to/auth.json")

Examples

Setting a default Bucket

To avoid specifying the bucket in the functions below, you can set the name of your default bucket via environment variables or via the function gcs_global_bucket(). See the Setting environment variables section above for more details.

## set bucket via environment
Sys.setenv("GCS_DEFAULT_BUCKET" = "my-default-bucket")

library(googleCloudStorageR)

## check what the default bucket is
gcs_get_global_bucket()
[1] "my-default-bucket"

## you can also set a default bucket after loading the library for that session
gcs_global_bucket("your-default-bucket-2")
gcs_get_global_bucket()
[1] "my-default-bucket-2"

Downloading objects from Google Cloud storage

Once you have a Google project and created a bucket with an object in it, you can download it as below:

library(googleCloudStorageR)

## get your project name from the API console
proj <- "your-project"

## get bucket info
buckets <- gcs_list_buckets(proj)
bucket <- "your-bucket"
bucket_info <- gcs_get_bucket(bucket)
bucket_info

==Google Cloud Storage Bucket==
Bucket:          your-bucket
Project Number:  1123123123
Location:        EU
Class:           STANDARD
Created:         2016-04-28 11:39:06
Updated:         2016-04-28 11:39:06
Meta-generation: 1
eTag:            Cxx=


## get object info in the default bucket
objects <- gcs_list_objects()

## save directly to an R object (warning: don't run out of RAM if it's a big object)
## the object's content type is used to parse it into an appropriate R object
parsed_download <- gcs_get_object(objects$name[[1]])

## if you want to do your own parsing, set parseObject to FALSE
## use httr::content() to parse afterwards
raw_download <- gcs_get_object(objects$name[[1]],
                               parseObject = FALSE)

## save directly to a file in your working directory
## parseObject has no effect here; the file is downloaded raw via httr::content(req, "raw")
gcs_get_object(objects$name[[1]], saveToDisk = "csv_downloaded.csv")
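
Combining the listing and download functions lets you fetch every object in the default bucket; a minimal sketch, assuming the object names are valid local file names:

## download every object in the default bucket to the working directory
objects <- gcs_list_objects()
for (name in objects$name) {
  gcs_get_object(name, saveToDisk = name)
}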

Uploading objects < 5MB

Objects can be uploaded from files saved to disk, or passed in directly if they are data frames or list-type R objects. By default, data frames are converted to CSV via write.csv(), and lists to JSON via jsonlite::toJSON().

If you want to use another function to transform the R object, for example setting row.names = FALSE or using write.csv2, pass it in via the object_function argument.

## upload a file - the type will be guessed from the file extension, or supply type
filename <- "mtcars.csv"
write.csv(mtcars, file = filename)
gcs_upload(filename)

## upload an R data.frame directly - will be converted to csv via write.csv
gcs_upload(mtcars)

## upload an R list - will be converted to json via jsonlite::toJSON
gcs_upload(list(a = 1, b = 3, c = list(d = 2, e = 5)))

## upload an R data.frame directly, with a custom function
## function should have arguments 'input' and 'output'
## safest to supply type too
f <- function(input, output) write.csv(input, row.names = FALSE, file = output)

gcs_upload(mtcars,
           object_function = f,
           type = "text/csv")

Upload metadata

You can pass metadata with an object via the function gcs_metadata_object().

The name you pass to the metadata object will override a name set elsewhere.

meta <- gcs_metadata_object("mtcars.csv",
                            metadata = list(custom1 = 2,
                                            custom_key = "dfsdfsdfsfs"))

gcs_upload(mtcars, object_metadata = meta)

Resumable uploads for files > 5MB up to 5TB

If the file/object is under 5MB, simple uploads are used.

For files > 5MB, resumable uploads are used. This allows you to upload up to 5TB.

If the connection is interrupted during an upload, gcs_upload will retry 3 times. If it still fails, it returns a Retry object, which you can use later to resume the upload from where it stopped via gcs_retry_upload().

## write a big object to a file
big_file <- "big_filename.csv"
write.csv(big_object, file = big_file)

## attempt upload
upload_try <- gcs_upload(big_file)

## if successful, upload_try is an object metadata object
upload_try
==Google Cloud Storage Object==
Name:            big_filename.csv
Size:            8.5 Gb
Media URL        https://www.googleapis.com/download/storage/v1/b/xxxx
Bucket:          your-bucket
ID:              your-bucket/big_filename.csv/xxxx
MD5 Hash:        rshao1nxxxxxY68JZQ==
Class:           STANDARD
Created:         2016-08-12 17:33:05
Updated:         2016-08-12 17:33:05
Generation:      1471023185977000
Meta Generation: 1
eTag:            CKi90xxxxxEAE=
crc32c:          j4i1sQ==


## if unsuccessful after 3 retries, upload_try is a Retry object
==Google Cloud Storage Upload Retry Object==
File Location:     big_filename.csv
Retry Upload URL:  http://xxxx
Created:           2016-08-12 17:33:05
Type:              csv
File Size:         8.5 Gb
Upload Byte:       4343
Upload remaining:  8.1 Gb

## you can retry uploading the remaining data using gcs_retry_upload()
try2 <- gcs_retry_upload(upload_try)

Updating user access to objects

You can change who can access objects via gcs_update_object_acl(), granting a role of READER or OWNER to a user, group, domain or project, or to all (or all authenticated) users.

By default you are "OWNER" of all the objects and buckets you upload and create.

## update access of object to READER for all public
gcs_update_object_acl("your-object.csv", entity_type = "allUsers")

## update access of object for user joe@blogs.com to OWNER
gcs_update_object_acl("your-object.csv",
                      entity = "joe@blogs.com",
                      role = "OWNER")

## update access of object for googlegroup users to READER
gcs_update_object_acl("your-object.csv",
                      entity = "my-group@googlegroups.com",
                      entity_type = "group")

## update access of object for all users to OWNER on your Google Apps domain
gcs_update_object_acl("your-object.csv",
                      entity = "yourdomain.com",
                      entity_type = "domain",
                      role = "OWNER")

Deleting an object

Delete an object by passing its name (and bucket if not default)

## returns TRUE if successful, or a 404 error if not found
gcs_delete_object("your-object.csv")
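
If the object may already be gone, you can guard the call with standard R error handling (a generic pattern, not a package-specific API):

## returns FALSE instead of raising an error for a missing object
deleted <- tryCatch(gcs_delete_object("your-object.csv"),
                    error = function(e) FALSE)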

Viewing current access level to objects

Use gcs_get_object_acl() to see what the current access is for an entity + entity_type.

## default entity_type is user
acl <- gcs_get_object_acl("your-object.csv",
                          entity = "joe@blogs.com")
acl$role
[1] "OWNER"

## for allUsers and allAuthenticatedUsers, you don't need to supply entity
acl <- gcs_get_object_acl("your-object.csv",
                          entity_type = "allUsers")
acl$role
[1] "READER"

Creating download links

Once a user (or group, or the public) has access, they can reach that object via a download link generated by the function gcs_download_url().

download_url <- gcs_download_url("your-object.csv")
download_url
[1] "https://storage.cloud.google.com/your-project/your-object.csv"

Signed URLs

You can create temporary links for users who may not have a Google account but still need private access. This is achieved with the gcs_signed_url() function, to which you pass a meta object.

obj <- gcs_get_object("your_file", meta = TRUE)

signed <- gcs_signed_url(obj)

The default is for the link to be accessible for an hour, but you can alter that:

## a link that will expire 24 hours (86400 seconds) from now
signed_24h <- gcs_signed_url(obj, expiration_ts = Sys.time() + 86400)

R Session helpers

Versions of save.image(), save() and load() are implemented as gcs_save_image(), gcs_save() and gcs_load(). These functions save and load R session objects to and from the cloud.

## save the current R session including all objects
gcs_save_image()

## wipe environment
rm(list = ls())

## load up environment again
gcs_load()

Save specific objects:

cc <- 3
d <- "test1"
gcs_save("cc","d", file = "gcs_save_test.RData")

## remove the objects saved in cloud from local environment
rm(cc,d)

## load them back in from GCS
gcs_load(file = "gcs_save_test.RData")
cc == 3
[1] TRUE
d == "test1"
[1] TRUE

You can also upload .R code files and source them directly using gcs_source:

## make a R source file and upload it
cat("x <- 'hello world!'\nx", file = "example.R")
gcs_upload("example.R", name = "example.R")

## source the file to run its code
gcs_source("example.R")

## the code from the uploaded file has run
x
[1] "hello world!"

Uploading via a Shiny app

The library is also compatible with Shiny authentication flows, so you can create Shiny apps that let users log in and upload their own data.

An example of that is shown below:

library("shiny")
library("googleAuthR")
library("googleCloudStorageR")
options(googleAuthR.scopes.selected = "https://www.googleapis.com/auth/devstorage.full_control")
## optional, if you want to use your own Google project
# options("googleAuthR.client_id" = "YOUR_CLIENT_ID")
# options("googleAuthR.client_secret" = "YOUR_CLIENT_SECRET")

## you need to start the Shiny app on port 1221,
## as that's what the default googleAuthR project expects for OAuth2 authentication

## options(shiny.port = 1221)
## print(source('shiny_test.R')$value) or push the "Run App" button in RStudio

shinyApp(
  ui = shinyUI(
      fluidPage(
        googleAuthR::googleAuthUI("login"),
        fileInput("picture", "picture"),
        textInput("filename", label = "Name on Google Cloud Storage",value = "myObject"),
        actionButton("submit", "submit"),
        textOutput("meta_file")
      )
  ),
  server = shinyServer(function(input, output, session){

    access_token <- shiny::callModule(googleAuth, "login")

    meta <- eventReactive(input$submit, {

      message("Uploading to Google Cloud Storage")

      # from googleCloudStorageR
      with_shiny(gcs_upload,
                 file = input$picture$datapath,
                 # enter your bucket name here
                 bucket = "gogauth-test",
                 type = input$picture$type,
                 name = input$filename,
                 shiny_access_token = access_token())

    })

    output$meta_file <- renderText({

      req(meta())

      str(meta())

      paste("Uploaded: ", meta()$name)

    })

  })
)

Bucket administration

There are various functions for manipulating buckets:

  • gcs_list_buckets
  • gcs_get_bucket
  • gcs_create_bucket
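
For example, a minimal sketch of creating a new bucket (the names are placeholders; the location and storageClass values are assumptions matching the bucket output shown earlier):

## create a new EU-hosted bucket in your project
gcs_create_bucket("my-new-bucket",
                  projectId = "your-project",
                  location = "EU",
                  storageClass = "STANDARD")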

Object administration

You can get metadata about an object by passing meta = TRUE to gcs_get_object():

gcs_get_object("your-object", "your-bucket", meta = TRUE)

Installation


This package is on CRAN:

# latest stable version
install.packages("googleCloudStorageR")

Or, to pull a potentially unstable version directly from GitHub:

if(!require("ghit")){
    install.packages("ghit")
}
ghit::install_github("cloudyr/googleCloudStorageR")

