r-lib / sodium

R bindings to libsodium

Home Page: https://docs.ropensci.org/sodium

Efficiently decrypting vectors of GPS coordinates

koenniem opened this issue

I am working with a fairly large dataset containing GPS coordinates that were encrypted with sodium in another programme. I now need to decrypt them for analysis, but I am not sure how to do so efficiently. The example below shows how I am currently decrypting the data.

library(sodium)

# Create some fake GPS coordinates
data <- replicate(
  n = 400000,
  expr = paste0(
    sample(0:50, size = 1), ".", 
    paste0(sample(0:9, size = 14, replace = TRUE), collapse = "")
  )
)

# Generate keypair
key <- keygen()
pub <- pubkey(key)

# Encrypt message with pubkey
# Efficiency doesn't matter here
# For some reason, serialize doesn't work for my data
msg <- lapply(data, charToRaw) 
ciphertext <- lapply(msg, function(x) simple_encrypt(x, pub))
ciphertext <- lapply(ciphertext, bin2hex)

# Now for decrypting
# How to do it faster?
out <- lapply(ciphertext, hex2bin)
out <- lapply(out, simple_decrypt, key = key)
out <- lapply(out, rawToChar)
out <- unlist(out)
identical(out, data)
#> [1] TRUE

Created on 2022-12-02 with reprex v2.0.2

There are two steps that slow down the process:

  1. sodium::hex2bin() only accepts a single value.
  2. sodium::simple_decrypt() only accepts a single value.

Running hex2bin() on the encrypted data takes about 7 seconds on my machine, and decrypting takes about 35 seconds. Please keep in mind that this is just an example; on the real data I would have to repeat this process many times. Normally I would not know whether this is fast or slow (I do not know much about encryption), but collapsing the ciphertext into a single string with paste() and then running hex2bin() once already gives a significant speed boost (see the sketch below).
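For reference, here is a minimal sketch of that collapsing trick (the names all_hex, n_bytes, chunks, etc. are illustrative, not part of sodium): hex2bin() runs once on the concatenated ciphertext, and the resulting raw vector is split back into per-message chunks using the known byte length of each ciphertext. Note that this only speeds up the hex conversion; simple_decrypt() is still called once per message.

# Convert all hex strings to raw in a single call
all_hex <- paste0(unlist(ciphertext), collapse = "")
all_raw <- hex2bin(all_hex)

# Two hex characters encode one byte, so each chunk's size is known
n_bytes <- nchar(unlist(ciphertext)) / 2
ends    <- cumsum(n_bytes)
starts  <- ends - n_bytes + 1

# Split the single raw vector back into one ciphertext per coordinate
chunks <- Map(function(s, e) all_raw[s:e], starts, ends)

# Decryption itself is still element-wise
out <- vapply(chunks, function(x) rawToChar(simple_decrypt(x, key)), character(1))
identical(out, data)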

In an ideal world, you'd run vectorized functions:

ciphertext |>
  hex2bin() |>
  simple_decrypt(key = key) |>
  rawToChar()

However, this is not possible with sodium. Is there something wrong with my approach, or is this how one works with large vectors of data?
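If serialize() did work for my data, I suppose the element-wise loop could be avoided entirely, since simple_encrypt() and simple_decrypt() already accept raw vectors of arbitrary length. A rough sketch of that idea (assuming the whole vector can be serialised; the names blob, sealed, restored are illustrative):

# Assumes the data can be serialised into a single raw vector
blob     <- serialize(data, connection = NULL)  # one raw vector for all coordinates
sealed   <- simple_encrypt(blob, pub)           # single public-key encryption
restored <- unserialize(simple_decrypt(sealed, key))
identical(restored, data)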