sparklyr / sparklyr

R interface for Apache Spark

Home Page: https://spark.rstudio.com/

Azure Databricks Connection Issue

aaronball85 opened this issue

First, I appreciate the work Posit and Databricks have done on databricks_connect.

I followed the Databricks Connect documentation to connect to our Azure Databricks workspace. Well-written documentation! I also appreciated James and Rafi's presentation at posit::conf(2023).

After installing the required packages and locating the needed IDs in Databricks, I attempted a connection and received the error "Error: Unable to find conda binary. Is Anaconda installed?". The documentation (linked above) lists this among its common errors, so I followed the recommended action: I installed the dev version of the pysparklyr package and attempted the connection again. That produced a different error:

Connection Code Run:

sc <- spark_connect(
  master = my_azure_databricks_workspace_url, 
  cluster_id = my_cluster_id, # cluster runtime is 14.0
  token = my_databricks_pat,
  method = "databricks_connect",
  dbr_version = "14.0"
)

Error Message:

! Retrieving version from cluster 'my_cluster_id'
Error in `cluster_dbr_info()`:
! Issues connecting to Databricks. Currently using:
|-- Host: 'my_azure_databricks_workspace_url'
|-- Cluster ID: 'my_cluster_id'
|-- Token: '<REDACTED>'
Error message: "Error : Timeout was reached: [my_azure_databricks_workspace_url] Failed to
connect to my_azure_databricks_workspace_url port 80 after 10004 ms: Timeout was reached "

Azure Databricks Instance Names are of the following form:
adb-5555555555555555.55.azuredatabricks.net

Under Advanced Options for my cluster in Databricks, the port listed is NOT port 80, and I don't see an option to change it. I also don't know whether spark_connect() has an option for setting the port, or whether the port is the underlying issue at all.
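One possible reading of the port 80 timeout, offered as a guess rather than a confirmed diagnosis: when a host is given without a scheme (or with http://), libcurl-based clients default to plain HTTP on port 80, while Databricks workspaces are served over HTTPS. The sketch below uses a hypothetical helper, normalize_databricks_host(), which is not part of sparklyr or pysparklyr, to make sure the workspace URL carries an explicit https:// prefix before it is passed as master.

```r
# Hypothetical helper (NOT part of sparklyr or pysparklyr): ensure the
# workspace host has an explicit https:// scheme, so the HTTP client does
# not fall back to plain HTTP on port 80.
normalize_databricks_host <- function(host) {
  stopifnot(is.character(host), length(host) == 1, nzchar(host))
  host <- trimws(host)
  # Remove any existing scheme, so that http:// is upgraded to https://
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", host)
  # Remove trailing slashes
  host <- sub("/+$", "", host)
  paste0("https://", host)
}

# Example with the placeholder instance name from this issue
normalized_host <- normalize_databricks_host(
  "adb-5555555555555555.55.azuredatabricks.net"
)
print(normalized_host)
```

The result could then be passed as master = normalized_host in the spark_connect() call shown earlier, with the other arguments unchanged. Whether this resolves the timeout depends on the actual cause, which this thread does not establish.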

Session Info:
───────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Ventura 13.5.2
system x86_64, darwin17.0
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Boise
date 2023-10-13
rstudio 2023.06.2+561 Mountain Hydrangea (desktop)
pandoc NA

─ Packages ─────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.2.0)
broom 1.0.5 2023-06-09 [1] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
callr 3.7.2 2022-08-22 [1] CRAN (R 4.2.0)
car 3.1-1 2022-10-19 [1] CRAN (R 4.2.0)
carData 3.0-5 2022-01-06 [1] CRAN (R 4.2.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
config 0.3.1 2020-12-17 [1] CRAN (R 4.2.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
curl 5.0.0 2023-01-12 [1] CRAN (R 4.2.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.2.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
dbplyr 2.3.1 2023-02-24 [1] CRAN (R 4.2.0)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.0)
digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.2.0)
dtplyr 1.3.0 2023-02-24 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.0)
fs 1.6.3 2023-07-20 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.2.0)
ggpubr 0.6.0 2023-02-10 [1] CRAN (R 4.2.0)
ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.2.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.2.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.0)
httpuv 1.6.6 2022-09-08 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
httr2 0.2.2 2022-09-25 [1] CRAN (R 4.2.0)
janitor 2.1.0 2021-01-05 [1] CRAN (R 4.2.0)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
keyring 1.3.1 2022-10-27 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.2.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
Matrix 1.5-1 2022-09-13 [1] CRAN (R 4.2.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
pacman * 0.5.1 2019-03-11 [1] CRAN (R 4.2.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0)
pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
pkgload 1.3.1 2022-10-28 [1] CRAN (R 4.2.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.0)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.2.0)
pysparklyr * 0.1.9001 2023-10-12 [1] Github (mlverse/pysparklyr@1bdccf6)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.2.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.2.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.2.0)
remotes 2.4.2.1 2023-07-18 [1] CRAN (R 4.2.0)
reticulate 1.32.0 2023-09-11 [1] CRAN (R 4.2.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0)
rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.2.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.2.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.0)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.2.0)
sparklyr * 1.8.3 2023-09-02 [1] CRAN (R 4.2.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.2.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.2.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.0)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0)

[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library

─ Python configuration ──────────────────
python: /Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0/bin/python
libpython: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/config-3.11-darwin/libpython3.11.dylib
pythonhome: /Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0:/Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0
version: 3.11.2 (v3.11.2:878ead1ac1, Feb 7 2023, 10:02:41) [Clang 13.0.0 (clang-1300.0.29.30)]
numpy: /Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0/lib/python3.11/site-packages/numpy
numpy_version: 1.26.0
databricks: /Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0/lib/python3.11/site-packages/databricks

NOTE: Python version was forced by use_python() function

Hi, what if you use my_azure_databricks_workspace_url:[port number] as your Host URL?

Results of using my_azure_databricks_workspace_url:[port number] as the Host URL:

New error message:

✔ Using the 'r-sparklyr-databricks-14.0' Python environment (/Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0/bin/python)
Error in `cluster_dbr_info()`:
! Issues connecting to Databricks. Currently using:
|-- Host: 'my_azure_databricks_workspace_url:port_number'
|-- Cluster ID: 'my_cluster_id'
|-- Token: '<REDACTED>'
Error message: "Error : Recv failure: Connection reset by peer "

A second (and subsequent) attempt yields a different error:

✔ Using the 'r-sparklyr-databricks-14.0' Python environment (/Users/my_username/.virtualenvs/r-sparklyr-databricks-14.0/bin/python)
Error in `cluster_dbr_info()`:
! Issues connecting to Databricks. Currently using:
|-- Host: 'my_azure_databricks_workspace_url:port_number'
|-- Cluster ID: 'my_cluster_id'
|-- Token: '<REDACTED>'
Error message: "Error : Empty reply from server "

Googling the error, I found this Stack Overflow discussion, but I'm not sure where to go from here. Thank you for your help.

I'm so sorry I missed your reply. The next thing to get here is the actual code you would use to connect via Python with their databricks.connect library. However the Host is passed to create the new session there, the same value should work in sparklyr.