michelou / spark-examples

Playing with Spark on Windows

This repository gathers Spark code examples coming from various websites and books.
It also includes several build scripts (Bash scripts, batch files, Make scripts) for experimenting with Spark on a Windows machine.

Read the document "What is Apache Spark™?" from the Spark documentation to learn more about the Spark ecosystem.

Ada, Akka, C++, Dart, Deno, Docker, Flix, Golang, GraalVM, Haskell, Kafka, Kotlin, LLVM, Modula-2, Node.js, Rust, Scala 3, Spring, TruffleSqueak and WiX Toolset are other trending topics we are continuously monitoring.

Project dependencies

This project depends on two external software packages for the Microsoft Windows platform:

Optionally one may also install the following software:

Installation policy
When possible we install software from a Zip archive rather than via a Windows installer. In our case we defined C:\opt\ as the installation directory for optional software tools (similar to the /opt/ directory on Unix).

For instance, our development environment looks as follows (March 2024) [2]:

C:\opt\apache-maven\                       ( 10 MB)
C:\opt\Git\                                (367 MB)
C:\opt\gradle\                             (140 MB)
C:\opt\jdk-temurin-11.0.22_7\              (303 MB)
C:\opt\jdk-temurin-17.0.10_7\              (301 MB)
C:\opt\jdk-temurin-21.0.2_13\              (326 MB)
C:\opt\sbt\                                (135 MB)
C:\opt\scala-2.13.12\                      ( 24 MB)
C:\opt\spark-3.5.1-bin-hadoop3\            (423 MB)
C:\opt\spark-3.5.1-bin-hadoop3-scala2.13\  (432 MB)
C:\opt\VSCode\                             (352 MB)

🔎 Git for Windows provides a BASH emulation used to run git from the command line (as well as over 250 Unix commands like awk, diff, file, grep, more, mv, rmdir, sed and wc).

Directory structure

This project is organized as follows:

bin\
docs\
examples\{README.md, HelloWorld, etc.}
README.md
QUICKREF.md
RESOURCES.md
setenv.bat

where

We also define a virtual drive (e.g. drive K:) in our working environment in order to shorten and hide the real path of our project directory (see article "Windows command prompt limitation" from Microsoft Support).

🔎 We use the Windows external command subst to create virtual drives; for instance:

> subst K: %USERPROFILE%\workspace\spark-examples

In the next section we give a brief description of the batch files present in this project.

Batch/Bash commands

setenv.bat [3]

We execute command setenv.bat once to set up our development environment; it makes external tools such as mvn.cmd, sbt.bat or sh.exe directly available from the command prompt.

> setenv
Tool versions:
   java 11.0.22, sbt 1.9.9, scalac 2.13.12, spark-shell 3.5.1,
   gradle 8.6, mvn 3.9.6,
   git 2.44.0.windows.1, diff 3.10, bash 5.2.26(1)-release

> where mvn sbt sh
C:\opt\apache-maven\bin\mvn
C:\opt\apache-maven\bin\mvn.cmd
C:\opt\Git\bin\sh.exe
C:\opt\Git\usr\bin\sh.exe
C:\opt\sbt\bin\sbt
C:\opt\sbt\bin\sbt.bat
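
A similar availability check can be run from the Bash emulation that ships with Git for Windows. The following is only a minimal sketch: the function name check_tools is hypothetical and not part of this repository, and it relies solely on the shell builtin command -v.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): report where each
# requested tool is found on PATH, roughly like `where` on Windows.
check_tools() {
  local missing=0 tool path
  for tool in "$@"; do
    if path="$(command -v "$tool")"; then
      # Tool found: print its name and resolved location.
      printf '%s: %s\n' "$tool" "$path"
    else
      # Tool not found: report on stderr and remember the failure.
      printf '%s: not found\n' "$tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call; the `||` branch keeps the script going if a tool is missing.
check_tools mvn sbt sh || echo "some tools are missing from PATH"
```

Unlike where, command -v reports only the first match on PATH for each tool, so the output lists one location per name.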

Footnotes

[1] Scala 2.13 Support

Spark 3.2.0 and newer add support for Scala 2.13 (see PR#34218).

[2] Downloads

In our case we downloaded the following installation files (see section 1):
apache-maven-3.9.6-bin.zip                         ( 10 MB)
gradle-8.6-bin.zip                                 (118 MB)
OpenJDK11U-jdk_x64_windows_hotspot_11.0.22_7.zip   (194 MB)
OpenJDK17U-jdk_x64_windows_hotspot_17.0.10_7.zip   (191 MB)
OpenJDK21U-jdk_x64_windows_hotspot_21.0.2_13.zip   (191 MB)
PortableGit-2.44.0-64-bit.7z.exe                   ( 41 MB)
sbt-1.9.9.zip                                      ( 17 MB)
scala-2.13.12.zip                                  ( 21 MB)
spark-3.5.1-bin-hadoop3.tgz                        (285 MB)
spark-3.5.1-bin-hadoop3-scala2.13.tgz              (292 MB)

[3] setenv.bat usage

Batch file setenv.bat sets specific environment variables that make command-line developer tools easier to use.
It is similar to the setup scripts described on the page "Visual Studio Developer Command Prompt and Developer PowerShell" of the Visual Studio online documentation.
For instance, we can quickly check that the two scripts Launch-VsDevShell.ps1 and VsDevCmd.bat are indeed available in our Visual Studio 2019 installation:
> where /r "C:\Program Files (x86)\Microsoft Visual Studio" *vsdev*
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\Launch-VsDevShell.ps1
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_end.bat
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_start.bat
Concretely, in our GitHub projects which depend on Visual Studio (e.g. michelou/cpp-examples), setenv.bat invokes VsDevCmd.bat (resp. vcvarsall.bat for older Visual Studio versions) to set up the Visual Studio tools on the command prompt.

mics/March 2024  
