Purpose of this repository

This repository was created to try out PySpark (Jupyter Notebook) in a local (standalone) environment.
I hope you find it useful as a PySpark (Jupyter Notebook) learning environment.

Execution screenshots

Running in Visual Studio Code

Result output on localhost

localhost:4040

Directory structure

.
├── LICENSE
├── README.md
├── csv                                    Sample data location
├── images                                 Image files for the README
├── json                                   Sample data location
├── nb_csv_products_upsert.ipynb           UPSERT sample
├── nb_csv_vehicles_add_deleteflg.ipynb    Physical delete & delete_flg assignment sample
├── nb_csv_vehicles_phisical_delete.ipynb  Physical delete sample
├── nb_json_flatten_customers.ipynb        JSON flattening sample
├── nb_json_noflatten_customers.ipynb      JSON sample
├── packages                               Custom modules for PySpark
│   └── modules.py
└── tables                                 Delta tables
    ├── customers_flatten                  Sample that INSERTs flattened JSON data into a Delta table
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── address_state=三重県
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    ├── customers_noflatten                Sample that INSERTs JSON data into a Delta table
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── zodiacSign=Aquarius
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    ├── products                           UPSERT sample
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── kind=Electronic
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    ├── products_tmp                       Temporary table sample for UPSERT
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── kind=Electronic
    │   │   └── part-hoge.snappy.parquet
    │   └── kind=Unbranded
    │       └── part-hoge.snappy.parquet
    ├── vehicles                           Sample that physically deletes records missing from the source data
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── type=Coupe
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    ├── vehicles_add_deleteflg             Sample that logically deletes records missing from the source data by setting delete_flg
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── type=Coupe
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    ├── vehicles_add_deleteflg_tmp         Temporary table sample for logically deleting records missing from the source data by setting delete_flg
    │   ├── _delta_log
    │   │   ├── 00000000000000000000.json
    │   │   └── 00000000000000000001.json
    │   ├── type=Coupe
    │   │   ├── part-hoge.snappy.parquet
    │   │   └── part-hoge.snappy.parquet
    │   └── ...
    └── vehicles_tmp                       Temporary table sample for physically deleting records missing from the source data
        ├── _delta_log
        │   └── 00000000000000000000.json
        ├── type=1
        │   └── part-hoge.snappy.parquet
        ├── type=Coupe
        │   └── part-hoge.snappy.parquet
        ├── type=Hatchback
        │   └── part-hoge.snappy.parquet
        └── type=Wagon
            └── part-hoge.snappy.parquet
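
The partition directories in the tree above follow the `column=value` naming pattern (for example `type=Coupe`). As a rough sketch of how such a path maps back to partition columns, the hypothetical helper `parse_partition_path` below (not part of this repository) splits each path segment on the first `=`. It assumes plain, unescaped directory names.

```python
from pathlib import PurePosixPath


def parse_partition_path(path: str) -> dict[str, str]:
    """Collect column=value path segments into a dict (hypothetical helper)."""
    partitions = {}
    for segment in PurePosixPath(path).parts:
        # Only segments containing "=" are treated as partition directories.
        if "=" in segment:
            column, value = segment.split("=", 1)
            partitions[column] = value
    return partitions


print(parse_partition_path("tables/vehicles/type=Coupe/part-hoge.snappy.parquet"))
# prints {'type': 'Coupe'}
```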

About environment setup

The environment is built with Devcontainer.
For more information about Devcontainer, please refer to Developing inside a Container.

Prerequisites

You must install the following software on your local PC:

Also, install the following extension in Visual Studio Code:

Dev Containers

Preparation

Apache Spark runs on Java.
Therefore, after installing Java, you need to set the environment variable JAVA_HOME.
Create .devcontainer/devcontainer.env and add the following setting to it.
You can use the existing file .devcontainer/devcontainer_template.env as a starting point.
At the time of writing, the setting is as follows:

JAVA_HOME=/usr/lib/jvm/java-17-openjdk-arm64

If Apache Spark does not work, review this setting.
You can look up the Java installation path as follows:

sudo update-alternatives --config java

There is 1 choice for the alternative java (providing /usr/bin/java).

  Selection    Path                                         Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-17-openjdk-amd64/bin/java   1711      auto mode
  1            /usr/lib/jvm/java-17-openjdk-amd64/bin/java   1711      manual mode

Press <enter> to keep the current choice[*], or type selection number:
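
The paths listed by update-alternatives point at the java binary, while JAVA_HOME is expected to be the installation directory above it (the path without the trailing /bin/java). A minimal Python sketch of that conversion follows; `java_home_from_binary` is a hypothetical helper name, and the example path is taken from the output above.

```python
from pathlib import PurePosixPath


def java_home_from_binary(java_binary: str) -> str:
    """Return the JDK directory for a path of the form <jdk>/bin/java."""
    path = PurePosixPath(java_binary)
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/java, got {java_binary!r}")
    # Strip the trailing "bin/java" to get the installation directory.
    return str(path.parent.parent)


binary = "/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
print(f"JAVA_HOME={java_home_from_binary(binary)}")
# prints JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
```

The printed line can then be copied into .devcontainer/devcontainer.env, adjusted to the architecture of your machine.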

Launch the local environment

Launch Docker Desktop.

Click the icon in the lower-left corner of Visual Studio Code.
Click "Reopen in Container".

Wait for the build to start and finish.

Run a notebook

After the build finishes, open a file with the .ipynb extension and run it.

Execution results can be viewed at localhost:4040.
You can check the logs and error details there.
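
The page on port 4040 is served by the running Spark application, so it is typically reachable only while a notebook's Spark session is active. As a quick check, the sketch below tests whether anything is listening on that port. `is_port_open` is a hypothetical helper, and the host and port are assumed to be the defaults mentioned above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 4040, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False


if is_port_open():
    print("Something is listening on http://localhost:4040")
else:
    print("Nothing is listening on port 4040; is a Spark session running?")
```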

Close the Devcontainer

Click the icon in the lower-left corner of Visual Studio Code.
Click "Close Remote Connection".

Notes

About temporary tables

Data in unmanaged tables cannot be physically deleted.
Data in managed tables can be physically deleted.
* See the references below for details.

About test data

The data under ./json is test data created with Faker, and all personal information in it is fictitious.

About java.lang.OutOfMemoryError: Java heap space

If the test data is too large, processing stops with java.lang.OutOfMemoryError: Java heap space.
This is only a verification environment, so take care not to process very large amounts of data.
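
One way to avoid hitting the heap limit is to check how much input data a directory holds before loading it. The sketch below sums file sizes under a directory using only the standard library. The helper names `total_size_bytes` and `is_small_enough` and the 100 MB threshold are illustrative assumptions, not values from this repository.

```python
from pathlib import Path

# Illustrative threshold (assumption): adjust to the memory available to the container.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def total_size_bytes(directory: str) -> int:
    """Sum the sizes of all regular files below the directory."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def is_small_enough(directory: str, limit: int = MAX_INPUT_BYTES) -> bool:
    """Return True if the directory's total size does not exceed the limit."""
    return total_size_bytes(directory) <= limit


if __name__ == "__main__":
    # ./json and ./csv are the sample data directories of this repository.
    for data_dir in ("./json", "./csv"):
        if Path(data_dir).is_dir():
            size = total_size_bytes(data_dir)
            verdict = "ok" if size <= MAX_INPUT_BYTES else "possibly too large"
            print(f"{data_dir}: {size} bytes ({verdict})")
```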

References

What to do when you cannot write to a Delta table

Koalas / pyspark Failed to find data source: delta

Setting JAVA_HOME as an environment variable

Flattening JSON

Analyze complex data types in Azure Synapse Analytics

Other useful tips

Data

About the parquet-viewer extension

Note that with the default configuration of parquet-viewer, opening a parquet file fails with an error like the following, and the file cannot be viewed.

{
  "error": "while reading /workspace/tables/customers_flatten/address_state=愛知県/part-00000-a330616d-9cc0-493a-ac8f-d3d1bed5663d.c000.snappy.parquet: Error: cannot find prebuilt arrow module, either build the module or use another backend: Error: Failed to find binding for arrow-parquet-reader"
}

Therefore, the setting in .vscode/settings.json has been changed as follows:

# before
{
  "parquet-viewer.backend": "arrow"
}

# after
{
  "parquet-viewer.backend": "parquets"
}

About

A PySpark (Jupyter Notebook) environment built with Devcontainer. You can try PySpark (Jupyter Notebook) right away.

MIT License
