Schema Evolution Exploration

NOTE: This repo is not up-to-date.

These notebooks were created to test out how schema evolution works in both Spark and Presto.

There are six modes of reading parquet for our use case:

  1. Spark
  2. Spark with mergeSchema = True (modes 1 and 2 are sketched after this list)
  3. parquet2hive + Presto old parquet reader (hive.parquet-optimized-reader.enabled=false)
     1. Using column index (hive.parquet.use-column-names=false)
     2. Using column names (hive.parquet.use-column-names=true)
  4. parquet2hive + Presto optimized parquet reader (hive.parquet-optimized-reader.enabled=true)
     1. Using column index (hive.parquet.use-column-names=false)
     2. Using column names (hive.parquet.use-column-names=true)
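
Here's a minimal sketch of modes 1 and 2 in PySpark. The /tmp paths and toy schemas are made up for illustration; the notebooks in this repo use their own test data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two batches of the same dataset with different schemas: the second
# batch adds a `score` column.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("overwrite").parquet("/tmp/evo/batch1")
spark.createDataFrame([(2, "b", 3.0)], ["id", "name", "score"]) \
    .write.mode("overwrite").parquet("/tmp/evo/batch2")

# Mode 1: plain read. Spark takes the schema from a single footer, so
# `score` may or may not appear depending on which file is sampled.
spark.read.parquet("/tmp/evo/batch1", "/tmp/evo/batch2").printSchema()

# Mode 2: mergeSchema = True. Spark reconciles all footers into a
# superset schema; rows from batch1 come back with score = null.
spark.read.option("mergeSchema", "true") \
    .parquet("/tmp/evo/batch1", "/tmp/evo/batch2").printSchema()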

For each scenario, I've tested all of these modes. The scenarios I've covered are:

  1. Adding a column
  2. Removing a column
  3. Renaming a column (sketched after this list)
  4. Replacing a column (new type)
  5. Transposing columns
  6. Transposing with add and remove
  7. Adding a new subcolumn to a nested Row type
  8. Removing a subcolumn from a nested Row type
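
As an example of how a scenario is set up, here's a minimal sketch of scenario 3 (renaming a column); the paths and data are again made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# v1 calls the second column `name`; v2 renames it to `full_name`.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("overwrite").parquet("/tmp/rename/v1")
spark.createDataFrame([(2, "b")], ["id", "full_name"]) \
    .write.mode("overwrite").parquet("/tmp/rename/v2")

# With mergeSchema, Spark unions columns by name, so the result has
# *both* `name` and `full_name`, each null where a file lacks it.
spark.read.option("mergeSchema", "true") \
    .parquet("/tmp/rename/v1", "/tmp/rename/v2").show()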

Note that modes 1 and 2 are repeated in both notebooks. Any discrepancies should be pointed out immediately.

Running the Notebook

You'll need parquet2hive_server running on the machine that hosts the Hive metastore (probably the Presto cluster master, unless you've done some custom configuration, in which case you already know what you're doing).

sudo pip install parquet2hive_server
start_parquet2hive_server

Make sure the .ipynb notebook contains the correct DNS entry for the associated Hive metastore.

Then run the notebook. These notebooks DO error; that is expected behavior, but it halts Jupyter when it runs the cells sequentially. You can either run the cells manually, or run the following to produce the executed output notebook:

git clone https://github.com/fbertsch/schema_evolution_exploration.git
cd parquet-optimized-reader\=true/use-column-names\=true/ # go to the dir with appropriate presto config options
export NOTEBOOK_NAME="Schema_Evolution_Exploration.ipynb"
rm -f $NOTEBOOK_NAME # remove any previously generated copy
export PYSPARK_DRIVER_PYTHON=jupyter # run the pyspark driver through jupyter
PYSPARK_DRIVER_PYTHON_OPTS="nbconvert --ExecutePreprocessor.timeout=-1 --ExecutePreprocessor.kernel_name=python2 --to notebook --log-level=10 --execute \"../../${NOTEBOOK_NAME}\" --allow-errors --output-dir ./ " \
pyspark

The resulting notebook, errors included, will be written to the directory you ran from (the --output-dir ./ given above).

Setting Presto Options

By default, Presto maps Hive columns to parquet columns by index (ordinal position). This way, even if a column is renamed, the earlier data is still read, since the mapping is positional rather than by name. For example, if the second column name is renamed to full_name, index-based mapping keeps reading the old files' second column as full_name, whereas name-based mapping returns NULL for full_name in the old files. To instead always map by column name (as defined in the parquet schema), set hive.parquet.use-column-names=true.

Presto also uses an outdated parquet reader by default. Set hive.parquet-optimized-reader.enabled=true to use the new reader, which handles schema evolution differently.

echo 'hive.parquet.use-column-names=true' | sudo tee -a /etc/presto/conf.dist/catalog/hive.properties
echo 'hive.parquet-optimized-reader.enabled=true' | sudo tee -a /etc/presto/conf.dist/catalog/hive.properties
sudo restart presto-server
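
Once Presto is restarted, you can sanity-check a parquet2hive-registered table from Python. A minimal sketch using pyhive (not a dependency of this repo; the host, port, and my_table name are placeholders):

from pyhive import presto

# Hypothetical connection details; adjust for your cluster.
conn = presto.connect(host="localhost", port=8080, catalog="hive", schema="default")
cursor = conn.cursor()
# `my_table` is a placeholder for a table registered via parquet2hive.
cursor.execute("SELECT * FROM my_table LIMIT 5")
print(cursor.fetchall())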

About

Explore schema evolution using parquet and Spark or Presto

License: Mozilla Public License 2.0

