Apache Arrow

Modifications for Intel IAA

To use Arrow with the Intel IAA accelerator, build Arrow and QPL (the Intel Query Processing Library) separately.

Arrow build instructions:

git clone https://github.com/illinoisdata/arrow-qpl.git
mv arrow-qpl arrow
pushd arrow
git submodule update --init
export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
export ARROW_TEST_DATA="${PWD}/testing/data"
popd

mkdir dist

export ARROW_HOME=$(pwd)/dist
export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH

export QPL_HOME=$HOME/qpl_library
export CMAKE_PREFIX_PATH=$QPL_HOME:$CMAKE_PREFIX_PATH
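
Before running cmake, it can help to confirm that the exported variables point at real directories. The following is a minimal sketch, not part of Arrow or QPL: check_build_env is a hypothetical helper name, and it assumes QPL has already been cloned or built at the location given by QPL_HOME.

```shell
# Hypothetical pre-flight helper (not shipped with Arrow or QPL).
# Checks that ARROW_HOME and QPL_HOME are set and are directories.
check_build_env() {
    rc=0
    for name in ARROW_HOME QPL_HOME; do
        # Read the variable whose name is stored in $name (POSIX sh compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name is not set"
            rc=1
        elif [ ! -d "$value" ]; then
            echo "missing: $name=$value is not a directory"
            rc=1
        else
            echo "ok: $name=$value"
        fi
    done
    # Non-zero return means at least one variable needs attention.
    return "$rc"
}
```

Calling check_build_env right after the exports prints one line per variable; a non-zero return status means cmake would be configured against a missing path.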

mkdir arrow/cpp/build
pushd arrow/cpp/build

cmake -DCMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH \
        -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
        -DCMAKE_INSTALL_LIBDIR=lib \
        -DCMAKE_BUILD_TYPE=Debug \
        -DARROW_BUILD_TESTS=ON \
        -DARROW_COMPUTE=ON \
        -DARROW_CSV=ON \
        -DARROW_DATASET=ON \
        -DARROW_FILESYSTEM=ON \
        -DARROW_HDFS=ON \
        -DARROW_JSON=ON \
        -DARROW_PARQUET=ON \
        -DARROW_WITH_BROTLI=ON \
        -DARROW_WITH_BZ2=ON \
        -DARROW_WITH_LZ4=ON \
        -DARROW_WITH_SNAPPY=ON \
        -DARROW_WITH_ZLIB=ON \
        -DARROW_WITH_ZSTD=ON \
        -DARROW_WITH_QPL=ON \
        -DPARQUET_REQUIRE_ENCRYPTION=ON \
        -DARROW_EXTRA_ERROR_CONTEXT="ON" \
        ..

make -j8
sudo make install
popd
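
Once the install step has finished, a quick look at the install prefix shows whether the libraries landed where the later steps expect them. This is a sketch under assumptions: check_arrow_install is a hypothetical helper, and it assumes a Linux shared-library build (the .so suffix) with CMAKE_INSTALL_LIBDIR=lib as configured above.

```shell
# Hypothetical post-install helper (not shipped with Arrow).
# Usage: check_arrow_install [prefix]   (prefix defaults to $ARROW_HOME)
check_arrow_install() {
    prefix=${1:-$ARROW_HOME}
    missing=0
    # Libraries expected from the ARROW_PARQUET=ON and ARROW_DATASET=ON options.
    for lib in libarrow libparquet libarrow_dataset; do
        if [ -e "$prefix/lib/$lib.so" ]; then
            echo "found: $prefix/lib/$lib.so"
        else
            echo "not found: $prefix/lib/$lib.so"
            missing=1
        fi
    done
    # Non-zero return means at least one library is absent.
    return "$missing"
}
```

Running check_arrow_install with no argument inspects $ARROW_HOME/lib; any "not found" line suggests the build or install step should be revisited before building the Python bindings or the examples.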

To build the Python bindings (pyarrow) as well:

python3 -m venv pyarrow-dev
source ./pyarrow-dev/bin/activate
pip install -r arrow/python/requirements-build.txt
pip install ipykernel

pushd arrow/python
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=1
export PYARROW_WITH_SNAPPY=1
export PYARROW_WITH_ZLIB=1
export PYARROW_WITH_QPL=1
export PYARROW_WITH_ZSTD=1
export PYARROW_WITH_BZ2=1
export PYARROW_WITH_BROTLI=1
export PYARROW_WITH_LZ4=1
export PYARROW_WITH_HDFS=1
export PYARROW_WITH_CSV=1
export PYARROW_WITH_JSON=1
export PYARROW_PARALLEL=8
export PYARROW_WITH_PARQUET_ENCRYPTION=1
python setup.py build_ext --inplace
popd

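QPL build instructions (build QPL first if it is not already installed, since the Arrow cmake step above is given QPL_HOME through CMAKE_PREFIX_PATH):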

git clone --recursive https://github.com/intel/qpl.git ./qpl_library
cd qpl_library
mkdir build
cd build

mkdir ../qpl_installation
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=../qpl_installation ..
cmake --build . --target install

# Configure the IAA device (only needed when using the hardware path):
sudo python3 /home/<USER>/qpl_library/qpl_installation/share/QPL/scripts/accel_conf.py --load=/home/<USER>/qpl_library/qpl_installation/share/QPL/configs/1n1d1e1w-s-n2.conf
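
After loading a configuration, it is worth checking that the kernel actually exposes an IAA device. The sketch below rests on an assumption that should be verified on your system: that the Linux idxd driver lists IAA devices as iax* entries under /sys/bus/dsa/devices, each with a state file. list_iaa_devices is a hypothetical helper, and the optional argument exists only so the directory can be overridden.

```shell
# Hypothetical helper (not part of QPL): list IAA devices and their state.
# Assumes the idxd driver exposes them as /sys/bus/dsa/devices/iax*.
# Usage: list_iaa_devices [sysfs_dir]
list_iaa_devices() {
    root=${1:-/sys/bus/dsa/devices}
    count=0
    for dev in "$root"/iax*; do
        # Skip the literal pattern when nothing matches.
        [ -e "$dev" ] || continue
        state=unknown
        if [ -r "$dev/state" ]; then
            state=$(cat "$dev/state")
        fi
        echo "$(basename "$dev"): $state"
        count=$((count + 1))
    done
    # Succeed only if at least one IAA device was found.
    [ "$count" -gt 0 ]
}
```

If list_iaa_devices prints nothing and returns non-zero, the hardware path is unlikely to work, and the software path remains available for testing.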

Testing

Before testing the arrow-qpl integration, check that QPL runs correctly on its own. You can do this by running:

cd ~/qpl_library/examples/low-level-api
g++ -I"$HOME/qpl_library/qpl_installation/include" -o compression_example compression_example.cpp "$HOME/qpl_library/qpl_installation/lib/libqpl.a" -ldl
sudo ./compression_example software_path

Any issues in the above step need to be fixed before moving forward.

The main test program is arrow/cpp/examples/parquet/parquet_arrow/reader-writer.cc. It creates a table, writes it to disk as a Parquet file compressed with QPL, then reads the file back and decompresses it, also with QPL. This currently works with both the software path (no accelerator) and the hardware path (IAA accelerator).

To build and run it (if you change any source code in the main Arrow repository, rebuild Arrow before running the following):

cd arrow/cpp/examples/parquet/parquet_arrow
mkdir qpl_build
cd qpl_build
cmake ..
make
./parquet-arrow-example

Testing compression/decompression performance on TPCH data:

Details are given in arrow/python/TPCH_README.md.

What's done

Compression/Decompression with QPL. The new compression codec is in cpp/src/arrow/util/compression_qpl.cc and any relevant files in the repository have been modified accordingly. Testing details are given above.
Testing performance speedup on TPCH data for compression/decompression
Loading TPCH data in C++ for the filtering step

Work in progress

Testing filtering speedup in QPL compared to arrow

---------------- Original Apache Arrow README continues from this point onwards ----------------

Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Major components of the project include:

The Arrow Columnar In-Memory Format: a standard and efficient in-memory representation of various datatypes, plain or nested
The Arrow IPC Format: an efficient serialization of the Arrow format and associated metadata, for communication between processes and heterogeneous environments
The Arrow Flight RPC protocol: based on the Arrow IPC format, a building block for remote services exchanging Arrow data with application-defined semantics (for example a storage server or a database)
C++ libraries
C bindings using GLib
C# .NET libraries
Gandiva: an LLVM-based Arrow expression compiler, part of the C++ codebase
Go libraries
Java libraries
JavaScript libraries
Python libraries
R libraries
Ruby libraries
Rust libraries

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
IO interfaces to local and remote filesystems
Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
Conversions to and from other in-memory data structures
Readers and writers for various widely-used file formats (such as Parquet, CSV)

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git main.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved:

Join the mailing list: send an email to dev-subscribe@arrow.apache.org. Share your ideas and use cases for the project.
Follow our activity on GitHub issues
Learn the format
Contribute code to one of the reference implementations
