easeq / fasttext_inference


fastText inference with PySpark

Overview

This repo shows how to perform inference with a fastText model from PySpark, both via user-defined functions (UDFs) and via the RDD mapPartitions transformation.

For this illustrative example, we use the StackSample data for a use case where, given the (short) text of question titles, we want to predict their most probable tags. For more information about the data, the processing, and the training stage, you can refer to this tutorial.

Here we assume we have:

  • An already trained fastText model file.
  • A Parquet file containing only the text input, already processed and cleaned, ready for inference.

Please note that you can substitute these with another model file and another Parquet file for other use cases. In short, the aim of this repo is to show alternative ways to perform scalable inference on the Parquet file with a fastText model through PySpark.

Requirements

Make sure you have Docker installed on your machine. The Docker base image is this one, which provides a standard Miniconda installation (based on Python 3.7). As you can see from the Dockerfile, the following tools and libraries are installed when building the image.

Tools:

Python libraries:

Usage

Once you have chosen a suitable working directory on your local machine, clone this repo and change into the repo folder.

git clone https://github.com/pjcv89/fasttext_inference.git
cd fasttext_inference 

Now, build the image using the provided Dockerfile, and give it a name and a tag. For example:

docker image build -t fasttext_inference:0.1 .

Once you have built the image, you can use the container in two ways.

a) Using the command line

You can execute:

docker run --name inference -v $PWD:/fasttext_inference -it --entrypoint=/bin/bash fasttext_inference:0.1

In this mode, you can invoke spark-submit to run the inference.py and inference_mapp.py scripts from the command line, in the current working directory.

Both scripts take a Parquet file ready for inference, specified via the --input-file argument, and produce an output Parquet file with predictions, whose location is specified via the --output-file argument.
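For reference, the I/O skeleton shared by both scripts looks roughly like this (a minimal sketch, not the repo's exact code; argument handling and column names may differ):

import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument('--input-file', required=True)
parser.add_argument('--output-file', required=True)
parser.add_argument('--multi-pred', action='store_true')
args = parser.parse_args()

spark = SparkSession.builder.appName('fasttext_inference').getOrCreate()

# Read the text ready for inference, add a prediction column (via one of the
# approaches sketched below), and persist the result.
df = spark.read.parquet(args.input_file)
df.write.mode('overwrite').parquet(args.output_file)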

In both scripts we can choose between:

  • Retrieving single predictions (the prediction with the highest probability for each instance, whenever that probability is above a certain threshold).
  • Retrieving multiple predictions (those predictions whose probabilities are above a certain threshold, keeping at most k predictions per instance) via the --multi-pred flag.

Note that in either case, if the threshold condition is not met, a null value is returned. The defaults are currently threshold=0.10 and k=3. This logic is sketched below.
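In terms of the fastText Python API, the thresholding logic looks roughly like this (a sketch; the helper names predict_single and predict_multi are illustrative, not the repo's actual function names):

import fasttext

model = fasttext.load_model('models/ft_tuned.ftz')

def predict_single(text, threshold=0.10):
    # Top-1 label, or None when its probability does not reach the threshold.
    labels, probs = model.predict(text, k=1, threshold=threshold)
    return labels[0] if labels else None

def predict_multi(text, k=3, threshold=0.10):
    # Up to k labels whose probabilities reach the threshold, or None.
    labels, probs = model.predict(text, k=k, threshold=threshold)
    return list(labels) if labels else None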

Using the UDFs approach (inference.py script)

In this script we can choose between:

  • Using standard Spark UDFs (one-row-at-a-time execution of the UDF; this is the default behavior in the script).
  • Using pandas UDFs for PySpark (the UDF executes on chunks of pandas.Series), which are built on top of Apache Arrow, via the --use-arrow flag. Both variants are sketched below.
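As an illustration of the difference (a sketch assuming Spark 2.4-style scalar pandas UDFs and reusing the predict_single helper from the sketch above; the repo's actual functions live in classifier.py):

from pyspark.sql.functions import udf, pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

# Standard UDF: Python is invoked once per row.
predict_udf = udf(predict_single, StringType())

# Scalar pandas UDF: Python receives whole chunks as pandas.Series,
# exchanged with the JVM through Arrow.
@pandas_udf(StringType(), PandasUDFType.SCALAR)
def predict_pandas_udf(texts):
    return texts.apply(predict_single)

# Assuming the input column is named 'text':
df = df.withColumn('prediction', predict_pandas_udf('text'))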

For example, launching the following job will use the standard UDFs approach and retrieve single predictions:

spark-submit inference.py --input-file data/input.parquet --output-file data/output.parquet

Launching the following job will instead use the pandas UDFs approach and retrieve multiple predictions:

spark-submit inference.py --input-file data/input.parquet --output-file data/output.parquet --use-arrow --multi-pred
Using the RDD mapPartitions approach (inference_mapp.py script)

This approach is inspired by this discussion and follows a different logic, using Spark's powerful mapPartitions transformation.
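The core idea is to load the model once per partition and score all of that partition's rows in a single pass, instead of paying the per-row overhead of a UDF. A minimal sketch (assuming the model file has been shipped with addFile, as in the note on distributed settings below, and an input column named 'text'):

import fasttext
from pyspark import SparkFiles

def predict_partition(rows):
    # Load the model once per partition, not once per row.
    model = fasttext.load_model(SparkFiles.get('ft_tuned.ftz'))
    for row in rows:
        labels, probs = model.predict(row['text'], k=1, threshold=0.10)
        yield (row['text'], labels[0] if labels else None)

predictions = df.rdd.mapPartitions(predict_partition).toDF(['text', 'prediction'])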

For example, launching the following job will use the RDD's mapPartitions approach and retrieve multiple predictions:

spark-submit inference_mapp.py --input-file data/input.parquet --output-file data/output.parquet --multi-pred

b) Using the Jupyter notebooks

You can execute:

docker run --name inference -p 8080:8888 -v $PWD:/fasttext_inference fasttext_inference:0.1

Jupyter will be launched and you can go to http://localhost:8080/. Copy the token displayed on the command line and paste it into the Jupyter welcome page. You will then see the files contained in this repo, including the /notebooks folder, which you can open to start executing the notebooks.

Files and folders

The following files are provided:

  • Dockerfile: The Dockerfile to build the image.
  • classifier.py: Python file with the functions required to construct the UDFs, used by the inference.py script.
  • inference.py: Python script implementing the UDFs approach, to be executed via spark-submit.
  • inference_mapp.py: Python script implementing the RDD mapPartitions approach, to be executed via spark-submit.

The following folders are present:

  • /data: Contains the test and test_unlabeled text files, where the latter is just the unlabeled version of the former. It also contains the /input.parquet folder, where the input Parquet file built from test_unlabeled and ready for inference is located, and the /output.parquet folder, where the output Parquet file with predictions is persisted after executing any of the Python scripts.
  • /models: Contains the already trained fastText model, called ft_tuned.ftz.
  • /notebooks: Contains the following notebooks, which include prototyping code for the Python scripts and some performance tests. Names are self-explanatory.
  1. 00_Input_Data.ipynb: Notebook that shows how the input Parquet file was generated (a minimal sketch follows this list). You can view it here.
  2. 01_Standard_UDFs.ipynb: View it here.
  3. 02_Pandas_UDFs.ipynb: View it here.
  4. 03_RDDs_mapPartitions.ipynb: View it here.
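As a rough sketch of what that generation step amounts to (assuming one question title per line in test_unlabeled and an output column named 'text'):

df = (spark.read.text('data/test_unlabeled')
      .withColumnRenamed('value', 'text'))
df.write.mode('overwrite').parquet('data/input.parquet')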

BONUS: Using the RDD pipe approach

This approach is also inspired by this discussion and uses Spark's pipe method to call external processes. In this case, we use the fastText CLI tool to get predictions, through a shell script called within the pipe method.
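In outline, pipe streams each partition through an external process: every RDD element is written to the process's stdin as one line, and every line the process prints to stdout becomes an element of the resulting RDD. A minimal sketch (assuming an input column named 'text'; get_predictions.sh is the script from the /pipe folder, which wraps the fastText CLI):

# Feed one question title per line to the external process and collect
# one line of predicted labels per input line back.
texts = df.rdd.map(lambda row: row['text'])
predictions = texts.pipe('./pipe/get_predictions.sh')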

The /pipe folder includes the following files:

  • 04_RDDs_pipe.ipynb: Notebook that shows how to carry out this approach. View it here.
  • install_fasttext.sh: Shell script to build fastText from source and install the CLI tool. Used in the notebook.
  • get_predictions.sh: Shell script to be called within the pipe method. Used in the notebook.

Please refer to the following posts:

  1. Spark pipe: A one-pipe problem
  2. Pipe in Spark

Important note: Distributed settings

Please note that all examples here use Spark's local mode and client mode. For the UDFs approach shown here, in order to make the model file and the Python module available to the workers, we have included the following lines in the inference.py script:

spark.sparkContext.addFile('models/ft_tuned.ftz')
spark.sparkContext.addPyFile('./classifier.py')
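With the model file added this way, each worker can resolve its local copy through SparkFiles, e.g.:

import fasttext
from pyspark import SparkFiles

# Resolves to the local path of the copy Spark shipped to this worker.
model = fasttext.load_model(SparkFiles.get('ft_tuned.ftz'))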

However, in a distributed setting and in cluster mode, we would need to distribute these files across the nodes using the --files and --py-files options instead. See this question.

Resources

Engineering

  • Ideas used in this repo:
  1. Classifying text with fastText in pySpark
  2. Prediction at Scale with scikit-learn and PySpark Pandas UDFs
  3. Introducing Pandas UDFs for PySpark
  4. Pandas user-defined functions
  5. PySpark Usage Guide for Pandas with Apache Arrow

Science

  • fastText-related papers:
  1. Bag of Tricks for Efficient Text Classification
  2. Enriching Word Vectors with Subword Information
  3. FastText.zip: Compressing text classification models
  4. Misspelling Oblivious Word Embeddings
  • Papers about techniques used in fastText to improve scalability and training time:
  1. Hierarchical softmax based on the Huffman coding tree
  2. The hashing trick
