marinoandrea / disaster-tweets-brane

NLP pipeline for the Disaster Tweets Kaggle competition using the Brane framework.

Natural Language Processing with Disaster Tweets

Introduction

This project features an implementation of an NLP pipeline for the disaster tweets Kaggle competition using the Brane framework. The implementation is divided into the following Brane packages which can be imported individually and used in other workflows: compute, visualization, and utils.

  • compute exposes utilities for preprocessing data, training a classifier, and generating a valid submission file for the challenge.
  • visualization provides functions to generate plots and charts based on the dataset.
  • utils contains generic utility functions and specifically allows for downloading the dataset at runtime.

We also include a github.yml specification which defines an OpenAPI container that exposes a function to download arbitrary files from GitHub repositories.

Build

Each package can be individually imported with the following command:

brane import marinoandrea/disaster-tweets-brane -c packages/<PACKAGE_NAME>
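To import all three packages in one go, the command above can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the `import_all` function and the `DRY_RUN` variable are hypothetical names introduced here, and the package names are the ones listed in the Introduction. By default it only prints the commands, so it can be tried without a Brane installation.

```shell
#!/bin/sh
# Package names as listed in the Introduction.
PACKAGES="compute visualization utils"
REPO="marinoandrea/disaster-tweets-brane"

# Hypothetical helper: print (DRY_RUN=1, the default) or execute
# the import command for each package.
import_all() {
    for pkg in $PACKAGES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "brane import $REPO -c packages/$pkg"
        else
            brane import "$REPO" -c "packages/$pkg"
        fi
    done
}

import_all
```

Setting DRY_RUN=0 before calling the function runs the imports for real, which assumes the brane CLI is on the PATH.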

However, we also provide a shell script for convenience. Clone the repository and run ./build.sh all to build all of our packages. You can also run one of the following commands to build a specific package.

# build the computation package
./build.sh compute
# build the visualization package
./build.sh visualization
# build the utils package
./build.sh utils

Alternatively, navigate to a package directory and run the following command to build that Brane package.

brane build container.yml

Run

Our pipeline implementation can be executed locally or on a multi-node Kubernetes cluster by simply running the following command in the root folder of the project:

brane run -d <DFS_FOLDER> pipeline.bs
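Because the command has to be run from the root folder of the project, a small wrapper can catch the most common mistakes before Brane is invoked. This is a sketch under stated assumptions: `run_pipeline` and `DRY_RUN` are hypothetical names that do not exist in the repository, and the only checks made are that a DFS folder argument was given and that pipeline.bs is present in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: validate the inputs, then print (DRY_RUN=1, the default)
# or execute the run command shown above.
run_pipeline() {
    dfs_folder="$1"
    if [ -z "$dfs_folder" ]; then
        echo "usage: run_pipeline <DFS_FOLDER>" >&2
        return 2
    fi
    # The command is meant to be run from the project root, where pipeline.bs lives.
    if [ ! -f pipeline.bs ]; then
        echo "pipeline.bs not found: run this from the project root" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "brane run -d $dfs_folder pipeline.bs"
    else
        brane run -d "$dfs_folder" pipeline.bs
    fi
}
```

Calling run_pipeline <DFS_FOLDER> prints the command it would run; with DRY_RUN=0 it executes it instead, assuming the brane CLI is installed.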

The following picture shows an example run of the whole pipeline on a Kubernetes cluster, started from pipeline.bs.

[Image: Example run on a Kubernetes cluster]

NOTE: Brane may print some warnings about serialization issues to the console. These do not prevent the pipeline from running to completion.
