ethan-homan / FlashTextSpark

Spark wrapper around jasonsperske's Java port of flashtext.py

FlashTextSpark

Introduces SparkKeywordProcessor, a thin Scala wrapper around jasonsperske's FlashTextJava library, which is itself a Java port of flashtext.py.

The motivation was to run FlashText on Spark to efficiently tag millions of unstructured documents with matches against a large corpus of keywords (also in the millions).
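To give a feel for why this kind of matching scales to large keyword sets, here is a minimal sketch of trie-based keyword extraction, the general idea behind FlashText. It is not the FlashTextJava or SparkKeywordProcessor API: the class `KeywordTrieSketch` and its methods `addKeyword` and `extractKeywords` are hypothetical names made up for illustration, and the word-boundary rule (letters, digits and underscore count as word characters) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a hypothetical class, not part of FlashTextJava or FlashTextSpark.
public class KeywordTrieSketch {

    // One trie node: children keyed by character, plus a clean name if a keyword ends here.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String cleanName; // null unless a keyword terminates at this node
    }

    private final Node root = new Node();

    // Register a keyword (matched case-insensitively) and the name to report when it is found.
    public void addKeyword(String keyword, String cleanName) {
        Node node = root;
        for (char c : keyword.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.cleanName = cleanName;
    }

    // Assumed boundary rule: letters, digits and '_' are word characters.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Scan the text once, reporting the longest keyword that starts at each word start.
    // Keywords that begin with a non-word character are never matched by this sketch.
    public List<String> extractKeywords(String text) {
        List<String> found = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        int n = lower.length();
        int i = 0;
        while (i < n) {
            boolean atWordStart = isWordChar(lower.charAt(i))
                    && (i == 0 || !isWordChar(lower.charAt(i - 1)));
            if (!atWordStart) {
                i++;
                continue;
            }
            Node node = root;
            String longest = null;
            int longestEnd = -1;
            int j = i;
            // Walk the trie for as long as the text keeps matching some keyword prefix.
            while (j < n && (node = node.children.get(lower.charAt(j))) != null) {
                j++;
                boolean atWordEnd = j == n || !isWordChar(lower.charAt(j));
                if (node.cleanName != null && atWordEnd) {
                    longest = node.cleanName;
                    longestEnd = j;
                }
            }
            if (longest != null) {
                found.add(longest);
                i = longestEnd; // resume scanning after the matched keyword
            } else {
                i++;
            }
        }
        return found;
    }
}
```

Because every keyword lives in one shared trie, each document is scanned once and the work per document depends mostly on its length, not on how many keywords are loaded. That property is what makes tagging against millions of keywords plausible.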

Building

Clone the repo and, if you are on UNIX, run:

./gradlew build

or on Windows:

.\gradlew.bat build

This bootstraps the project with all of its dependencies; the only prerequisite is an installed Java 8.

License: MIT License


Languages

Java 82.5%, Scala 17.5%