Streaming Machine Translation on Apache Flink pipelines leveraging Apache OpenNLP
The NMT model has been trained offline using a parallel WMT corpus for German and English, uses Byte-Pair Encoding (BPE) to account for unseen vocabulary. The model has been dockered into Sagemaker as a Flask app - try something new.
It takes > 6 mins to create a Sagemaker endpoint, all Sagemaker model training is batch only, and also that Python is the primary language of Sagemaker - its best to use Sagemaker for model inference when invoked from streaming pipelines like Kafka Streams and Flink. For streaming model training - consider looking at --> Oryx2, PredictionIO or Flink's State Streaming.
Building the Project
- Fill twitter.properties with Twitter Developer OAuth Creds
- Deploy the bv model into Sagemaker as a Docker container - get that here --> https://amazon.awsapps.com/workdocs/index.html#/folder/e811a3ed006b7bb0b88d46b6010d1d232c21f8f69030dfa56133ee7918bb18fa See the Readme file there for instructions on Sagemaker deployment
- Specify the AWS Sagemaker endpoint and region in aws.properties make sure you have your AWS creds locally
- Run mvn package to generate a project jar
- Start a Flink cluster - see https://ci.apache.org/projects/flink/flink-docs-release-1.6/quickstart/setup_quickstart.html
- From the Flink Dashboard UI - upload the jar generated in Step 2 and submit the same Main class = de.dws.berlin.StreamingNmt