davideanastasia / apache-beam-getting-started

Getting Started with Apache Beam: inverted index

Home Page: https://medium.com/@davide.anastasia/getting-started-with-apache-beam-26bfc5126438

Getting Started with Apache Beam

This is a 3-2-1-go project on how to get started with Apache Beam.

Inverted Index

More on this on Medium: https://medium.com/@davide.anastasia/getting-started-with-apache-beam-26bfc5126438

The idea behind this simple batch job is to create an inverted index: given a set of documents in text format, the job parses them and builds a word -> location mapping for each word. The job is an interesting toy, as it shows how to:

  • read data together with the file name (slightly different from using TextIO)
  • filter out common stop words (in a very naive way, but more interesting ways can be found!)
  • create a CombineFn, so that values are partially combined on each worker instead of streaming all the data for a single key to a single node
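
The steps above can be sketched in plain Java without any Beam dependency. This is only an illustration under stated assumptions, not the code from this repository: the class name, the stop-word list, and the sample file names are hypothetical, and the input is given directly as (file name, line) pairs rather than read from files. The four combiner methods borrow their names from Beam's `Combine.CombineFn` (`createAccumulator`, `addInput`, `mergeAccumulators`, `extractOutput`) to show why partial accumulators can be built on separate workers and merged afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {

    // Hypothetical stop-word list; the real job may use a different one.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "is", "of", "the", "to"));

    // Splits a line into lowercase words and drops stop words (naive filtering).
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    // Combiner steps for a single word: the accumulator is a sorted set of file names.
    static Set<String> createAccumulator() {
        return new TreeSet<>();
    }

    static Set<String> addInput(Set<String> acc, String fileName) {
        acc.add(fileName);
        return acc;
    }

    static Set<String> mergeAccumulators(List<Set<String>> accs) {
        Set<String> merged = createAccumulator();
        for (Set<String> acc : accs) {
            merged.addAll(acc);
        }
        return merged;
    }

    static List<String> extractOutput(Set<String> acc) {
        return new ArrayList<>(acc);
    }

    // Builds a partial index from one bundle of {fileName, line} records,
    // as a single worker would before any data is shuffled.
    static Map<String, Set<String>> partialIndex(String[][] bundle) {
        Map<String, Set<String>> partial = new TreeMap<>();
        for (String[] record : bundle) {
            for (String word : tokenize(record[1])) {
                addInput(partial.computeIfAbsent(word, k -> createAccumulator()), record[0]);
            }
        }
        return partial;
    }

    // Merges the per-worker partial indexes key by key into the final index.
    static Map<String, List<String>> buildIndex(List<Map<String, Set<String>>> partials) {
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (Map<String, Set<String>> partial : partials) {
            partial.forEach((word, acc) ->
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(acc));
        }
        Map<String, List<String>> index = new TreeMap<>();
        grouped.forEach((word, accs) -> index.put(word, extractOutput(mergeAccumulators(accs))));
        return index;
    }

    public static void main(String[] args) {
        // Two bundles stand in for data processed on two different workers.
        String[][] bundleA = {{"a.txt", "The quick brown fox"}};
        String[][] bundleB = {{"b.txt", "A quick trip to the zoo"}, {"a.txt", "Fox in the zoo"}};
        Map<String, List<String>> index =
            buildIndex(Arrays.asList(partialIndex(bundleA), partialIndex(bundleB)));
        index.forEach((word, files) -> System.out.println(word + " -> " + files));
    }
}
```

Because each worker ships only one small set per word rather than every (word, file name) pair, a very frequent word does not overload the node that produces its final entry. In an actual Beam pipeline the runner decides how records are bundled and when accumulators are merged; the explicit bundles here only make that behaviour visible.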

Languages

Java 97.2%, Makefile 1.6%, Shell 1.1%