rueedlinger / zdays15

zdays

zdays15

Text Mining - How to extract insights from text

To play around with Python, use a Python distribution.

Data Pipeline steps

The goal is to build a data pipeline that extracts data and stores it in a search engine. A data pipeline could contain the following steps:

  • data extraction - extract text from different file formats.
    • data extraction with Apache Tika - use tika-python to extract text from different file formats.
  • transform - transform unstructured data into structured data.
  • annotate data - use different strategies to annotate the text with metadata.
    • annotate text with metadata from an external source.
    • classify text - annotate text with a supervised machine learning algorithm.
    • cluster text - annotate text with an unsupervised machine learning algorithm.
  • store data
  • visualize data
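
The steps above can be sketched as a chain of small functions. This is only a minimal illustration using the Python standard library, not the repository's actual code. All function names (`extract_text`, `transform`, `annotate`, `store`, `run_pipeline`) are hypothetical. The extraction step reads plain text as a stand-in for a Tika-based extractor, the "external source" is a plain dictionary, and the "search engine" is a JSON Lines file. The classification, clustering and visualization steps are left out.

```python
import json
import re
import tempfile
from pathlib import Path


def extract_text(path):
    """Extraction step. Assumption: plain-text input only; a Tika-based
    extractor would take this function's place for other file formats."""
    return Path(path).read_text(encoding="utf-8")


def transform(path, text):
    """Transform step: turn unstructured text into a structured record."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {"source": str(path), "text": text.strip(), "tokens": tokens}


def annotate(record, lookup):
    """Annotation step: add tags from an external source.
    Assumption: the external source is a dict mapping a token to a tag."""
    tags = {lookup[token] for token in record["tokens"] if token in lookup}
    record["tags"] = sorted(tags)
    return record


def store(records, index_path):
    """Store step. Assumption: a JSON Lines file stands in for the
    search engine; an indexing client would be called here instead."""
    with open(index_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def run_pipeline(paths, lookup, index_path):
    """Run extract -> transform -> annotate for each file, then store."""
    records = [annotate(transform(p, extract_text(p)), lookup) for p in paths]
    store(records, index_path)
    return records


if __name__ == "__main__":
    # Small demo with a temporary document and a toy lookup table.
    workdir = Path(tempfile.mkdtemp())
    doc = workdir / "example.txt"
    doc.write_text("Text mining extracts insights from text.", encoding="utf-8")
    result = run_pipeline([doc], {"mining": "analytics"}, workdir / "index.jsonl")
    print(result[0]["tags"])
```

Keeping each step as a separate function means a single stage (for example the extractor or the storage backend) can be swapped without touching the others.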
