zdays15

Text Mining - How to extract insights from text

To experiment with Python, use a Python distribution:

Anaconda, Canopy, ... (https://wiki.python.org/moin/PythonDistributions)
Manage dependencies with virtualenv (https://virtualenv.pypa.io/)
Install packages with the distribution's package manager (e.g. conda) first, then fall back to pip

The goal is to build a data pipeline that extracts data and stores it in a search engine. A data pipeline could contain the following steps:

data extraction - extract text from the different file formats.
- data extraction with Apache Tika. Use tika-python to extract text from different file formats
transform - transform unstructured data into structured data.
annotate data - use different strategies to annotate the text with metadata.
- annotate text with metadata from an external source.
- classify text - annotate text with a supervised machine learning algorithm.
- cluster text - annotate text with an unsupervised machine learning algorithm.
store data
visualize data
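
The steps above can be sketched as a chain of small functions. This is a minimal illustration using only the standard library, not the actual pipeline: all function names (extract, transform, annotate, store, summarize, run_pipeline) and the keyword_topics lookup are hypothetical. The extraction step works on text that is already plain; in a real pipeline it could be replaced by a call to tika-python, and the annotation step by a classifier or clustering model.

```python
import json
import re
from collections import Counter


def extract(raw_documents):
    """Extraction step: yield (doc_id, text) pairs.

    Assumption: the input is already plain text. A real pipeline could
    use tika-python here to read PDF, DOCX and other formats.
    """
    for doc_id, text in raw_documents.items():
        yield doc_id, text


def transform(doc_id, text):
    """Transform step: turn unstructured text into a structured record."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {"id": doc_id, "text": text, "tokens": tokens}


def annotate(record, keyword_topics):
    """Annotation step: attach metadata to the record.

    keyword_topics is a hypothetical keyword -> topic lookup that stands
    in for an external source, a classifier, or a clustering model.
    """
    found = {keyword_topics[t] for t in record["tokens"] if t in keyword_topics}
    record["topics"] = sorted(found)
    return record


def store(records):
    """Store step: serialise records as JSON lines.

    Assumption: one JSON document per line is a convenient format to
    hand to a search engine's bulk indexing interface.
    """
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def summarize(records):
    """Visualization input: count how many documents carry each topic."""
    return Counter(topic for r in records for topic in r["topics"])


def run_pipeline(raw_documents, keyword_topics):
    """Run extract -> transform -> annotate and return the records."""
    records = []
    for doc_id, text in extract(raw_documents):
        record = transform(doc_id, text)
        records.append(annotate(record, keyword_topics))
    return records


if __name__ == "__main__":
    docs = {
        "a.txt": "Python makes text mining easy.",
        "b.txt": "Nothing relevant here.",
    }
    topics = {"python": "programming", "mining": "analytics"}
    records = run_pipeline(docs, topics)
    print(store(records))
    print(summarize(records))
```

Keeping each step as a separate function makes it easy to swap one implementation for another, for example replacing the keyword lookup with a trained model, without touching the rest of the pipeline.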