To experiment with Python, use a Python distribution:
- Anaconda, Canopy, ... (https://wiki.python.org/moin/PythonDistributions)
- Manage dependencies with virtualenv (https://virtualenv.pypa.io/)
- install packages with the distribution's package manager first (e.g. conda), and fall back to pip only when a package is not available there
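
When working with virtualenv it is easy to install packages into the wrong interpreter by accident. The snippet below is a minimal sketch of how to check, from inside Python, whether the running interpreter belongs to a virtual environment. The function name `in_virtual_environment` is made up for this illustration, and the check covers virtualenv and venv only; a conda environment is not detected this way.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment:", in_virtual_environment())
    print("interpreter:", sys.executable)
```
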
The goal is to build a data pipeline that extracts data and stores it in a search engine. A data pipeline could contain the following steps:
- data extraction - extract text from different file formats.
- data extraction with Apache Tika - use tika-python to extract text from different file formats.
- transform - transform unstructured data into structured data.
- annotate data - use different strategies to annotate the text with metadata.
- annotate text with metadata from an external source.
- classify text - annotate text with a supervised machine learning algorithm.
- clustering text - annotate text with an unsupervised machine learning algorithm.
- store data
- visualize data
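
The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions, not the final design: the names `Document`, `transform`, `annotate` and `run_pipeline` are hypothetical, the "external source" is a plain dict, and the "search engine" is a Python list standing in for a real index. Classification, clustering and visualization are left out. `extract_with_tika` uses `parser.from_file` from tika-python, which needs `pip install tika` and a Java runtime; it is imported lazily so the rest of the sketch runs without it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    """Structured record produced by the transform step (hypothetical)."""
    source: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def extract_with_tika(path: str) -> str:
    """Extract text from a file with tika-python (requires Java)."""
    from tika import parser  # lazy import: only needed for real files
    parsed = parser.from_file(path)
    return parsed.get("content") or ""  # content can be None


def transform(source: str, raw_text: str) -> Document:
    """Turn raw text into a structured record with normalized whitespace."""
    return Document(source=source, text=" ".join(raw_text.split()))


def annotate(doc: Document, lookup: Dict[str, str]) -> Document:
    """Add metadata for every lookup keyword found in the text.

    `lookup` stands in for an external metadata source (an assumption).
    """
    lowered = doc.text.lower()
    for keyword, label in lookup.items():
        if keyword in lowered:
            doc.metadata[keyword] = label
    return doc


def run_pipeline(
    sources: Iterable[str],
    extract: Callable[[str], str],
    lookup: Dict[str, str],
    store: List[Document],
) -> List[Document]:
    """Run extract -> transform -> annotate -> store for each source."""
    for source in sources:
        doc = annotate(transform(source, extract(source)), lookup)
        store.append(doc)  # a real pipeline would index into a search engine
    return store
```

To try it without Tika, pass any function that maps a source name to text, for example `run_pipeline(["a.txt"], {"a.txt": "Apache  Tika"}.__getitem__, {"tika": "parser"}, [])`. For real files, pass `extract_with_tika` as the `extract` argument.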