An extensible framework for data mining.
spyck is a framework that aims to make it easy to develop crawlers and integrate the data they collect, regardless of its type and origin. It is easy to extend and adapt, and it aims to be easy to use, even for beginners.
It can be very useful for a wide variety of cases, e.g.:
- Journalistic investigations to uncover corruption cases - like this one;
- Researching the population of a particular group;
- Getting a better understanding of a job candidate before hiring them;
- etc.
During the framework's development, some words took on specific meanings:
- Crawler: The data collector.
- Harvest: The execution.
- Dependencies: Required previous data.
Each crawler also has a crop: the data it can yield after a harvest. Each crawler works on one or more entities, in which it contextualizes and stores the collected data.
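To make the vocabulary above concrete, here is a minimal sketch of how crawlers, dependencies, crops and harvests could relate to each other. It is not spyck's actual API: the `Crawler` class, the `run_harvests` function and the field names (`name`, `document_id`, `companies`) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Crawler:
    """Hypothetical data collector (not spyck's real class)."""
    name: str
    dependencies: frozenset  # data that must already be known before a harvest
    crop: frozenset          # data this crawler can yield after a harvest

    def harvest(self, known):
        # Placeholder execution: a real crawler would query an external source.
        return {key: f"<{key} collected by {self.name}>" for key in self.crop}


def run_harvests(crawlers, known):
    """Run each crawler once its dependencies are satisfied.

    Returns the enriched data and the order in which crawlers were harvested.
    Crawlers whose dependencies can never be met are simply skipped.
    """
    known = dict(known)
    pending = list(crawlers)
    order = []
    while True:
        ready = [c for c in pending if c.dependencies.issubset(known)]
        if not ready:
            break
        for crawler in ready:
            known.update(crawler.harvest(known))
            order.append(crawler.name)
            pending.remove(crawler)
    return known, order


# Illustrative crawlers: the second one needs the crop of the first.
crawlers = [
    Crawler("companies", frozenset({"document_id"}), frozenset({"companies"})),
    Crawler("identity", frozenset({"name"}), frozenset({"document_id"})),
]
data, order = run_harvests(crawlers, {"name": "Jane Doe"})
print(order)  # ['identity', 'companies']
```

The point of the sketch is only the dependency idea: a crawler is harvested once the data it requires is available, and its crop may in turn unlock other crawlers.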
Everything below can be easily installed via setuptools.
- python 3.x
- requests
- PyPDF2
- selenium
- pyslibtesseract
- aylien-apiclient
Then you need to install:
- PhantomJS
sudo apt-get install phantomjs
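If you want to confirm that the requirements above are in place, a small check like the following may help. It is only a sketch and not part of spyck: the `missing_prerequisites` helper is hypothetical, and the import names (in particular `aylienapiclient` for the `aylien-apiclient` package) are assumptions that may differ in your environment.

```python
import importlib.util
import shutil


def missing_prerequisites(modules, executables):
    """Return (missing_modules, missing_executables) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


# Assumed top-level import names for the packages listed above.
MODULES = ["requests", "PyPDF2", "selenium", "pyslibtesseract", "aylienapiclient"]
# Assumed executable name installed by the phantomjs package.
EXECUTABLES = ["phantomjs"]

modules, executables = missing_prerequisites(MODULES, EXECUTABLES)
print("Missing Python packages:", modules or "none")
print("Missing executables:", executables or "none")
```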
Relax, some better docs will come soon.
You can find more info about the framework, and some news about its development, in this blog post.
You can also check the slides from a presentation about the framework given at XI Pylestras here.
- Simplify the code and make it easier to work on the development of the framework itself.
- Create a graphical interface (GUI) to make it more accessible to beginners.
- Implement analysis and inferences about the collected data.
Contributions are very welcome! If you'd like to contribute, these guidelines may help you.
See Releases for detailed changelog.
MIT License © ZETA Research.