gabrielsscavalcante/spyck

An extensible framework for data mining.

Purpose

spyck is a framework that aims to make it easy to develop crawlers and to integrate the data they collect, regardless of its type and origin. It is designed to be easy to extend and adapt, and easy to use, even for beginners.

It can be very useful for a wide variety of cases, e.g.:

Journalistic investigations to uncover corruption cases - like this one;
Research on the population of a particular group;
Getting to know a job candidate better before hiring them;
etc.

Concepts

During the framework's development, some words took on new meanings:

Crawler: The data collector.
Harvest: The execution.
Dependencies: Required previous data.

Each crawler also has a crop: the data it can yield after a harvest. Each crawler works on one or more entities, in which it contextualizes and stores the collected data.
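
As a rough illustration of how these terms fit together, here is a minimal sketch in plain Python. It is not spyck's actual API: the `Crawler` class, the `run_harvests` function, the field names and the sample data are all hypothetical. It also assumes that a crawler may only be harvested once its dependencies are already present in the entity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


# Hypothetical names throughout: this is NOT spyck's real API.
@dataclass
class Crawler:
    name: str
    dependencies: Set[str]  # data that must already exist before the harvest
    crop: Set[str]          # data this crawler can yield after the harvest
    harvest: Callable[[Dict[str, str]], Dict[str, str]]  # the execution


def run_harvests(crawlers: List[Crawler], entity: Dict[str, str]) -> List[str]:
    """Harvest each crawler once its dependencies are present in the entity.

    Returns the crawler names in the order they were harvested. Crawlers
    whose dependencies can never be satisfied are left out.
    """
    pending = list(crawlers)
    order: List[str] = []
    progressed = True
    while pending and progressed:
        progressed = False
        for crawler in list(pending):
            if crawler.dependencies.issubset(entity):
                collected = crawler.harvest(entity)
                # Store only the fields the crawler declared as its crop.
                entity.update(
                    {key: value for key, value in collected.items() if key in crawler.crop}
                )
                pending.remove(crawler)
                order.append(crawler.name)
                progressed = True
    return order


def demo():
    # Made-up crawlers and data, for illustration only.
    profile = Crawler("profile", {"id"}, {"city"},
                      lambda e: {"city": "city-of-" + e["id"]})
    registry = Crawler("registry", {"name"}, {"id"},
                       lambda e: {"id": "id-for-" + e["name"]})
    entity = {"name": "Ada"}
    order = run_harvests([profile, registry], entity)
    return order, entity


if __name__ == "__main__":
    print(demo())
```

In this sketch the `profile` crawler is listed first but harvested second, because the `id` it depends on is part of the crop of `registry`.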

Requirements

Everything below can be easily installed via setuptools.

python 3.x
requests
PyPDF2
selenium
pyslibtesseract
aylien-apiclient

Then you need to install:

phantomJS

sudo apt-get install phantomjs
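
To confirm that the Python requirements are available after installation, a small check like the following may help. It is only a sketch: the import names are assumptions and can differ from the package names listed above (for example, aylien-apiclient is assumed to import as `aylienapiclient`), and `find_missing` is a hypothetical helper, not part of spyck.

```python
from importlib.util import find_spec

# Import names are assumptions; they can differ from the package names.
REQUIRED_MODULES = [
    "requests",
    "PyPDF2",
    "selenium",
    "pyslibtesseract",
    "aylienapiclient",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```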

Other Resources

Relax, some better docs will come soon.

You can find more info about the framework - and some background on its development - in this blog post.

You can also check the slides from a presentation made at XI Pylestras about the framework here.

Roadmap

Simplify the code and make it easier to work on the development of the framework itself.
Create a graphical interface (GUI) to make it more accessible to beginners.
Implement analysis and inferences about the collected data.

Contributing

Contributions are very welcome! If you'd like to contribute, these guidelines may help you.

History

See Releases for a detailed changelog.

License

http://zetaresearch.github.io/projects/spyck

MIT License
