gabrielsscavalcante / spyck

Extensible framework for data mining

Home Page: http://zetaresearch.github.io/projects/spyck

An extensible framework for data mining.

Purpose

spyck is a framework that aims to make it easy to develop crawlers and integrate the collected data, regardless of its type and origin. It is easily extensible and adaptable, and it aims to be easy to use, even for beginners.

It can be very useful for a wide variety of cases, e.g.:

  • Journalistic investigations to uncover corruption cases - like this one;
  • Researching the population of a particular group;
  • Getting a better understanding of a job candidate before hiring them;
  • etc.

Concepts

During the framework's development, some words took on new meanings:

  • Crawler: The data collector.
  • Harvest: The execution.
  • Dependencies: Required previous data.

Also, each crawler has a crop: the data it can possibly obtain after a harvest. Each crawler works on one or more entities, in which it contextualizes and stores the collected data.
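To make these terms concrete, here is a minimal sketch of how crawlers, dependencies, crops and harvests could fit together. It is not spyck's actual API: the class names, the `run_harvests` helper and the dictionary used as an entity are all hypothetical and exist only for illustration.

```python
class Crawler:
    """Hypothetical base class: a data collector (not spyck's real API)."""
    name = "base"
    dependencies = ()  # data that must already exist before harvesting
    crop = ()          # data this crawler may produce after a harvest

    def harvest(self, entity):
        raise NotImplementedError


class NameCrawler(Crawler):
    name = "name_crawler"
    dependencies = ("id",)
    crop = ("full_name",)

    def harvest(self, entity):
        # A real crawler would fetch this from some source; this is fake data.
        return {"full_name": "Person #{}".format(entity["id"])}


class EmailCrawler(Crawler):
    name = "email_crawler"
    dependencies = ("full_name",)
    crop = ("email",)

    def harvest(self, entity):
        local = entity["full_name"].lower().replace(" ", ".").replace("#", "")
        return {"email": local + "@example.com"}


def run_harvests(crawlers, entity):
    """Run each crawler once its dependencies are present in the entity.

    Returns the enriched entity and the names of crawlers that never ran.
    """
    pending = list(crawlers)
    progressed = True
    while pending and progressed:
        progressed = False
        for crawler in list(pending):
            if all(dep in entity for dep in crawler.dependencies):
                entity.update(crawler.harvest(entity))
                pending.remove(crawler)
                progressed = True
    return entity, [c.name for c in pending]
```

In this sketch, calling `run_harvests([EmailCrawler(), NameCrawler()], {"id": 7})` runs `NameCrawler` first, even though it is listed second, because `EmailCrawler` depends on the `full_name` that `NameCrawler` yields.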

Requirements

Everything below can be easily installed via setuptools.

  • python 3.x
  • requests
  • PyPDF2
  • selenium
  • pyslibtesseract
  • aylien-apiclient

Then you need to install:

  • phantomJS
    sudo apt-get install phantomjs
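As a quick sanity check after installation, a short script along these lines could report what is still missing. It is only a sketch: the import names in `PYTHON_MODULES` are assumptions (in particular, `aylienapiclient` is assumed to be the import name of aylien-apiclient), and the helper functions are hypothetical rather than part of spyck.

```python
import importlib.util
import shutil

# Assumed import names for the packages listed above.
PYTHON_MODULES = ["requests", "PyPDF2", "selenium", "pyslibtesseract", "aylienapiclient"]
# Assumed executable name installed by the phantomjs package.
BINARIES = ["phantomjs"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_binaries(names):
    """Return the executables that are not on the PATH."""
    return [n for n in names if shutil.which(n) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Missing binaries:", missing_binaries(BINARIES))
```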

Other Resources

Relax, some better docs will come soon.

You can find more information about the framework, and follow its development, through this blog post.

You can also check the slides from a presentation made at XI Pylestras about the framework here.

Roadmap

  • Simplify the code and make it easier to work on the development of the framework itself.
  • Create a graphical interface (GUI) to make it more accessible to beginners.
  • Implement analysis and inferences about the collected data.

Contributing

Contributions are very welcome! If you'd like to contribute, these guidelines may help you.

History

See Releases for a detailed changelog.

License

MIT License © ZETA Research.


Languages

Language: Python 100.0%