Wittline / SparkSQL-with-Python

This repository has some examples of using Spark and SparkSQL with Python through PySpark

Home Page:https://wittline.github.io/SparkSQL-with-Python/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SparkSQL with Python

This repository has some examples of using Spark and SparkSQL with Python through PySpark

Profeco

We will work with the Profeco dataset, which you can download here: Profeco , is a daily historical record of more than 2,000 products, as of 2015, in various establishments in Mexico

Check the code here

  • How many records are there?
  • How many categories are there?
  • How many trade chains are being monitored (and therefore reported in that database)?
  • What are the most monitored products in each state of the country?
  • What is the trade chain with the greatest variety of monitored products?

Countries airports

Check the code here

API to count the number of tweets in a radius of 1km

I will separate in another file "tweets_geo.csv" all the different tweets with their geographic data information, this will help in the manipulation of this data in a query with sparkSQL

Check the data preparation code here

The details of the code for the API REST is in the folder API in this repository

alt text

alt text

alt text

Contributing and Feedback

Any ideas or feedback about this repository?. Help me to improve it.

Authors

License

This project is licensed under the terms of the MIT license.

About

This repository has some examples of using Spark and SparkSQL with Python through PySpark

https://wittline.github.io/SparkSQL-with-Python/


Languages

Language:HTML 99.6%Language:Python 0.4%