This repository contains the materials for "Data acquisition, wrangling and exploratory analysis in Python", a three-day intensive CADi ("Cursos de Actualización en las Disciplinas") for faculty members at the "Tecnológico de Monterrey" institute.
The course covers the parsing and handling of data from different social sources, as well as the use of current frameworks for data-driven analyses.
For other data-analysis-related topics, please take a look at the dataViz_CADi repository, which contains exercises on data visualization in R, Python and Mathematica.
This workshop was created with flexibility in mind. As such, modules are fairly independent and can be followed in a different order than the one suggested here. For a topic-oriented breakdown of the contents, please have a look at the sitemap.
- Introduction: Objectives, scope, requirements and expectations.
- Python 101: Introduction to the programming language (description, core types, collections and functions).
- Python Environments: Using Anaconda and virtualenv for development.
- PyPI: Installing, browsing, and handling Python packages.
- IDEs: Using IDLE, Jupyter, Spyder, nteract, and Atom to write and launch our code.
- Git: Version control using GitHub for code development, sharing and collaboration.
- Data Wrangling: Primer: Data science and how data wrangling fits into it.
- Data Wrangling: Part 1: Using pandas and matplotlib.
- Data Sources: Twitter: Interfacing with the API to get trends, tweets, and tags.
- Intermediate Python: Dealing with files, serialization, and simple cases of parallel computing.
- Data Wrangling: Part 2: Using scikit-learn to parse, manipulate, and pre-analyze data.
- Data Sources: Google Trends: Retrieving trends from Google searches.
- Twitter: Tweets and text sentiment analysis.
- Python pkg: Creating and installing a custom python package.
- Advanced Python: Higher-level topics (garbage collection, lambda functions).
- A Story to Tell: Data-driven storytelling.
- Data Sources: Part 3: Obtaining data through web scraping (BeautifulSoup), RSS (XML), and the Dropbox API.
- GeoData: How to work with geographic datasets with geopandas and osmnx.
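
As a small taste of what the modules above build toward, here is a minimal sketch of a wrangling step (parse, clean, aggregate, serialize) using only the Python standard library. The data, the column names, and the `wrangle` helper are hypothetical and are not taken from the course materials; the workshop itself relies on pandas and scikit-learn for this kind of task.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw data: a tiny CSV with stray whitespace,
# inconsistent capitalization, and one missing value.
RAW_CSV = """source,topic,mentions
twitter, python ,12
twitter,pandas,
rss,python,5
rss, Pandas,3
"""


def wrangle(raw_text):
    """Parse CSV text, clean each row, and total the mentions per topic."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(raw_text)):
        topic = row["topic"].strip().lower()  # normalize topic labels
        mentions = row["mentions"].strip()
        if not mentions:  # skip rows with a missing count
            continue
        totals[topic] += int(mentions)
    return dict(totals)


summary = wrangle(RAW_CSV)

# Serialize the cleaned summary as JSON for later use.
print(json.dumps(summary, sort_keys=True))
```

Rows with a missing count are dropped here rather than imputed; that is only one possible design choice, and the better one depends on the dataset.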
- anaconda: Data science platform and package manager for Python and R.
- atom: Versatile IDE for R, Python, Markdown, Javascript, amongst others.
- matplotlib: Python's most popular package to plot data.
- numpy: Highly efficient array manipulation in Python.
- pandas: Popular dataframe manipulation in Python.
- plotly: A good alternative for interactive plots in Python (similar to Shiny in R).
- onlinegdb: Online Python interpreter (originally developed for C and C++).
- repl.it: Online Python IDE and interpreter (also supports many other languages).
- scikit-learn: Data analysis and machine learning platform for python.
- sympy: Symbolic calculus in Python.
- Google Earth Studio: Useful to create geographic visualizations (currently under beta program).
- Scrapy: Web-scraping framework for Python.
- BeautifulSoup: An approachable HTML/XML parsing library for web scraping.
- Spacy: Advanced natural language analysis library.
- NLTK: Natural language toolkit for python.
- Seaborn: Documentation for the seaborn statistical visualization package.
- xlrd: Excel data reader.
- Anaconda documentation: Documentation for the anaconda environments manager.
- dataViz Book: Online book with data visualization examples and principles.
- dataViz CADi: "Data Visualization" CADi bootcamp taught in December of 2018 with code examples in Python, R and Mathematica.
- Git Carpentry Workshop: A good git/github introduction for Spanish-speaking audiences (with lots of examples and explanations).
- Python 3.7 documentation: Official Python documentation, with usage examples for the built-in functions.
- Mists of Data: Ricardo Andrade's personal blog devoted to data analysis in Python with code examples.
- Numpy documentation: Examples of use and developer guides for the popular multidimensional array package.
- SciPy documentation: Guides for the most popular package for scientific computing in python.
- Virtualenv documentation: Tutorials and documentation for the virtualenv Python environment manager.
- DataCamp: Online courses on data analysis.
- Towards Data Science: Hub with interesting blog posts on data science with code and datasets.
- Python Cheat Sheets: A compendium of useful Python cheat sheets.
- Data Wrangling with Python: Repository of code with relevant exercises for data wrangling in Python.
- Boehmke, Bradley C. (2016). Data Wrangling with R. Springer. https://doi.org/10.1007/978-3-319-45599-0.
- Theodore Petrou (2017). Pandas Cookbook.
- Scott Chacon and Ben Straub (2019). Pro Git.
- McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. ISBN-10: 1491957662
- Géron, Aurélien (2018). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
- Lutz, M., & Ascher, D. (2015). Learning Python.
- Lubanovic, B. (2015). Introducing Python: Modern Computing in Simple Packages.
- Lutz, M. (2014). Python Pocket Reference.
- Beazley, D. (2013). Python Cookbook.
- Russell, Matthew A. (2013). Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More.
- Cairo, Alberto (2016). The truthful art: data, charts, and maps for communication. ISBN-13: 978-0321934079
- Foster Provost, Tom Fawcett. Data science for business.
- Kirk, A. (2016). Data Visualisation: A Handbook for Data Driven Design. ISBN-13: 978-1473912144
- Yau, N. (2011). Visualize This: The FlowingData Guide to Design, Visualization, and Statistics. Wiley Pub. ISBN-13: 978-0470944882
- Yau, N. (2013). Data points: visualization that means something. ISBN-13: 978-1118462195
- Lutz, Mark, and David Ascher (2004). Learning Python. ISBN-13: 978-9351102014
- Matthes, E. (2016). Python Crash Course - A Hands-On, Project-Based Introduction to Programming. No Starch Press.
Rick Leigh Swenson Durie • Humberto Cárdenas Anaya • Norma Amanda Elías Solís • Rubén Darío Santiago Acosta • Faustino Yescas Martinez • Raúl Gómez Castillo • Luis Angel Trejo Rodríguez • Jorge Adolfo Ramírez Uresti • Ariel Ortíz Ramírez • Lucio López Cavazos • Pedro Oscar Pérez Murueta • María del Consuelo Serrato Arias • Alfredo Santana Díaz • Roberto Martínez Román • José Luis Gómez Muñoz • Jesús Cuauhtémoc Téllez Gaytán • Manuel Sotelo Duarte • Jorge Sastré Hernández • Ricardo Mendez Hernandez • Luis Enrique Villagómez Guerrero • Francisco Javier Rojas Correa • András Takács • Oriam Renan De Gyves López • Oscar Antonio Osorio Pérez • Miguel Angel Medina Pérez • Yocanxóchitl Perfecto Avalos • Jesús Arturo Escobedo Cabello • Hector Javier Medel Cobaxin
Contact: [ sanchez.hmsc@berkeley.edu | chipdelmal@gmail.com ]
My main projects: [ MGDrivE & MoNeT ]
My personal website: [ chipdelmal.github.io ]