Crime and political corruption analysis using data mining, machine learning and complex networks
There has been a remarkable increase in the amount of data stored by private and public companies. On one hand, these huge amounts of data enable a detailed historical review of the processes under investigation; on the other hand, this excess of data makes it harder to extract summarized information and to make good decisions supported by well-established empirical facts. This modern phenomenon has been called big data, and understanding these systems and extracting patterns from the data requires a multidisciplinary approach. In this course at the School of Applied Mathematics in the Institute of Mathematics and Computer Science at the University of São Paulo, we will therefore address topics that involve computer science, statistics, and physics to understand these systems. We will focus on the following topics:
- Introduction to Python;
- Web scraping;
- Data mining;
- Machine learning;
- Complex networks.
Using these tools, we will focus on two issues that are of great relevance in Brazil: predicting homicides in cities and describing the mechanism behind political corruption networks. In the first topic, we will use machine learning techniques to predict the number of crimes in Brazilian cities. In the second topic, we will use complex networks to describe the interaction between politicians investigated in corruption scandals in Brazil from 1987 to 2014.
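As a rough illustration of the second topic, the sketch below builds a small network in plain Python. It assumes, for illustration only, that two people are linked when they are named in the same scandal; the scandal labels, the person labels (`p1` to `p5`), and the `build_network` helper are hypothetical placeholders, not course data or the course's own code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: scandal label -> people named in it.
# These are placeholders, not real cases.
scandals = {
    "scandal_A": ["p1", "p2", "p3"],
    "scandal_B": ["p3", "p4"],
    "scandal_C": ["p4", "p5"],
}


def build_network(scandals):
    """Return an adjacency dict linking people who share a scandal."""
    adjacency = defaultdict(set)
    for people in scandals.values():
        # Connect every pair of distinct people named in the same scandal.
        for a, b in combinations(sorted(set(people)), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


network = build_network(scandals)

# Degree: the number of distinct people each person is connected to.
degree = {person: len(neighbors) for person, neighbors in network.items()}

for person in sorted(degree):
    print(person, degree[person])
```

In this toy example, people who appear in more than one scandal act as bridges between otherwise separate groups, which is the kind of structural property the network modules examine.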
Any comments, questions, or concerns can be directed to:
- Luiz G. A. Alves lgaalves@northwestern.edu
Course Syllabus
This course is broken up into several modules with each module having a set of Jupyter notebooks to help teach concepts.
Basics, Collections and Files (Day 1)
- Jupyter Notebook
- Basic Data Types
- Flow Control
- Errors
- Lists, Tuples, and Sets
- File I/O
- Section Review (Optional)
Imports, Plots, Functions, Dictionaries, and Web Scraping (Day 2)
- The Python Standard Library
- Data Visualization
- Functions
- Review (Optional)
- Dictionaries
- Review (Optional)
- Mini-Project
- Web Scraping
Data Mining, Statistics, and Data Analysis (Day 3)
- Statistical analysis with Python
- Bootstrapping MC chains
- More stats with Python
- The Bootstrap
- Structured Data Analysis Pt1
- Structured Data Analysis Pt2
Machine Learning Part I (Day 4)
- Data Loading
- Introduction to Scikit Learn
- Unsupervised Transforms
- Cross-validation and Grid Search
- Preprocessing
Machine Learning Part II (Day 5)
- Linear Models for Regression
- Linear Models for Classification
- Trees
- Random Forests
- Gradient Boosting
- Homicides Prediction
Complex Network and Analysis of Corruption Networks (Day 6)
- Network Basics
- Analysis of Structural Properties
- Network Visualization and Queries on Networks
- Network Analysis from Data
- Corruption Network
(Extra) Social Network Analysis Using igraph and leidenalg
Software Installation
This bootcamp uses the Anaconda Python 3.7 distribution.
You must have Anaconda Python 3.7 installed before the first day of class.
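One quick way to confirm the installation is to check the interpreter version from Python itself. The snippet below is a minimal sketch: the `meets_requirement` helper is a hypothetical name, and treating any version at or above 3.7 as acceptable is an assumption, since the course materials were prepared for 3.7 specifically.

```python
import sys

REQUIRED = (3, 7)  # version named in this README


def meets_requirement(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version is at least `required`.

    Assumption: newer versions are treated as acceptable here.
    """
    return tuple(version_info[:2]) >= required


status = "OK" if meets_requirement() else "older than required"
print("Python", sys.version.split()[0], status)
```

Running this in a terminal or in a notebook cell before the first class shows which interpreter is actually being used.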
Downloading Course Materials
The course materials can be downloaded from the repository's GitHub page.
Just download the zip file, unzip it onto your Desktop, and rename the directory to school-of-applied-math.
Usage of Course Materials
This text and the majority of the course will be conducted with Jupyter Notebook http://jupyter.org. Jupyter Notebook is a 'web-based interactive computational environment', meaning that it allows you to write and execute Python code in a web page on your own computer. Jupyter Notebook is a relatively new tool, and we believe it is an excellent way to teach the basics of Python programming and computational data analysis.
Jupyter Notebook is installed by default with the Anaconda Python distribution and can be launched from the Anaconda Navigator program.
Location and period of the course:
Period: July 1 to July 6, 2019.
Hours: 08:00 to 12:00
Location: Institute of Mathematics and Computer Science, University of São Paulo (rooms of block 3).
Approval Criteria: 85% attendance and completion of the proposed activities.
Target Audience: Senior year students and postgraduate students in applied mathematics, statistics, computer science and physics interested in data science.
Number of vacancies: 20
Enrollment Period: 04/15/2019 to 05/30/2019.
References
- Downey, A. Think Python. (O’Reilly, 2012).
- Mitchell, R. Web Scraping with Python. (O’Reilly, 2018).
- Janert, P. K. Data Analysis with Open Source Tools. (O’Reilly, 2010).
- Friedman, J., Hastie, T., & Tibshirani, R. The Elements of Statistical Learning. (Springer, 2001).
- Newman, M. Networks: An Introduction. (Oxford University Press, 2010).
- Alves, L. G. A., Ribeiro, H. V., Rodrigues, F. A. Crime prediction through urban metrics and statistical learning. Physica A 515, 435 (2018).
- Ribeiro, H. V., Alves, L. G. A., Martins, A. F., Lenzi, E. K., Perc, M. The dynamical structure of political corruption networks. Journal of Complex Networks CNY002 (2018).
- Amaral, Luis A. N., Pah, Adam R., et al., NICO 101 - Introduction to Programming for Big Data
- Mueller, A., Introduction to Machine Learning with Python
- Unpingco, J, Python for Probability, Statistics, and Machine Learning
- Derzsy, N., Network Graph Analysis in Python
- Guimera, R., Mossa, S., Turtschi, A., & Amaral, L. A. N., The worldwide air transportation network: Anomalous centrality, community structure, and cities' global roles. Proceedings of the National Academy of Sciences, 102(22), 7794-7799 (2005).
- Guimera, R., & Amaral, L. A. N., Functional cartography of complex metabolic networks. Nature, 433(7028), 895 (2005).
- Traag, V., Computational Social Science (CSS) Workshop