Messy data has an inconsistent or inconvenient format, and may have missing values. Noisy data has measurement error. Data mining extracts meaningful information from messy, noisy data. This is a multi-step process which includes gathering, cleaning, visualizing, modeling, and reporting.
This course provides an introduction to Python, with emphasis on data mining and other statistical applications. Students who finish the class should understand the following concepts (listed by complexity) and be able to complete relevant tasks using Python.
Uses of Python: Python is a high-level programming language in widespread use. Although it's a general-purpose language, it's important to recognize its strengths and weaknesses.
Reading and Writing Docs: Being able to read code documentation empowers students to educate themselves. This entails developing a new vocabulary, e.g., what `**kwargs` means in a function definition. Writing code documentation is equally important, especially in collaborative projects. Consider the benefits of official docs like those found in pandas.
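As a minimal illustration of that vocabulary, the hypothetical function below combines a docstring with `**kwargs`; the function name and fields are invented for this sketch.

```python
def describe(name, **kwargs):
    """Return a one-line description of `name`.

    Any extra keyword arguments are collected into the dict `kwargs`,
    so callers can pass arbitrary labeled options.
    """
    details = ", ".join(f"{k}={v}" for k, v in sorted(kwargs.items()))
    return f"{name}: {details}" if details else name

print(describe("iris", rows=150, cols=5))  # iris: cols=5, rows=150
```

Reading the official docs for a library function often comes down to recognizing exactly this kind of signature.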
Reporting Results: The results of data analysis should be presented in a clear, reproducible way. IPython notebooks make it easy to combine writing, visualizations, and code.
- Writing should be more than just a summary. Critical thinking is essential!
- Graphs should be well-labelled, and convey information of interest without clutter.
- Use LaTeX or Markdown for appropriate formatting. Export to HTML, and optionally to PDF using pandoc and LaTeX.
Code Organization: Functions and classes are the basic building blocks used to organize code. Breaking data analysis tasks into small, modular steps makes code easier to read, write, test, and reuse.
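A hedged sketch of that idea: a toy pipeline broken into two small functions, each easy to test on its own. The step names and data are made up for illustration.

```python
def clean(records):
    """Drop records containing missing values (None)."""
    return [r for r in records if None not in r.values()]

def summarize(records, key):
    """Average the numeric field `key` over the records."""
    values = [r[key] for r in records]
    return sum(values) / len(values)

raw = [{"temp": 20.0}, {"temp": None}, {"temp": 22.0}]
print(summarize(clean(raw), "temp"))  # 21.0
```

Because each step is a separate function, it can be reused in another pipeline or checked independently.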
Database Queries: Join two different data sets on a common key using the `join` and `merge` methods. Understand left, right, inner, and outer joins.
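A short sketch of two of these joins with pandas, using made-up tables that share the key `id`:

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
scores = pd.DataFrame({"id": [2, 3, 4], "score": [88, 92, 75]})

# Inner join keeps only ids present in both tables (2 and 3).
inner = people.merge(scores, on="id", how="inner")

# Left join keeps every row of `people`; unmatched scores become NaN.
left = people.merge(scores, on="id", how="left")

print(inner)
print(left)
```

Swapping `how="inner"` for `"right"` or `"outer"` produces the other join types.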
Numerical Computing: Efficiency and numerical stability are essential to any algorithm. NumPy's n-dimensional arrays (`ndarray`) support fast vectorized operations. NumPy and SciPy provide much of the same functionality as R's built-in functions.
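A quick sketch of what "vectorized" means here; the array contents are arbitrary.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Vectorized: one call operating on the whole ndarray at C speed.
y = np.sqrt(x) + 1.0

# The equivalent pure-Python loop is much slower for large arrays:
# y = [math.sqrt(v) + 1.0 for v in x]

print(y[:3])
```

This is the same idiom R users know from writing `sqrt(x) + 1` on a whole vector.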
Effective pandas Use: The pandas module is a good foundation for most data analyses.
- Select and filter rows / columns
- Read various formats including CSV, HTML, JSON, XML, HDF5
- Write CSVs
- Handle missing data (`NaN`s)
- Process strings with vectorized methods via the `.str` accessor
- Use group-by operations, such as `groupby.apply`, which resembles R's apply functions and `ddply` from `plyr`
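A small sketch tying two of these items together, the `.str` accessor and a group-by, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Davis, CA", "Reno, NV", "Davis, CA"],
    "sales": [10, 20, 30],
})

# Vectorized string methods: split each city and keep the state code.
df["state"] = df["city"].str.split(", ").str[-1]

# Group by state and aggregate, similar in spirit to plyr's ddply.
totals = df.groupby("state")["sales"].sum()
print(totals)
```

R users can read `df.groupby("state")["sales"].sum()` roughly as `ddply(df, "state", summarize, sales = sum(sales))`.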
Appropriate Abstraction: Many libraries provide similar functionality at different levels of abstraction. One example is `requests` compared to `urllib2` for HTTP. High-level libraries should be chosen when possible for data analysis tasks.
3rd or 4th year undergraduates, master's students, and 1st or 2nd year PhD students. Students should be familiar with programming, but not necessarily with Python. The intention is to avoid excessive overlap with STA 141 and ECS 10 (basic programming via Python). Completion of one of these two courses could be recommended or made a prerequisite.
Python 3 has syntax changes and new features that break compatibility with Python 2. All of the major scientific computing libraries have added support for Python 3 over the last few years, so it will be our focus. We recommend the Anaconda Python 3 distribution, which bundles most packages we'll use into one download. Any other packages needed can be installed using `pip` or `conda`.
Python code is supported by a vast array of editors.
- Spyder IDE, included in Anaconda, is a Python equivalent of RStudio, designed with scientific computing in mind.
- PyCharm IDE has a very well-designed user interface. Chris uses PyCharm with the IdeaVim Vi plugin.
- General-purpose text editors, such as Vim and Emacs, are a great choice for ambitious students. They can be used with any language. See here for more details. Clark and Nick both use Vim.
The main reference text will be Wes McKinney's book
- McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.
General purpose references include
- Lutz, M. (2014). Python Pocket Reference. O'Reilly Media.
- Beazley, D. (2009). Python Essential Reference. Addison-Wesley.
- Pilgrim, M., & Willison, S. (2009). Dive Into Python 3. Apress.
- StackOverflow. Please be conscious of the rules!
There are also many free online resources for learning to program in Python:
- Non-programmer's Tutorial for Python 3
- Beginner's Guide to Python
- Swaroop, C. H. (2003). A Byte of Python.
- Reitz, K. Hitchhiker's Guide to Python (PDF).
- Five Lifejackets to Throw to the New Coder
- Pyvideo. Recommended speakers include Guido van Rossum, Raymond Hettinger, Travis Oliphant, Fernando Perez, David Beazley, and Alex Martelli.
Moreover, most of the packages we'll cover have excellent documentation:
- Python 3 (including the standard library)
- NumPy
- SciPy
- matplotlib
- pandas
- IPython
- scikit-learn
The core topics will be
- Python Basics
  - syntax
  - tuples, lists, dicts (and possibly `collections`)
  - list comprehensions
  - iterators (and possibly generators)
  - string manipulation
  - documenting code
- Numerical computing (`numpy` and `scipy`)
- IPython
- Plotting (`matplotlib` or `ggplot`)
- Data manipulation (`pandas`)
- Web-scraping (`requests` and `bs4`)
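For instance, the list comprehensions and iterators from the Python Basics list can be sketched in a few lines:

```python
# List comprehension: filter and transform in one expression.
squares_of_evens = [n ** 2 for n in range(10) if n % 2 == 0]
print(squares_of_evens)  # [0, 4, 16, 36, 64]

# Generator expression: same logic, but evaluated lazily on demand.
lazy = (n ** 2 for n in range(10) if n % 2 == 0)
print(next(lazy), next(lazy))  # 0 4
```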
Other topics we may cover include
- File I/O (`io` and `os.path`)
- Command-line argument parsing (`argparse`)
- Debugging (`pdb`)
- Profiling (`timeit` and `cProfile`)
- Test-driven development (`doctest`)
- Object-oriented programming
- Functional programming
- Statistical methods (`sklearn`)
- Database queries