vtshen / introduction-to-data-science

Course INFO 490 at UIUC

Introduction to Data Science

Week 1: Introduction to Unix

Lesson 1: Virtualization and Dockers

  • dockerizing applications:
    • run an application: sudo docker run ubuntu /bin/echo 'hello world'
    • run "hello world" application in the background (as a daemon): sudo docker run -d ubuntu /bin/sh -c "while truel; do echo hello world; sleep 1; done"
    • check the log of daemonized hello world: sudo docker logs [ID of the container]
    • stop the container: docker stop [ID of the container]

Lesson 2: The Unix Shell

  • pwd, ls, cd

  • determine a file's type: file [filename]

  • view the contents of an ASCII-encoded (text) file: less

  • manipulating files and directories: cp, mv, mkdir, rm, ln

    • it is safer to use cp -i ... and mv -i ... (interactive mode, which prompts before overwriting)
    • be careful with rm! Linux does not have an undelete command, so double-check what you are about to delete (you can use ls first). e.g. rm *.html deletes all files whose names end in '.html', but if you mistakenly type rm * .html, the extra whitespace will wipe out all of your files! (use ls * .html to check in this case)
    • wildcards (used to match file names): *, ?, [characters], [!characters], [[:class:]]
  • hard links and symbolic links

  • reference

Lesson 3: Basic Unix Concepts

  • working with commands

    • type: indicate how a command name is interpreted
    • which: display which executable program will be executed
    • help: get help for shell builtins
    • man: get the manual
    • apropos: display a list of appropriate commands
    • info: display a command's info entry
    • whatis: display a very brief description of a command
    • alias: create an alias for a command
  • redirection

    • default input and output: stdin, stdout, stderr (by default both stdout and stderr are linked to the screen, and stdin attached to the keyboard)
    • redirect stdout
      • overwrite the file if it exists: ls -l /usr/bin > ls-output.txt
      • append the output to the file: ls -l /usr/bin >> ls-output.txt
    • redirect stderr
      • file descriptor: 0 standard input, 1 standard output, 2 standard error
      • ls -l /usr/bin 2> ls-output.txt
    • redirect both stdout and stderr (two methods)
      • ls -l /bin/usr > ls-output.txt 2>&1 where 2>&1 redirects stderr to stdout. The order is important!
      • ls -l /bin/usr &> ls-output.txt
    • disposing of unwanted output
      • redirect the output to /dev/null e.g. ls -l /bin/usr 2> /dev/null
    • redirect stdin
      • cat
      • pipeline: command1 | command2 e.g. ls /bin /usr/bin | sort | uniq | wc
      • wc: word count
      • grep [pattern] [file...]: print lines matching a pattern, a pattern is a regular expression
      • head/tail: print first/last part of files
      • tee: read from stdin and output to both stdout and files (so the stdout can go down the pipeline and the intermediate stage data can be captured by files). e.g. ls /usr/bin | tee ls.txt |grep zip
  • seeing the world as the shell sees it

    • expansion
      • pathname expansion: echo *, echo [[:upper:]]*, echo *d, echo /bin/*
      • tilde expansion: echo ~
      • arithmetic expansion: echo $((2*2)), echo five divided by two is $((5/2))
      • brace expansion
        • echo front-{a..f}-back
        • mkdir -p playground/dir-{001..100}, touch playground/dir-{001..100}/file-{A..Z} (create 2600 files!!)
      • parameter expansion: echo $USER, echo $PATH, echo $HOME, echo $SHELL, echo $PS1 (prompt string 1, see later)
      • command substitution
        • examples: echo $(ls), ls -l $(which cp), file $(ls -d /usr/bin/* | grep zip)
        • a similar application in docker: docker stop $(docker ps -a -q) (it stops all containers)
        • note: ls | echo is not equivalent to echo $(ls), because echo does not read from standard input; it only takes arguments. We can use xargs to convert the input: ls | xargs echo is equivalent to echo $(ls)
    • quoting
      • double quotes: ls -l "two words.txt", echo "$((2+2)) $PATH"
      • single quotes: echo '$((2+2)) $PATH' (single quotes suppress all expansion)
      • escaping characters: echo this is \$5.00 (so the expansion is suppressed)
  • advanced keyboard tricks

    • command line editing
    • completion: using tab
    • using bash history: !!, !number, !string, !?string
  • permissions

    • Linux is a multi-user system
    • owners, group members and everybody else
    • reading writing and executing
      • ls -l shows the information of files: the first ten characters are the file attributes (consist of the file type and the permission attrs)
      • chmod
      • umask: default permission mode
    • changing identities
      • su
        • start an interactive shell: su [-[l]] [user] (enter exit to log out)
        • execute a single command: su -c 'command'
      • sudo
      • chown, chgrp
    • exercising our privileges
    • changing your password
  • processes

    • viewing processes
      • ps, ps x, ps aux, top, jobs
    • controlling processes
      • bg, fg
      • kill, killall
      • shutdown

Week 2: Introduction to CLI Data Science

Lesson 1: Unix File Processing and Management

  • the environment

    • shell variables (data placed by bash), and environment variables (everything else)
    • printenv: print part or all of the environment
      • printenv | less
      • printenv USER
    • set: display both the shell and environment variables, as well as defined shell functions
    • alias
      • show all aliases defined in the shell
      • create an alias for a command
    • how is the environment established?
      • startup files: in /etc and the user's home directory, used to configure the environment
    • modifying the environment
      • which files should we modify?
      • text editors
    • export: export environment to subsequently executed programs
  • a gentle introduction to vi

    • quit and save: :q!, :w, ZZ, :wq
    • moving the cursor: 0, ^, $, w, W, b, B, 5G, G, hjkl
    • basic editing
      • appending text: a, A
      • opening a line: o, O
      • deleting text: x, 3x, dd, 5dd, dW, d$, d0, d^, dG, d20G
      • undo and redo: u, ctrl+r
      • copying (basically the same as deleting): yy, 5yy, yW, y$, y0, y^, yG, y20G
      • pasting: p, P, 5p
      • joining lines: J
    • search and replace
      • searching within a line: f followed by a character, e.g. fa moves the cursor to the next 'a'
      • searching the entire file: / (together with strings or regular expressions)
      • global search and replace
        • an example: :%s/Line/line/g
          • %: specify the range of this operation
          • s: specify the operation: substitution
          • /Line/line/: the search pattern and the replacement text
          • g: "global" option
    • editing multiple files
      • n, N, buffers
      • opening additional files for editing: :e
      • copying content from one file into another
      • inserting an entire file into another: :r
  • networking

    • examining and monitoring a network
      • ping: sends a special network packet called an ICMP ECHO_REQUEST to a specified host and waits for responses, to see whether any packets are lost
      • traceroute: trace all network "hops"
      • netstat: examine various network settings and statistics
    • transporting files over a network
      • ftp and lftp: transmit everything in clear text, not secure
      • wget: download content from both web and FTP sites
    • secure communication with remote hosts
      • ssh: log into a remote machine and execute commands there
      • scp and sftp
  • searching for files

    • locate: find files by name
      • e.g. locate bin/zip
      • use sudo updatedb to update the database for locate
    • find: find files the hard way
      • lots of tests we can use, such as file type, name, size, permission, user etc
        • e.g. find ~ -type f -name "*.jpg" -size +1M -perm 0600 -user me | wc -l
      • operators: describe the logical relationships between the tests
        • e.g. find ~ \( -type f -not -perm 0600 \) -or \( -type d -not -perm 0700 \)
      • predefined actions: doing something on the matching files
        • some actions: -delete, -ls, -print (default option), -quit
        • e.g. find ~ -type f -name '*.BAK' -delete
      • user-defined actions: -exec command '{}' ';' or -ok command '{}' ';' (which prompts for confirmation)
        • e.g. find ~ -type f -name 'foo*' -exec ls -l '{}' ';', find ~ -type f -name 'foo*' -ok ls -l '{}' ';'
      • improving efficiency
        • e.g. find ~ -type f -name 'foo*' -exec ls -l '{}' '+'
      • xargs: accept input from standard input and convert it into an argument list for a specific command
        • e.g. find ~ -type f -name 'foo*' | ls -l is wrong, because ls does not read file names from standard input, so we rewrite it as find ~ -type f -name 'foo*' | xargs ls -l
      • options: control the scope of the search
        • -depth, -maxdepth [levels], -mindepth [levels], -mount
  • customizing the prompt

    • anatomy of a prompt
      • echo $PS1: display this prompt string 1, we can change it to anything we like and add colors

Lesson 2: Unix Data Processing

  • archiving and backup

    • compressing files
      • gzip and gunzip
      • bzip2 and bunzip2
    • archiving files
      • tar
      • zip: mainly used to exchange files with Windows system
    • synchronizing files and directories
      • rsync [options] [source] [destination] (where source or dest could be either local files or remote files)
        • e.g. rsync -av /bin /etc /backup (here we backup /bin and /etc to /backup directory)
        • to make it simple, we can use an alias: alias backup='rsync -av /bin /etc /backup'
      • using rsync over a network
  • regular expressions

    • grep
      • function: output any lines that contain the specified regular expression
    • metacharacters and literals
    • any character: .
    • anchors: ^ (beginning of the line), $ (end of the line)
      • e.g. ^.zip
    • bracket expressions and character classes
      • e.g. [bg]zip
      • negation: [^bc]zip
      • traditional character ranges: [A-Za-z0-9], [a-d]zip
      • POSIX character classes: [:alnum:], [:word:], [:alpha:], [:lower:], [:upper:], etc
    • POSIX basic (BRE) vs extended regular expressions (ERE)
      • BRE metacharacters include ^ $ . [ ] *, all others are literals
      • additional metacharacters in ERE are ( ) { } ? + |
      • we need to use grep -E when using ERE
        • e.g. ls /bin | grep -Eh '^(bz|gz|zip)'
      • ? + * and {}: specify the number of times an element is matched
    • putting regular expressions to work
      • find with RE (using -regex option)
        • note the difference between find and grep: find gets the pathnames exactly matching the RE, while grep gets the lines containing the RE
        • e.g. find . -regex '.*[^-_./0-9A-Za-z]', it finds all "bad" pathnames
      • locate with RE
      • searching for text in less and vim
  • text processing

    • revisiting some old friends
      • cat
        • -A: show all characters including non-printing ones
        • -n: show line numbers
        • -s: suppress the output of blank lines
      • sort
        • e.g. sort and merge: sort file1 file2 file3 > final_sorted_file
        • some interesting options: -n, -k (sorted based on a key field), -t
      • uniq
        • works on sorted input (typically used after sort)
    • slicing and dicing
      • cut
        • -c (char_list), -f (field_list), -d (define delimiting character), --complement
        • e.g. cut -f 3 distro.txt | cut -c 7-10, cut -d ':' -f 1 /etc/passwd
      • paste: merge lines of files
      • join: join data from multiple files based on a shared key field
    • comparing text
      • comm
      • diff
        • -c (context format)
      • patch
        • accepts output from diff and applies changes to text files
        • using diff/patch is very efficient, especially when the original text files are large and the diff files are relatively small (especially useful for collaborative software development)
    • editing on the fly
      • tr: transliterate characters
        • e.g. echo "lowercase letters" | tr a-z A-Z, -s (delete repeated instances)
      • sed: stream editor, very powerful and complicated
        • e.g. echo "front" | sed '1s/front/back/g'
          • explanation: 1 is the line range (first line), s is the operation (substitution), front and back are the arguments of the operation, g means the substitution is global
        • address notation: specifies the range of the operation
          • line range, regular expressions, etc
          • e.g. sed -n '/SUSE/p' distros.txt (where /SUSE/ is the regular expression notating the address)
        • basic editing commands: specify the operation
          • output, append, delete, insert, print, quit, substitute, transliterate, etc
        • more complex editing: a sed script run with the -f option
      • aspell: interactive spelling checker
        • e.g. aspell check foo.txt
  • writing your first script

    • how to write a shell script
      • write a script
      • make the script executable
      • put the script somewhere the shell can find it
    • script file format
      • an example (saved as hello_world)
      #!/bin/bash
      # this is a script
      echo 'hello world!'
    • executable permissions
    • script file location
      • we can execute the previous program by ./hello_world
      • if we want to use hello_world to execute it, we need to add the current directory to $PATH, or move this script to one of the $PATH directories
    • more formatting tricks
      • long option names
        • e.g. ls -a vs ls --all
      • indentation and line-continuation: make scripts more readable by using indentation (using \ to continue a line)
  • awk tutorials

Lesson 3: Source Code Control with Git & GitHub

  • getting started (http://git-scm.com/book/en/v1/Getting-Started-Installing-Git)

    • about version control
      • local version control systems
      • centralized version control systems
      • distributed version control systems
    • a short history of Git
    • Git basics
      • snapshots, not differences
      • nearly every operation is local
      • Git has integrity
      • Git generally only adds data
        • we can easily undo operations and recover data
      • the three states
        • working directory, staging area, repository (git directory)
    • installing git
  • Git basics

    • getting a Git repository
      • initializing a repository in an existing directory
        • go to the project's directory and git init, git add
      • cloning an existing repository: git clone [url]
    • recording changes to the repository
      • git status, git diff
      • git commit
      • ignoring files: by editing .gitignore file
      • git rm, git mv
    • viewing the commit history
      • git log
        • git log -p: show the diff introduced in each commit
        • git log --pretty=format:"%h - %an, %ar : %s": formatting output
        • git log --since=2.weeks: limiting log output
    • undoing things
      • changing your last commit: git commit --amend
      • unstaging a staged file: git reset HEAD <file>
      • unmodifying a modified file: git checkout -- <file>
    • working with remotes
      • showing your remotes: git remote -v
      • adding remote repositories: git remote add [shortname] [url]
        • e.g. git remote add origin https://github.com/weiHelloWorld/python.git
      • fetching and pulling from your remotes: git fetch [remote-name]
      • pushing to your remotes: git push [remote-name] [branch-name]
        • e.g. git push origin master
      • removing and renaming remotes
    • tagging
    • tips and tricks
      • auto completion
      • Git aliases
  • Git branching

  • GitHub

Week 3: Introduction to IPython

Lesson 1: Working with the IPython Notebook

Lesson 2: Introduction to Python Programming

Lesson 3: Python Functions

Week 4: Introduction to Python Programming

Lesson 1: Python Data Structures

  • string

  • lists

    • note that list is mutable, while string is immutable
    • aliasing vs. copying a list
      • aliasing: b = a (both names refer to the same list object)
      • copying: b = a[:] (a new list; note this is a shallow copy, so nested objects are still shared; use copy.deepcopy for a true deep copy)
    • when a list is passed as an argument, the function receives a reference to the same list object, so the function may modify it in place
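    • a minimal sketch of the difference between aliasing and copying (the nested-list example at the end is added here for illustration):
      a = [1, 2, 3]
      b = a                  # alias: b and a refer to the same list object
      b.append(4)            # a is now [1, 2, 3, 4] as well
      
      c = a[:]               # a new list with the same elements
      c.append(5)            # a is unchanged
      
      import copy
      d = copy.deepcopy([[1, 2], [3, 4]])   # true deep copy for nested lists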
  • dictionaries

    • the order of items is unpredictable
    • global variable
      • to reassign a global variable inside a function, declare it global before using it, e.g.
      count = 0
      
      def example():
          global count # declares that count refers to the global variable within this function
          count += 1
    • hash function
  • tuples

    • tuples are immutable
    • variable-length argument tuples
      • the parameters should begin with '*', e.g.
      def printall(*args):
          print(args)
    • lists and tuples
      • zip
    • dictionaries and tuples
      • items
    • sequences of sequences

Lesson 2: Working with the Underlying File System

  • files
    • reading from text files
      • character encoding
      • open, read, seek, close
      • closing files automatically
        • try...finally approach: good
        • with approach: better! see this example:
        with open("temp.txt") as a_file:
            a_file.seek(17)
            a_character = a_file.read(1)
            print(a_character)
        # when this block ends, the with statement will close the file automatically, no matter what exception happens
      • read one line at a time
    • writing into text files
      • two modes: a, w
    • binary files
    • stream objects from non-file sources
      • handling string
      a = 'hello'
      import io
      b = io.StringIO(a)
      b.write('df') # modify the first two characters of this string
      b.seek(0) # go to the beginning of this string
      b.read()
      b.write('df') # now append 'df' to the end of this string
      b.seek(0) 
      b.read()
      • handling compressed files
    • standard input, output and error
      • sys.stdin, sys.stdout, sys.stderr: similar to those in Linux (see "redirection" in Week 1 Lesson 3), see example:
      import sys
      sys.stdout.write("abc")
      sys.stderr.write("dfsfsd")
      • redirecting standard output

Lesson 3: Advanced Concepts

  • comprehensions

    • working with files and directories
      • os, os.path module
    • list/dictionary/set comprehensions
      a = range(10)
      [element ** 2 for element in a if element > 5]
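    • a minimal sketch of dictionary and set comprehensions (the word list is made up):
      words = ['data', 'science', 'python']
      lengths = {w: len(w) for w in words}    # dict comprehension: {'data': 4, ...}
      initials = {w[0] for w in words}        # set comprehension: {'d', 's', 'p'}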
  • regular expressions

    • an example: matching phone numbers
    import re
    phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') # \D matches any non-numeric character, \D* is used as an optional separator
    phonePattern.search('(800)123-3453 ext 1234').groups() # returns the captured groups as a tuple of strings
  • classes and objects

  • classes and functions

  • classes and methods

    • some special methods: __init__, __str__, __doc__, __iter__, __next__ (the last two are for iterators)
    • operator overloading: e.g. __add__
    • type-based dispatch
    • polymorphic functions: functions that can work with several types
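    • a minimal sketch of a class using some of these special methods (the Point class is made up for illustration):
      class Point:
          def __init__(self, x, y):
              self.x, self.y = x, y
          def __str__(self):                  # used by print()
              return 'Point({0}, {1})'.format(self.x, self.y)
          def __add__(self, other):           # operator overloading: p1 + p2
              return Point(self.x + other.x, self.y + other.y)
      
      print(Point(1, 2) + Point(3, 4))        # Point(4, 6)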
  • inheritance

Week 5: Introduction to Visualization

Lesson 1: Introduction to Numpy

  • the basics

    • NumPy's main object is the homogeneous multidimensional array; it is an instance of the class numpy.ndarray (usually created with the numpy.array() function)
    • attributes of array: ndim, shape, size, dtype, itemsize, data
    • creating array: array([1,2,4]), zeros, linspace, arange, reshape, etc
    • basic operations
    • universal functions: sin, cos, exp, etc
    • indexing, slicing and iterating
      • boolean indexing (also very useful in pandas)
      import numpy
      A = numpy.arange(10)
      c = (A < 5) & (A > 1)
      A[c]
      • list-of-locations indexing
  • shape manipulation

    • changing the shape of an array
      • ravel(), transpose(), resize(), reshape()
    • stacking together different arrays
      • hstack, vstack, column_stack, row_stack
    • splitting one array into several smaller ones
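    • a minimal sketch of shape manipulation, stacking and splitting (shapes chosen arbitrarily):
      import numpy as np
      a = np.arange(12)
      b = a.reshape(3, 4)                  # reshape to a (3, 4) array
      print(b.ravel())                     # flatten back to 1-D
      print(b.T.shape)                     # transpose: (4, 3)
      print(np.hstack((b, b)).shape)       # stack horizontally: (3, 8)
      print(np.vstack((b, b)).shape)       # stack vertically: (6, 4)
      print(np.hsplit(b, 2))               # split into two (3, 2) arrays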
  • copies and views

    • no copy at all: same array object
      • b = a (we can think of b as another reference to the object referred to by a)
    • view or shallow copy: different array objects, share the same data
    • deep copy: different array objects and different copies of data
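    • a minimal sketch of the three cases (small array for illustration):
      import numpy as np
      a = np.arange(4)
      b = a               # no copy at all: b is a -> True
      v = a.view()        # shallow copy: new array object, shared data
      v[0] = 99           # a[0] is changed as well
      c = a.copy()        # deep copy: independent data
      c[1] = -1           # a is unaffected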

Lesson 2: Introduction to Python Plotting with Matplotlib

Lesson 3: Making Data Visualizations in Python

  • histograms

  • seaborn

    • matplotlib is useful but not always satisfying; seaborn provides an API on top of matplotlib that uses sane plot and color defaults, offers simple functions for common statistical plot types, and integrates with pandas DataFrames.
  • data looks better naked

Week 6: Introduction to Data Analysis in Python

Lesson 1: Data Access and Selection

Lesson 2: Data Manipulation and Analysis

  • some examples
    • Which borough has the most noise complaints (or, more selecting data)
      • boolean indexing
      • value_counts
    • Find out on which weekday people bike the most with groupby and aggregate
      • two arguments when parsing dates: parse_dates=['Date'], dayfirst=True, so that we can use the functionality related to dates and time
      • use of groupby and aggregate: weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
    • Combining dataframes and scraping Canadian weather data
      • clear null data: weather_mar2012.dropna(axis=1, how='any') (drop any columns that contain empty values; axis=1 means drop columns instead of rows)
      • drop data: weather_mar2012.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
    • String Operations- Which month was the snowiest
      • conversion and resampling of regular time-series data using resample
        • weather_2012['Temp (C)'].resample('M', how=np.median).plot(kind='bar')
        • is_snowing.astype(float).resample('M', how=np.mean).plot(kind = 'bar')
    • Cleaning up messy data
    • reference

Lesson 3: Summary Statistics

  • think stats (http://www.greenteapress.com/thinkstats/thinkstats.pdf)

  • simple statistics with scipy

    • introduction
      • scipy is built on top of Numpy and therefore we can use all array manipulation and indexing methods provided by Numpy.
    • descriptive statistics
      • generate random numbers from a standard Gaussian: sp.randn(100)
      • min, max, median, var, etc.
    • probability distributions
      • Scipy has functions that deal with at least 81 common probability distributions.
      • e.g. n = stats.norm(loc=3.5, scale=2.0) (here loc is mean, scale is standard deviation)
    • probability density function (PDF) and probability mass function (PMF)
    • cumulative density function (CDF)
    • Percent point function (PPF) or inverse cumulative function
    • Survival function (SF)
    • Inverse survival function (ISF)
    • Random variates
      • we can draw values from a distribution using rvs.
      • e.g. sp.stats.poisson.rvs(1.0, size = 100) # 100 random values from a Poisson distribution with mu = 1.0
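    • a minimal sketch tying these functions together (the normal distribution parameters follow the example above):
      import scipy.stats as stats
      n = stats.norm(loc=3.5, scale=2.0)
      print(n.pdf(3.5))       # probability density at the mean
      print(n.cdf(3.5))       # P(X <= 3.5) = 0.5
      print(n.ppf(0.5))       # percent point (inverse CDF): 3.5
      print(n.sf(3.5))        # survival function: 1 - CDF = 0.5
      print(n.rvs(size=5))    # 5 random variates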
    • reference

Week 7: Introduction to Statistical Analysis

Lesson 1: Rules of Probability Theory

  • Bayes theorem

    • diachronic interpretation: probability estimates get updated as new evidence comes in: prior probability -> posterior probability (probability changes over time)
  • computational statistics

  • estimation

    • the locomotive problem
      • description: "A railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has."
      • the estimate depends both on the prior (a uniform prior from 1 to 1000, or a power-law prior) and on the amount of data (in this problem we have only one data point; more data would make the estimate more convincing and less dependent on the prior you choose)
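      • a minimal sketch of the Bayesian update (assuming the uniform prior from 1 to 1000 mentioned above):
        import numpy as np
        hypos = np.arange(1, 1001)                       # possible fleet sizes N
        prior = np.ones(len(hypos))                      # uniform prior
        like = np.where(hypos >= 60, 1.0 / hypos, 0.0)   # P(seeing locomotive #60 | N)
        post = prior * like
        post /= post.sum()                               # normalize to get the posterior
        print((hypos * post).sum())                      # posterior mean, roughly 333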
    • credible intervals
    • informative prior vs. uninformative prior
  • reference

Lesson 2: Statistical Decision Making

  • hypothesis testing

    • introduction
      • when we see some "apparent effects" in a data set, the first question to ask is whether they are real or simply occurred by chance
      • general structure to test statistical significance (similar to proof by contradiction)
        • null hypothesis
        • p-value
        • interpretation
    • testing a difference in means
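      • a minimal sketch using a permutation test, one common resampling approach (the data here is synthetic):
        import numpy as np
        np.random.seed(0)
        a = np.random.normal(0.0, 1.0, 100)
        b = np.random.normal(0.3, 1.0, 100)
        observed = abs(a.mean() - b.mean())
        pooled = np.concatenate([a, b])
        count = 0
        for _ in range(10000):
            np.random.shuffle(pooled)
            diff = abs(pooled[:100].mean() - pooled[100:].mean())
            if diff >= observed:
                count += 1
        print('p-value:', count / 10000)   # small value -> difference unlikely to be chance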
    • choosing a threshold
      • Type I error (false positive), and Type II error (false negative)
      • choose a threshold $\alpha$ and reject the null hypothesis when the p-value is less than $\alpha$
      • the false positive rate in this case is $\alpha$. By decreasing $\alpha$ we decrease false positives but increase false negatives, so there is a tradeoff between Type I and Type II errors
    • defining the effect
      • two-sided test and one-sided test
    • interpreting the result
      • three ways of interpretation
        • classical: simply compare p-value and $\alpha$ and get the result
        • practical: report the p-value without apology and let readers interpret the data themselves
        • Bayesian: let $H_0$ and $H_A$ be the hypotheses that the effect is not real and that the effect is real. The evidence (data set) is E; we want to calculate $P(H_A | E) = \frac{P(E | H_A) P(H_A)}{P(E)}$ based on some prior probability assumption (e.g. we can assume $P(H_A) = P(H_0) = 0.5$)
    • cross-validation
      • use one set of data to formulate the hypothesis, and a different set of data to test it
    • chi-square test
    • efficient resampling
    • power: The probability that a test will reject the null hypothesis if it is false
  • estimation

    • estimation of some parameters of a specific distribution
    • confidence intervals
    • Bayesian estimation
  • correlation

    • standard scores
    • covariance
    • correlation
      • Pearson’s correlation: decide whether two variables have linear dependence
    • scatter plots in python
    • Spearman’s rank correlation
    • least squares fit
    • goodness of fit ($R^2$)
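    • a minimal sketch of these quantities with scipy/numpy (synthetic data):
      import numpy as np
      from scipy import stats
      np.random.seed(1)
      x = np.random.randn(200)
      y = 2 * x + np.random.randn(200)                     # roughly linear relationship
      print(stats.pearsonr(x, y))                          # Pearson correlation and p-value
      print(stats.spearmanr(x, y))                         # Spearman rank correlation
      slope, intercept, r, p, se = stats.linregress(x, y)  # least squares fit
      print(slope, intercept, r ** 2)                      # r**2 is the goodness of fit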
  • reference

Lesson 3: Linear Modeling

  • basic linear regression plotting
%matplotlib inline

import numpy as np, pandas as pd, seaborn as sns, matplotlib as mpl, matplotlib.pyplot as plt
import statsmodels

np.random.seed(sum(map(ord, "linear_quantitative")))
tips = sns.load_dataset("tips")

sns.lmplot("total_bill", "tip", tips, ci = 100); # two parts, one is scatter plot, the other is the regression line
sns.lmplot("size", "tip", tips, x_jitter = .15); # add some jitter to improve the plot
sns.lmplot("size", "tip", tips, x_estimator = np.mean) 
bins = [10, 20, 30, 40]
sns.lmplot("total_bill", "tip", tips, x_bins=bins)
  • Faceted linear model plots
sns.lmplot("total_bill", "tip", tips, hue="smoker", ci = 40, markers = ["x", "o"]); # plot in the same graph
sns.lmplot("total_bill", "tip", tips, hue="smoker", ci = 40, col = "smoker"); # plot in the different graphs

g = sns.lmplot("total_bill", "tip", tips, hue="day", palette="Set2",
               hue_order=["Thur", "Fri", "Sat", "Sun"]) # here lmplot() returns a grid object for further use
g.set_axis_labels("Total bill (US Dollars)", "Tip");
g.set(xticks=[10, 30, 50], ylim=(0, 10), yticks=[0, 2.5, 5, 7.5, 10]);
  • plot different linear relationships
sns.lmplot("total_bill", "tip", tips, hue="time", palette="Set1", fit_reg=False);
sns.lmplot("size", "total_bill", tips, order = 2); # nonlinear plot
  • Plotting logistic regression
tips["big_tip"] = (tips["tip"] / tips["total_bill"]) > .15
sns.lmplot("total_bill", "big_tip", tips, y_jitter=.05, logistic = True); # logistic regression (see http://en.wikipedia.org/wiki/Logistic_regression)
  • Plotting data with outliers
sns.lmplot("total_bill", "tip", tips, robust=True, n_boot=500); # with robust option
  • Plotting simple regression with regplot()

    • regplot() is a lower-level function than lmplot() and gives you more control; anywhere you use lmplot(), you can use the lower-level regplot() instead
  • Examining model residuals using residplot()

  • Plotting marginal distributions using jointplot()

sns.jointplot("total_bill", "tip", tips, kind="reg", color="seagreen"); # plot regression
sns.jointplot("total_bill", "tip", tips, kind="resid", color="#774499"); # plot residue, similar to residue plotting

Week 8: Introduction to Time Series Data

Lesson 1: Introduction to Time Series Data with Pandas

  • basic operations
import pandas as pd, numpy as np

rng = pd.date_range('1/1/2011', periods = 72, freq = "H") # create a range of dates
ts = pd.Series(np.random.randn(len(rng)), index=rng)
converted = ts.asfreq('45Min', method = "pad") # change the frequency
ts.resample('D', how = 'mean') # resample data
  • Time Stamps (time points) vs. Time Spans (time periods)
from pandas import *

dates = [datetime(2012, 5, 1), datetime(2012, 5, 2), datetime(2012, 5, 3)] # three time points
periods = PeriodIndex([Period('2012-01'), Period('2012-02'), Period('2012-03')]) # three time periods (each one spans one month)
  • Converting to Timestamps

    • to_datetime (with the dayfirst option)
    • invalid data
    • epoch timestamps
      • note the default unit is nanosecond
  • Generating Ranges of Timestamps

    • date_range (default is calendar day) and bdate_range (default is business day)
    index = date_range('2000-1-1', periods=1000, freq='M')
    
    start = datetime(2011,1,1)
    end = datetime(2012,1,1)
    rng = date_range(start, end, freq = "W")
  • DatetimeIndex

    • One of the main uses for DatetimeIndex is as an index for pandas objects.
    • DatetimeIndex Partial String Indexing
      • suppose ts is a Series indexed by a DatetimeIndex; then we can use a partial string such as ts['2011'], ts['2011-6'], or ts['2011-6':'2011-8'] to extract data
      • see reference for more partial string indexing
    • Truncating & Fancy Indexing
    • Time/Date Components
  • DateOffset objects

    • "DateOffset objects" represent frequency increments, such as month, business day, one hours, etc
    • key features
      • added to a datetime object to obtain a shifted date
      • multiplied by an integer so that the increment will be applied multiple times
      • has rollforward and rollback methods for moving a date forward or backward to the next or previous "offset date"
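      • a minimal sketch of these features (the dates are arbitrary):
        import pandas as pd
        from pandas.tseries.offsets import BDay
        d = pd.Timestamp('2011-01-01')       # a Saturday
        print(d + BDay())                    # shifted to the next business day
        print(d + 3 * BDay())                # increment applied three times
        print(BDay().rollforward(d))         # roll forward to the next business day
        print(BDay().rollback(d))            # roll back to the previous business day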
    • Parametric offsets
    • Custom Business Days
    • Offset Aliases, Combining Aliases, Anchored Offsets, Legacy Aliases
    • Holidays / Holiday Calendars
  • Time series-related instance methods

    • shifting the values in a TimeSeries in time
      • shift: it can accept a freq argument or a DateOffset class
    • Frequency conversion
      • asfreq
    • Filling forward / backward
      • fillna can fill the NA data associated with asfreq or reindex
    • Converting to Python datetimes
  • Resampling

    • e.g. ts.resample('D', how = 'mean') # resample data
    • how methods: sum, mean, std, sem, max, min, median, first, last, ohlc
    • downsampling option: closed (set to left or right)
    • upsampling option: fill_method, limit
  • Time Span Representation

    • introduction
      • TimeSpan is quite similar to TimeStamp mentioned before. We have Period instead of datetime, period_range instead of date_range.
    • Period
    • PeriodIndex and period_range
    • PeriodIndex Partial String Indexing
    • Frequency Conversion and Resampling with PeriodIndex
      • in time span, we can set how option to start or end of the period when doing frequency conversion
  • Converting between Representations

    • to_period and to_timestamp
  • Representing out-of-bounds spans

  • Time Zone Handling

  • reference

Lesson 2: Introduction to Time Series Data with Seaborn

Lesson 3: Introduction to Twitter Data Mining

  • Exploring Twitter's API

    • Fundamental Twitter Terminology
    • Creating a Twitter API Connection
    import twitter
    
    CONSUMER_KEY = 'LItYWbMkoEt6KRbznvoCQcP5i'
    CONSUMER_SECRET = 'NVhC9F6W77iO99Ut1aJDX7ut0DP8GqHyVFdtEWpRboFlpcvqnv'
    OAUTH_TOKEN = '2870997399-kNPx3rODRpF1YwmxQm9AQ4pIArfoHHgU7M8jR8Y'
    OAUTH_TOKEN_SECRET = 'sh6aEqDKoDJ2biDWtXXH4PCpiR4IBhKvOFLEAEWDRUym1'
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
    twitter_api = twitter.Twitter(auth=auth)
    
    print(twitter_api)
    • Exploring Trending Topics
      • get trending topics in the world and the US
      # The Yahoo! Where On Earth ID for the entire world is 1. See https://dev.twitter.com/docs/api/1.1/get/trends/place and http://developer.yahoo.com/geo/geoplanet/
      WORLD_WOE_ID = 1
      US_WOE_ID = 23424977
      # Prefix ID with the underscore for query string parameterization.
      # Without the underscore, the twitter package appends the ID value
      # to the URL itself as a special case keyword argument.
      world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
      us_trends = twitter_api.trends.place(_id=US_WOE_ID)
      
      import json # use json package to pretty print it
      print(json.dumps(world_trends, indent=4), "\n", json.dumps(us_trends, indent=4))
      • get common trends
      # find the intersection of two sets (world trend and US trend)
      world_trends_set = set([trend['name'] for trend in world_trends[0]['trends']])
      us_trends_set = set([trend['name'] for trend in us_trends[0]['trends']])
      common_trends = world_trends_set.intersection(us_trends_set)
      print(common_trends)
    • Searching for Tweets
    import twitter
    
    q = '#illini' # how to use query operators to build a query? see https://dev.twitter.com/rest/public/search 
    count = 10 # count is The number of tweets to return per page
    # See https://dev.twitter.com/docs/api/1.1/get/search/tweets
    search_results = twitter_api.search.tweets(q = q, count = count) # search_results contains two fields: "statuses" and "search_metadata" 
    statuses = search_results['statuses']
    # Iterate through 5 more batches of results by following the cursor
    for _ in range(5):
        print("Length of statuses" , len(statuses))
        try:
            next_results = search_results['search_metadata']['next_results'] # get information related to next page
        except KeyError: # No more results when next_results doesn't exist
            break
        # Create a dictionary from next_results, which has the following form:
        # ?max_id=313519052523986943&q=NCAA&include_entities=1
        kwargs = dict([kv.split('=') for kv in next_results[1:].split("&")]) # get the arguments to search the next page
        search_results = twitter_api.search.tweets(**kwargs) # here double stars unpack the values in dict into arguments of the function
        statuses += search_results['statuses']
        
    print(json.dumps(statuses[0], indent=4))
  • Analyzing the 140 Characters

    • Extracting Tweet Entities
      • The entities in the text of a tweet are conveniently processed and available through t['entities']
      q = "illini"
      count = 3
      search_results = twitter_api.search.tweets(q = q, count = count) # search_results contains two fields: "statuses" and "search_metadata"
      first_status = search_results['statuses'][0] # get the first status
      print("keys = ", first_status.keys()) # show the structure of the first tweet
      
      print("\n\ntext = ", first_status["text"], "\n\nentities = ", first_status['entities'])
    • Analyzing Tweets and Tweet Entities with Frequency Analysis
      • we use collections.Counter to analyze the frequency
    • Computing the Lexical Diversity of Tweets
      • definition of lexical diversity: the number of unique tokens in the text divided by the total number of tokens in the text, a function is as follows
      # A function for computing lexical diversity
      def lexical_diversity(tokens):
          return 1.0*len(set(tokens))/len(tokens) 
    • Examining Patterns in Retweets
    • Visualizing Frequency Data with Histograms
  • introduction to JSON

    • JSON is a lightweight, human-readable text-based open standard data-interchange format
    • JSON vs XML
    • Typical uses of JSON
      • API
      • NoSQL
      • AJAX
      • Package Management
    • JSON in Python
      • json.loads(json_obj), json.dumps(json_obj, indent = 4), json.load(), json.dump()
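      • a minimal sketch (the dictionary is made up):
        import json
        record = {'name': 'illini', 'tags': ['data', 'science'], 'count': 3}
        text = json.dumps(record, indent=4)      # serialize to a JSON string
        restored = json.loads(text)              # parse the JSON string back
        print(restored['tags'])
        with open('record.json', 'w') as fout:   # dump()/load() work with file objects
            json.dump(record, fout)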
  • reference

Week 9: Introduction to Data Processing

Lesson 1: Data Formats

  • introduction

  • formatted text

    • fixed-width format
    # First we define our format specification codes
    hfmt = "{0:5s}{1:29s}{2:27s}{3:6s}{4:10s}{5:12s}{6:10s}\n" 
    fmt = "{0:5s}{1:29s}{2:30s}{3:3s}{4:4s}{5:14.8f}{6:14.8f}\n" # {0:5s} means the 0th element is 5-character string, etc
    
    # We need to treat the first row special since it is the header row
    flag = True
    
    # Now open file and write out airports.
    with open('fixed-width.txt', 'w') as fout:
        for row in airports:
            # We output first line special since it is a header row.
            if flag:
                fout.write(hfmt.format(row[0], row[1], row[2], row[3], row[4], row[5], row[6]))
                flag = False
            else:
                fout.write(fmt.format(row[0], row[1], row[2], row[3], row[4], float(row[5]), float(row[6])))
      • a trick to measure the width of each column: print "1234567890" multiple times as a ruler
    
    • Delimiter Separated Values
      • using csv module with delimiter option to read and write data
      fout = csv.writer(csvfile, delimiter='|')
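      • a minimal, self-contained sketch of writing and reading '|'-delimited data (the file name and rows are made up):
      import csv
      rows = [['iata', 'airport', 'city'], ['CMI', 'Willard', 'Champaign']]
      with open('airports.psv', 'w', newline='') as csvfile:
          fout = csv.writer(csvfile, delimiter='|')
          fout.writerows(rows)
      with open('airports.psv') as csvfile:
          for row in csv.reader(csvfile, delimiter='|'):
              print(row)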
  • JSON (see week 8)

  • XML

    • introduction
      • XML is a simple, self-describing, structured, text-based data format. XML is based on the concept of elements, which can have attributes and values. Elements can be nested, which can indicate parent-child relationships or a form of containerization
    • some examples: HTML5, SVG (Scalable Vector Graphics)
    • creating an XML file
    import html 
    import xml.etree.ElementTree as ET
    
    data = '<?xml version="1.0"?>\n' + '<airports>\n'
    for airport in airports[1:]:
        data += '    <airport name="{0}">\n'.format(html.escape(airport[1]))
        data += '        <iata>' + str(airport[0]) + '</iata>\n'
        data += '        <city>' + str(airport[2]) + '</city>\n'
        data += '        <state>' + str(airport[3]) + '</state>\n'
        data += '        <country>' + str(airport[4]) + '</country>\n'
        data += '        <latitude>' + str(airport[5]) + '</latitude>\n'
        data += '        <longitude>' + str(airport[6]) + '</longitude>\n'
        data += '    </airport>\n'
    
    data += '</airports>\n'
    tree = ET.ElementTree(ET.fromstring(data))
    
    with open('data.xml', 'w') as fout:
        tree.write(fout, encoding='unicode')
    • parsing an XML file
    data = [["iata", "airport", "city", "state", "country", "lat", "long"]]
    
    tree = ET.parse('data.xml') # parsing result return a ElementTree object, more info here: https://docs.python.org/3.4/library/xml.etree.elementtree.html
    root = tree.getroot() 
    
    for airport in root.findall('airport'):
        row = []
        row.append(airport[0].text)
        row.append(airport.attrib['name']) # extract attribute "name"
        row.append(airport[1].text)
        row.append(airport[2].text)
        row.append(airport[3].text)
        row.append(airport[4].text)
        row.append(airport[5].text)
    
        data.append(row)
        
    print(data[:5])
  • HDF (Hierarchical Data Format)

    • HDF is a data format that is designed to efficiently handle large data sets that might be difficult to persist by using either database systems, XML documents, or other custom-defined user formats.
  • reference

Lesson 2: Data Parsing

  • introduction to data parsing

    • two techniques used to parse a structured file (like XML)
      • Simple API for XML (SAX): event driven parser that reads and processes each part of XML file sequentially
      • Document Object Model (DOM): reads and parses the entire document
  • introduction to BeautifulSoup

    • quick start
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    <p class="story">...</p>
    """
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc) # making the soup from string
    
    print(soup.findAll("p")) # get all paragraphs
    print(soup.find("p")) # find the first paragraphs
    
    for link in soup.findAll("a"):
        print(link.get("href")) # get all urls
    • making the soup
    • kinds of objects
      • tag
      • name (each tag has a name)
      • attributes
      • multi-valued attributes
      • navigableString
    • comments and other special strings
    • Navigating the tree
      • going down
        • Navigating using tag names
        • .contents and .children
        • .descendants: iterate over all of a tag's descendants (children, grandchildren, and so on)
        • .string
        • .strings and stripped_strings
      • going up
        • .parent vs .parents
      • going sideways
        • .next_sibling and .previous_sibling
        • .next_siblings and .previous_siblings
      • going back and forth
    • searching the tree
      • Kinds of filters
        • the filters could be a string, a regular expression, a list, True, or a function
      • find_all()
        • shortcut: calling a tag is like calling find_all()
        soup("a") # equivalent to soup.find_all("a")
        soup(text = True) # equivalent to soup.find_all(text = True)
        • the name argument
        • the keyword arguments
        • searching by CSS class
        • the text argument: search for strings instead of tags
        • the recursive argument: set to False to consider only the direct children
        • the limit argument: set the max number of results
      • other search functions: find(), find_parents(), find_parent(), etc
      • CSS selectors
    • modifying the tree
    • output
      • pretty-printing and non-pretty printing
      • output formatters
    • specifying the parser to use
    • encodings
      • BeautifulSoup automatically converts documents to Unicode
    • parsing only part of a document
  • reference

Lesson 3: Working with Data

Week 10: Introduction to Data Persistence

Lesson 1: Relational Databases

  • introduction to data persistence

    • data persistence using basic file input/output in Python
      • method: directly write and read data in string form into/out of a file
      • not optimal, for the following reasons
        • not convenient to convert some types of data into strings
        • costly in terms of storage space
        • relies completely on the underlying file system for consistency and durability
    • pickling
      • using pickle package and binary reading/writing mode
      import numpy as np
      import pickle
      
      # writing
      data = np.random.rand(100)
      
      with open('test.p', 'wb') as fout:
          pickle.dump(data, fout)
      
      # reading
      with open('test.p', 'rb') as fin:
          newData = pickle.load(fin)
      
      print(newData[0:20:4])
      • While easier than custom read/write routines, pickling still requires the file system to provide support for concurrency, consistency, and durability. To go any further with data persistence, we need to move beyond Python language constructs and employ additional software tools.
    • database systems
      • classification
        • Relational Database Management Systems
          • rely on a tabular data model
          • e.g. MySQL, PostgreSQL, etc
        • NoSQL systems
          • do not rely on the tabular data model
          • many were developed by companies such as Google and Facebook to meet big-data challenges
          • e.g. Dynamo, ZopeDB, MongoDB, etc
      • database roles
        • database administrator, database developer, database application developer
      • The ACID Test
      • SQLite
        • SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world.
        • reference
  • reference

Lesson 2: Using SQL for Schema Manipulation

  • introduction to SQL

    • the basics of relational database systems
      • relational database hold data of different types
      • style convention: all SQL commands are presented entirely in uppercase, and item names use camelCase.
      • Related tables are often grouped together into a schema
    • brief history of SQL
      • initially, different kinds of database systems had different APIs; SQL was developed as a standard language to access and manipulate them (though there are still somewhat different versions of SQL)
    • two main components in SQL
      • Data Definition Language (DDL): used to create, modify, or delete items (such as tables) in a database
      • Data Manipulation Language (DML): used to add, modify, delete, or select data from a table in the database
    • SQL data types
      • While the SQL standard defines basic data types, different database systems can support the standard to varying degrees. SQLite supports: NULL, INTEGER, REAL, TEXT, BLOB
    • create table
      • syntax: CREATE TABLE tableName ( { <columnDefinition> | <tableLevelConstraint> } [, { <columnDefinition> | <tableLevelConstraint> } ]* );
      • explanation
        • "|" means "either/or"
        • content between "[" and "]" is optional
        • "*" indicates that multiple enclosing items can be included
    • drop table
      • DROP TABLE tableName ;
      • an example of creating and dropping tables
      %%writefile create.sql
      -- First we drop any tables if they exist
      -- Ignore the no such Table error if present
        
      DROP TABLE myVendors ;
      DROP TABLE myProducts ;
          
      -- Vendor Table: Could contain full vendor contact information.
              
      CREATE TABLE myVendors (
          itemNumber INT NOT NULL,
          vendornumber INT NOT NULL,
          vendorName TEXT
      ) ;
         
      -- Product Table: Could include additional data like quantity
             
      CREATE TABLE myProducts (
          itemNumber INT NOT NULL,
          price REAL,
          stockDate TEXT,
          description TEXT
      ) ;
    • SQL script
      • we can write SQL commands into a script file for execution. this technique is useful for debugging and command reuse.
    • reference
  • SQL tutorials

    • basic statements
      • SELECT
      • DISTINCT
      • WHERE
      • AND/OR
      • ORDER BY
      • INSERT INTO
      • UPDATE...SET
      • DELETE
      • SELECT TOP
      • LIKE (together with SQL wildcard characters)
      • IN, BETWEEN
      • SELECT ... AS (aliases)
      • INNER JOIN, LEFT/RIGHT JOIN, FULL OUTER JOIN
      • SELECT INTO (create a new table), INSERT INTO ... SELECT (insert into an existing table)
      • CREATE DATABASE
      • CREATE TABLE (specifying column_names, data_types and optional_constraints)
        • some constraints: NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, CHECK, DEFAULT, AUTO INCREMENT
      • CREATE INDEX
      • DROP INDEX/TABLE/DATABASE
      • ALTER TABLE
      • CREATE VIEW
      • GROUP BY
      • Date functions
      • Null functions
    • general data types & data types for various databases
    • SQL injection
      • a technique where malicious users can inject SQL commands into an SQL statement, via web page input.
    • SQL functions
      • aggregate functions
        • AVG(), COUNT(), FIRST(), LAST(), MAX(), MIN(), SUM()
      • scalar functions
        • UCASE(), LCASE(), MID(), LEN(), ROUND(), NOW(), FORMAT()
    • reference
  • SQLite tutorials

Lesson 3: Using SQL for Data Manipulation

Week 11: Advanced Data Persistence

Lesson 1: Python Database Programming

  • introduction

    • previously, we used a database client tool to operate on a database; now we are going to interact with databases from a Python program. We will mainly focus on SQLite; for other databases, refer to the related API.
  • using SQLite

    • to work with a database within a Python program, we follow three steps
      • establish a connection to a database. In most cases we need to connect to a remote database server through a network connection; with SQLite, we can work locally
      • obtain a cursor from the database connection
      • execute SQL commands by using the database cursor
    • useful methods
      • execute(), executemany(), executescript()
      • fetchone(), fetchmany(), fetchall()
    • example
    import sqlite3 as sl  # needed for sl.connect() below
    
    # We define our Create Table SQL command
    createSQL = '''
    CREATE TABLE myProducts (
        itemNumber INT NOT NULL,
        price REAL,
        stockDate TEXT,
        description TEXT);
    '''
    
    # Tuple containing data values to insert into our database
    items = ((1,19.95,'2015-03-31','Hooded sweatshirt'), 
             (2,99.99,'2015-03-29','Beach umbrella'),
             (3,0.99,'2015-02-28', None),
             (4,29.95,'2015-02-10','Male bathing suit, blue'),
             (5,49.95,'2015-02-20','Female bathing suit, one piece, aqua'),
             (6,9.95,'2015-01-15','Child sand toy set'),
             (7,24.95,'2014-12-20','White beach towel'),
             (8,32.95,'2014-12-22','Blue-striped beach towel'),
             (9,12.95,'2015-03-12','Flip-flop'),
             (10,34.95,'2015-01-24','Open-toed sandal'))
    
    # user-defined function
    def myYear(date):
        return int(date[:4])
    
    # Open a database connection, here we use an in memory DB
    
    with sl.connect(":memory:") as con: # ':memory:' indicates that our database will be temporary and maintained in the program's memory space
    
        # Now we obtain our cursor
        cur = con.cursor()   
        
        # First we create the table
        cur.execute(createSQL)
        
        # Now populate the table using all items
        cur.executemany("INSERT INTO myProducts VALUES(?, ?, ?, ?)", items) 
        # two kinds of placeholders for holding value:
        # 1. put ? as a placeholder wherever you want to use a value
        # 2. use named placeholder starting with ":" like ":name", 
        # e.g. cur.execute("INSERT INTO myProducts VALUES(:id, :price, :sdate, :desc)", {"id" : item[0], "price" : item[1], "sdate" : item[2], "desc" : item[3]})
        
        con.create_function("fYear", 1, myYear) # user-defined function
        
        for row in cur.execute('SELECT fYear(stockDate) FROM myProducts'): # apply function in a query
            print(row)
  • reference

Lesson 2: Advanced Python Database Programming

  • Introduction to Pandas & Databases

    • read_sql(): Read SQL query or database table into a DataFrame, it is a convenient wrapper around read_sql_table() and read_sql_query() depending on the provided input
    import sqlite3 as sl, pandas as pd
    
    query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 100 ;"
    
    with sl.connect(database) as con:
        data = pd.read_sql(query, con, index_col ='code')
        print(data[data.state == 'MS'])
    • to_sql(): Write records stored in a DataFrame to a SQL database.
    query = "SELECT code, airport, city, state, latitude, longitude FROM airports ;"
    with sl.connect(database) as con:
        data = pd.read_sql(query, con)
    
        data[data.state == 'IL'].to_sql('ILAirports', con)
  • comparison between pandas and SQL

    • SELECT: indexing in pandas
    • WHERE: boolean indexing and NULL check
    • GROUP BY: groupby() function
    • JOIN: join() or merge()
    • UNION: concat()
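    • a minimal sketch of the correspondence (tiny made-up DataFrame):
      import pandas as pd
      df = pd.DataFrame({'state': ['IL', 'IL', 'MS'], 'delay': [5, 15, 10]})
      print(df[['state', 'delay']])               # SELECT state, delay FROM df
      print(df[df.state == 'IL'])                 # ... WHERE state = 'IL'
      print(df.groupby('state')['delay'].mean())  # ... GROUP BY state, with AVG(delay)
      print(pd.concat([df, df]))                  # UNION ALL of df with itself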
  • reference

Lesson 3: NoSQL Data Storage

Week 12: Data Exploration

Lesson 1: Data Preparation

  • introduction to data preparation

    • before we can explore and analyze a data set, we probably need to do some preparation work, because many data sets contain lots of dirty data and are not ready to use, or they may be in a form that would consume lots of system resources.
    • example
      • main idea
        • previously we have already constructed a database from the flights data in 2001.csv.
        • If we want to load all the data, we will probably run out of memory.
        • Even if we only want the two columns named arrivalDelay and departureDelay, we will still use too much memory. The main reason is that the pandas DataFrame has to store the data as objects when there are "NA" values, which consumes much more memory than the int64 format.
        • so in order to save memory, we drop "NA" before reading into a DataFrame by using
        query = '''SELECT arrivalDelay, departureDelay 
            FROM flights
            WHERE arrivalDelay != 'NA' AND
                departureDelay != 'NA' ; '''
        
        with sl.connect(database) as con:
            data = pd.read_sql(query, con)
        • to further save memory, we examine the data by data.describe() and find that it is enough to use int16 format, so we convert data into this format by using
        import numpy as np
        data[['arrivalDelay']] = data[['arrivalDelay']].astype(np.int16)
        data[['departureDelay']] = data[['departureDelay']].astype(np.int16)
        • also we can read data directly from original csv file instead of the database by using
        newdata = pd.read_csv('/notebooks/i2ds/data/2001.csv', dtype=np.float, header=0, na_values=['NA'], usecols=(14, 15)) # use na_values to define NA data
        newdata = newdata.dropna() # drop NA data in the dataFrame
        • we can save the data in hdf format by using to_hdf() function
      • details here: http://nbviewer.ipython.org/github/INFO490/spring2015/blob/master/week12/intro2de.ipynb
  • pandas reading and writing HDF files
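    • a minimal sketch (the file name and key are made up; requires the PyTables package):
      import numpy as np
      import pandas as pd
      data = pd.DataFrame({'arrivalDelay': np.random.randint(-10, 60, 100).astype(np.int16)})
      data.to_hdf('delays.h5', key='flights')            # write the DataFrame to an HDF5 file
      restored = pd.read_hdf('delays.h5', 'flights')     # read it back
      print(restored.head())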

Lesson 2: Visual Exploration

Lesson 3: Statistical Exploration

Week 13: Introduction to Machine Learning

Lesson 1: Supervised Learning

Lesson 2: Unsupervised Learning

  • introduction to dimension reduction

    • introduction
      • for a large, multi-dimensional data set, one approach to simplify subsequent analysis is to reduce the number of dimensions that must be processed. In some cases, dimensions can be removed from analysis based on business logic. More generally, however, we can employ machine learning to seek out relationships between the original dimensions to identify new dimensions that better capture the inherent relationships within the data.
      • The standard technique to perform this is known as principal component analysis (PCA). Mathematically, we can derive PCA by using linear algebra to solve a set of linear equations. This process effectively rotates the data into a new set of dimensions, and by ranking the importance of the new dimensions, we can actually leverage fewer dimensions in machine learning algorithms.
    • reference
  • dimension reduction with Principal Component Analysis

    • basic idea
      • introduction example
        • use sklearn.decomposition.PCA to fit a dataset and find the components with maximum variance (principal components)
        • keeping 95% of the variance, the data is reduced by one dimension and its size is compressed by 50%
      • Application of PCA to Digits
        • every digit figure has 64 pixels, which means 64 dimensions.
        • one way to see the components is to think of them as individual pixels, but in that case we cannot get a good approximation with a small number of components
        • instead, we find the best set of basis vectors to approximate the digit images using machine learning techniques, so that we can get a very good approximation with fewer than 6 components; the dimensionality is reduced from 64 to 6 or less
        • this is some kind of lossy data compression
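        • a minimal sketch of this idea with scikit-learn (keeping 6 components as described above):
          from sklearn.datasets import load_digits
          from sklearn.decomposition import PCA
          digits = load_digits()                      # 1797 images, 64 pixels each
          pca = PCA(n_components=6)
          reduced = pca.fit_transform(digits.data)    # project 64 dimensions down to 6
          print(reduced.shape)                        # (1797, 6)
          print(pca.explained_variance_ratio_.sum())  # fraction of variance kept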
    • reference
  • Decomposing signals in components (matrix factorization problems)

    • principal component analysis
      • exact PCA
      • incremental PCA
        • do not require to load all data into memory, allow partial computations
      • approximate PCA
        • greatly reduce time complexity
      • kernel PCA
        • an extension of PCA which achieves non-linear dimensionality reduction through the use of kernels
      • SparsePCA and MiniBatchSparsePCA
    • Truncated singular value decomposition and latent semantic analysis
    • Dictionary Learning
    • factor analysis
    • independent component analysis (ICA)
    • Non-negative matrix factorization (NMF or NNMF)
    • reference

Lesson 3: Clustering

Week 14: Introduction to Cloud Computing

Lesson 1: Introduction to Cloud Computing

  • Introduction to Cloud Computing with Docker

    • docker overview
      • docker daemon process
      • docker images
      • docker containers
        • a docker container can run in either detached mode (in the background) or foreground mode (attached to your console)
    • docker volumes
      • mount a volume from host to docker container
      sudo docker tag sequenceiq/hadoop-ubuntu:2.6.0 had # add a tag to simplify typing
      sudo docker run -it -v /data:/file had /bin/bash # mount /data in host to /file in the container
    • linking containers
    • advanced docker commands
      • cp: used to copy data into a running Docker container from the host operating system.
      • history: displays the history of a Docker image.
      • info: displays system-wide docker information.
      • restart: used to restart a stopped container.
      • rm: remove docker container, use -f flag to force removal
      • rmi: remove a Docker image, use the -f flag to force removal
      • search: search the Docker official registry for specific Docker images.
      • stats: used to monitor the system resources used by a running container.
      • stop: used to stop a running Docker container.
      • tag: used to add tags, like a new, human-readable name, to an image or container.
      • top: used to monitor usage of a running container.
    • docker resource usage
    • docker cleanup
      • sudo docker rm $(sudo docker ps -a -q)
    • reference
  • introduction to cloud computing by Amazon

    • how does AWS work?
      • get access to servers, storage, databases, apps over the internet
    • advantages
      • Trade capital expense for variable expense
      • Benefit from massive economies of scale
      • Stop guessing capacity
      • Increase speed and agility
      • Stop spending money on running and maintaining data centers
      • Go global in minutes
    • type of cloud computing
      • Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS).
    • reference

Lesson 2: Running Containers in the Cloud

Lesson 3: Introduction to Hadoop

  • introduction to Hadoop

Week 15: Introduction to Python High Performance Computing

Lesson: Introduction to Python High Performance Computing

  • Optimizing Python Performance
    • introduction
      • warning: one should not worry about optimization until it has been shown to be necessary
    • related Python modules
      • threading module
        • we can use threading.Thread() to create a Thread object with a specific name and a function to execute and then start it
      • multiprocessing module
        • the standard Python interpreter only allows one thread to execute Python code at a time; this is called the Global Interpreter Lock (GIL). One way to circumvent it is to use multiple Python interpreters, each running in its own process
        • these processes all share the same Python code (which itself creates the other processes), so how do we avoid an infinite loop of process creation? By guarding the creation with if __name__ == '__main__', so that only the __main__ process creates the other processes (see the sketch below)
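        • a minimal sketch of both modules (the worker function is made up):
          import threading, multiprocessing
          
          def worker(n):
              print('working on', n)
          
          # threads share one interpreter (and are subject to the GIL)
          t = threading.Thread(name='t0', target=worker, args=(1,))
          t.start()
          t.join()
          
          # processes each run their own interpreter; guard against re-execution
          if __name__ == '__main__':
              p = multiprocessing.Process(target=worker, args=(2,))
              p.start()
              p.join()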
    • IPython cluster
    • third-party Python tools
      • Numba
      • PYPY
      • Cython
    • reference
