mjs2600 / ML-Final-Exam-Study-Notes


Content Overview/Important Concepts

More detailed info and reading can be found in the repo subdirectories

Wiki Study Guide

http://wiki.omscs.org/confluence/display/CS7641ML/CS7641.FA14.+Final+exam+prep

Unsupervised Learning

Unsupervised learning (UL) consists of algorithms that are meant to "explore" unlabeled data on their own and provide the user with valuable information about their dataset/problem.

  • Randomized Optimization

  • Clustering

    • Single Linkage
      • Define the distance between two clusters as the distance between their two closest points
      • Method (see the sketch below):
        1. Consider each of the n points its own cluster
        2. Merge the two closest clusters
        3. Unless we have k clusters, GOTO 2 (n - k merges in total)
      • Links points that are closest to each other
      • Can result in "stringy", non-compact clusters
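
A minimal sketch of single linkage, assuming SciPy's hierarchical clustering utilities; the library choice and toy data are illustrative, not part of the notes:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # toy blob 1
               rng.normal(5, 0.5, (20, 2))])  # toy blob 2

Z = linkage(X, method="single")                  # repeatedly merge the two closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # stop the merging at k = 2 clusters
print(labels)
```
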
    • K-Means
      • Each iteration is polynomial
      • Finitely many (at worst exponential) iterations in theory, but usually far fewer in practice
      • Always converges, but can get stuck with "weird" clusters depending on the random starting state
      • Method (see the sketch below):
        1. Place k centers
        2. Each center claims its closest points
        3. Compute the mean of each center's claimed points
        4. Move each center to that mean
        5. Unless converged, GOTO 2
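
A hedged NumPy sketch of the loop above; the function name and toy setup are assumptions for illustration:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # 1. place k centers
    for _ in range(iters):
        # 2. each center claims its closest points
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # 3-4. move each center to the mean of its claimed points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):          # 5. converged, so stop
            break
        centers = new_centers
    return centers, labels
```

Because the result depends on the random starting centers, random restarts are a common way around the "weird clusters" problem.
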
    • Expectation Maximization
      • e.g. Gaussian means: soft clustering where each point has a probability of belonging to each cluster
      • Alternates expectation (E) and maximization (M) steps (see the sketch below)
      • Likelihood is monotonically non-decreasing
      • Not guaranteed to converge in theory (but practically it does)
      • Can get stuck in local optima
      • Works with any distribution (not just Gaussian)
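
A minimal sketch of EM soft clustering, assuming scikit-learn's GaussianMixture (which runs the E and M steps internally); the data and parameters are made up for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100).fit(X)  # alternates E and M steps
soft = gmm.predict_proba(X)  # E-step output: P(cluster | point), i.e. soft assignments
print(gmm.means_.ravel())
print(soft[:3])
```
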
    • Properties of Clustering Algorithms (Pick 2)
      • Richness
      • Scale Invariance
      • Consistency
    • Richness
      • For any assignment of objects to clusters, there is some distance matrix, D, such that P_D returns that clustering
    • Scale-Invariance
      • Scaling distances by a positive value does not change the clustering
    • Consistency
      • Shrinking intra-cluster distances and expanding intercluster distances does not change the clustering.
    • No clustering scheme can achieve all three of Richness, Scale-Invariance, and Consistency (Kleinberg's impossibility theorem; see the formal statement below)
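
A compact formalization of the three properties, paraphrased in the P_D notation used above (this is an interpretation of the bullets, not the exact lecture statement):

```latex
% P maps a distance matrix D over the points to a clustering P_D.
\textbf{Richness: }         \forall \text{ clusterings } C,\ \exists D \ \text{such that}\ P_D = C
\textbf{Scale-invariance: } \forall D,\ \forall k > 0,\ P_{kD} = P_D
\textbf{Consistency: }      \text{if } D' \text{ shrinks intra-cluster and expands inter-cluster distances of } P_D,\ \text{then } P_{D'} = P_D
```
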
  • Feature Selection

    • Filtering
      • Choose features independently of the learner, i.e. "filter" the data before it is passed to the learner (see the sketch below)
      • Faster than wrapping (don't have to pay the cost of the learner)
      • Tends to ignore relationships between features
      • Decision Trees do this naturally (Filter on information gain)
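
A hedged sketch of filtering, assuming scikit-learn's SelectKBest with a mutual-information score; the criterion and dataset are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

# Score each feature on its own (no learner involved) and keep the top 4.
X_filtered = SelectKBest(mutual_info_classif, k=4).fit_transform(X, y)
print(X_filtered.shape)  # the learner only ever sees the 4 highest-scoring features
```
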
    • Wrapping
      • "Wrap" the learner into the feature selection. Choose features based on how the learner performs.
      • Takes into account learner bias
      • Good at determining feature relationships (as they pertain to the success of the learner)
      • Very slow (the learner has to be run for every candidate feature subset)
      • Speed Ups: forward search, backward search, randomized search (e.g. hill climbing)
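
A hedged sketch of wrapping via greedy forward selection; the learner (a decision tree) and the number of features chosen are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
learner = DecisionTreeClassifier(random_state=0)
chosen, remaining = [], list(range(X.shape[1]))

for _ in range(3):  # greedily add 3 features
    # Score each candidate subset by actually running the learner (cross-validated).
    scores = {f: cross_val_score(learner, X[:, chosen + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)  # the feature whose addition helps the learner most
    chosen.append(best)
    remaining.remove(best)
print(chosen)
```
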
    • Relevance
      • x_i is strongly relevant if removing it degrades the Bayes' Optimal Classifier
      • x_i is weakly relevant if
        • it is not strongly relevant, and
        • there exists a subset of features S such that adding x_i to S improves the Bayes' Optimal Classifier
      • x_i is otherwise irrelevant
    • Relevance vs. Usefulness
      • Relevance measures the effect the variable has on the Bayes' Optimal Classifier
      • Usefulness measures the effect the variable has on the error of a particular predictor (ANN, DT, etc.)
  • Feature Transformation

    • Polysemy: Same word, different meaning - False Positives
    • Synonymy: Different word, same meaning - False Negatives
    • PCA: Good Slides
      • Example of an eigenproblem
      • Finds direction (eigenvectors) of maximum variance
      • All principal components (eigenvectors) are mutually orthogonal
      • Reconstructing data from the principal components is proven to have the least possible L2 (squared) error of any other linear projection onto the same number of dimensions
      • Eigenvalues are monotonically non-increasing and are proportional to variance along each principal component (eigenvector). Eigenvalue of 0 implies zero variance which means the corresponding principal component is irrelevant
      • Finds "globally" varying features (image brightness, saturation, etc.)
      • Fast algorithms available
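
A minimal PCA sketch, assuming scikit-learn's PCA on made-up correlated data, showing the non-increasing eigenvalues, orthogonal components, and reconstruction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated toy data

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)              # project onto the directions of maximum variance
X_hat = pca.inverse_transform(Z)  # reconstruction with minimal squared (L2) error
print(pca.explained_variance_)    # eigenvalues, in non-increasing order
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # orthogonal components
```
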
    • ICA
      • Finds new features that are completely independent (from each other). i.e. they share no mutual information
      • Attempts to maximize the mutual information between the original and transformed data. This allows original data to be reconstructed fairly easily from the transformed data.
      • Blind Source Separation (Cocktail Party Problem)
      • Finds "locally" varying features (image edges, facial features)
    • RCA
      • Generates random directions
      • It works! If you want to use it to preprocess classification data...
        • Is able to capture correlations between data, but in order for this to be true, you must often reduce to a larger number of components than with PCA or ICA.
      • Can't really reconstruct the original data well.
      • Biggest advantage is speed.
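
A minimal sketch of randomized projection, assuming scikit-learn's GaussianRandomProjection; the component count and data are illustrative:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))  # high-dimensional toy data

rp = GaussianRandomProjection(n_components=20, random_state=0)  # random directions
X_small = rp.fit_transform(X)    # very fast: no eigendecomposition required
print(X_small.shape)             # (500, 20)
```
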
    • LDA
      • Requires data labels
      • Finds projections that discriminate based on the labels. i.e. separates data based on class.
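
A minimal LDA sketch, assuming scikit-learn's LinearDiscriminantAnalysis on synthetic labeled data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=4, random_state=0)

# Unlike PCA/ICA/RCA, the labels y are required; at most (n_classes - 1) projections exist.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_lda.shape)  # (200, 2)
```
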
  • Information Theory

Reinforcement Learning

Reinforcement Learning: A Survey

Put an agent into a world (make sure you can describe it with an MDP!), give it some rewards and penalties, and hopefully it will learn.

  • Markov Decision Processes

    • Building a MDP
      • States
        • MDP should contain all states that an agent could be in.
      • Actions
        • All actions an agent can perform. Sometimes this is a function of state, but more often it is a list of actions that could be performed in any state
      • Transitions (model)
        • Probability that the agent will arrive in a new state, given that it takes a certain action in its current state: P(s'|s, a)
      • Rewards
        • Easiest to think about as a function of state (i.e. when the agent is in a state it receives a reward). However, it is often a function of a [s, a] tuple or a [s, a, s'] tuple.
      • Policy
        • A list that contains the action that should be taken by the agent in each state.
        • The optimal policy is the policy that maximizes the agent's long term expected reward.
    • Utility
      • The utility of a state is the reward at that state plus all the (discounted) reward that will be received from that state to infinity.
      • Accounts for delayed reward
      • Described by the Bellman Equation:
        • U(s) = R(s) + gamma * max_a Σ_s' T(s, a, s') U(s')
    • Value Iteration
      • "Solve" the Bellman Equation iteratively (repeatedly apply the update until the utilities converge, more like hill climbing than a closed-form solve); see the sketch below
      • Once the utilities converge, the policy that yields them is found in a straightforward manner: take the argmax action in each state
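
A hedged value-iteration sketch on a tiny made-up MDP; the states, rewards, and transition model are invented for illustration:

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = np.array([0.0, 0.0, 1.0, -1.0])                               # reward per state

U = np.zeros(n_states)
for _ in range(1000):
    Q = R[:, None] + gamma * T @ U        # Q[s, a] = R(s) + gamma * sum_s' T(s, a, s') U(s')
    U_new = Q.max(axis=1)                 # Bellman update
    if np.max(np.abs(U_new - U)) < 1e-8:  # iterate until the utilities converge
        break
    U = U_new
policy = Q.argmax(axis=1)                 # read the policy off the converged utilities
print(U, policy)
```
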
    • Policy Iteration
      • Start with random (or not) initial policy.
      • Evaluate the utility of that policy.
      • Update the policy (in a hill climbing-ish way): at each state, switch to the action that maximizes expected utility under the current utilities (see the sketch after this list)
    • Discount Factor, gamma (between 0 and 1), describes the value placed on future reward. The higher gamma is, the more emphasis is placed on future reward.
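
A hedged policy-iteration sketch on the same kind of made-up MDP: evaluate the current policy exactly, then greedily improve it, stopping when the policy is stable:

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = np.array([0.0, 0.0, 1.0, -1.0])

policy = np.zeros(n_states, dtype=int)  # start from an arbitrary initial policy
while True:
    T_pi = T[np.arange(n_states), policy]                     # transitions under the policy
    U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)   # policy evaluation
    new_policy = (R[:, None] + gamma * T @ U).argmax(axis=1)  # policy improvement
    if np.array_equal(new_policy, policy):                    # stable policy: done
        break
    policy = new_policy
print(policy, U)
```
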
  • Model-Based vs. Model-Free

    • Model-Based requires knowledge of transition probabilities and rewards
      • Policy Iteration
      • Value Iteration
    • Model-Free gets thrown into the world and learns what to do on its own from experienced "[s, a, s', r]" tuples, without being given the transition and reward model up front.
      • Q Learning
  • Three types of RL

    • Policy Search - learn the policy directly (direct use, indirect learning)
    • Value function based - learn the utilities, then take the argmax to get a policy
    • Model based - learn T and R, then solve the Bellman Equation and take the argmax (indirect use, direct learning)
  • Q Learning

    • Q Function is a modification of the Bellman Equation
      • Q(s, a) = R(s) + gamma * Σ_s' T(s, a, s') max_a' Q(s', a')
      • U(s) = max_a Q(s, a)
      • Pi(s) = argmax_a Q(s, a)
    • Learning Rate, alpha, controls how far each update moves the Q value toward the new estimate.
    • If each action is executed in each state an infinite number of times on an infinite run and alpha is decayed appropriately, the Q values will converge with probability 1 to Q*
    • Exploration vs Exploitation
      • Epsilon Greedy Exploration
        • Take a random action with some (decaying) probability epsilon, otherwise act greedily - similar in spirit to Simulated Annealing
      • Optimistic initial Q values can also serve as a form of exploration (see the sketch below)
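
A hedged tabular Q-learning sketch with epsilon-greedy exploration and optimistic initial Q values, on a made-up 5-state chain; the environment and hyperparameters are assumptions:

```python
import numpy as np

n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.1
Q = np.ones((n_states, n_actions))  # optimistic initial Q values encourage exploration
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:  # an episode ends at the goal state
        # epsilon-greedy: random action with probability eps, otherwise greedy
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # move Q(s, a) a fraction alpha toward the temporal-difference target
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # greedy policy: states 0-3 should step right (action 1)
```

Alpha is held constant here for simplicity; the convergence guarantee above additionally requires alpha to decay appropriately.
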
  • Game Theory

    • Zero Sum Games
      • A mathematical representation of a situation in which each participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility of the other participant(s).
    • Perfect Information Game
      • All agents know the states of other agents
      • minimax == maximin
    • Hidden Information Game
      • Some information regarding the state of a given agent is not known by the other agent(s)
      • minimax != maximin
    • Pure Strategies: each player deterministically plays a single strategy
    • Mixed Strategies: each player chooses a probability distribution over strategies
    • Nash Equilibrium
      • No player has anything to gain by changing only their own strategy.
    • Repeated Game Strategies (see the sketch after this list)
      • Finding the best response against a repeated-game finite-state strategy is the same as solving an MDP
      • Tit-for-tat
        • Cooperate in the first game, then copy the opponent's previous move in every game thereafter
      • Grim Trigger
        • Cooperates until opponent defects, then defects forever
      • Pavlov
        • Cooperate if you and your opponent made the same move last round, defect otherwise
        • Only strategy shown that is subgame perfect
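
A hedged iterated prisoner's dilemma sketch comparing the strategies above; the payoff matrix and round count are standard illustrative choices, not taken from the notes:

```python
C, D = 0, 1  # cooperate, defect
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

def tit_for_tat(my_hist, opp_hist):
    return C if not opp_hist else opp_hist[-1]      # copy the opponent's previous move

def grim_trigger(my_hist, opp_hist):
    return D if D in opp_hist else C                # defect forever after one defection

def pavlov(my_hist, opp_hist):
    if not my_hist:
        return C
    return C if my_hist[-1] == opp_hist[-1] else D  # cooperate if we agreed last round

def play(p1, p2, rounds=20):
    h1, h2, score = [], [], [0, 0]
    for _ in range(rounds):
        a1, a2 = p1(h1, h2), p2(h2, h1)
        r1, r2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2)
        score[0] += r1; score[1] += r2
    return score

print(play(tit_for_tat, grim_trigger))  # mutual cooperation: 3 points per round each
print(play(pavlov, tit_for_tat))
```
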
    • Folk Theorem: Any feasible payoff profile that strictly dominates the minmax/security level profile can be realized as a Nash equilibrium payoff profile, with sufficiently large discount factor.
      • In repeated games, the possibility of retaliation opens the door for cooperation.
      • Feasible Region
        • The region of possible average payoffs for some joint strategy
      • MinMax Profile
        • A pair of payoffs (one for each player), that represent the payoffs that can be achieved by a player defending itself from a malicious adversary.
      • Subgame Perfect
        • Always best response independent of history
      • Plausible Threats
    • Zero Sum Stochastic Games
      • Value Iteration works!
      • Minimax-Q converges
      • Unique solution to Q*
      • Policies can be computed independently
      • Update efficient
      • Q functions sufficient to specify policy
    • General Sum Stochastic Games
      • Value Iteration doesn't work
      • Minimax-Q doesn't converge
      • No unique solution to Q*
      • Policies cannot be computed independently
      • Update not efficient
      • Q functions not sufficient to specify policy
