
Data Science - MasterNotes



# Comprehension
[num for num in range(100)]

# Comprehension with condition
[num for num in range(100) if num > 5]
[char for char in expression if char in "()"]  # assumes `expression` is a string

def factorial(n):
    prod = 1
    for num in range(1, n + 1):  # start at 1; starting at 0 would zero out the product
        prod *= num
    return prod
matrix = [[2, 1, 5],
          [9, 2, 8],
          [1, 7, 3]]

# Transpose the matrix by zipping its rows together
for row in zip(matrix[0], matrix[1], matrix[2]):
    print(row)

# (2, 9, 1)
# (1, 2, 7)
# (5, 8, 3)

# Equivalently, unpack the rows with *
list(zip(*matrix))  # [(2, 9, 1), (1, 2, 7), (5, 8, 3)]

# To get lists instead of tuples, either of:
[[*tup] for tup in zip(*matrix)]
[list(tup) for tup in zip(*matrix)]
# [[2, 9, 1], [1, 2, 7], [5, 8, 3]]


def mean(lst, trim=0):
    # trim drops the `trim` smallest and largest values before averaging
    lst_ = lst.copy()
    if trim > 0:
        lst_ = sorted(lst_)[trim:-trim]
    return sum(lst_) / len(lst_)

def median(lst):
    lst_sorted = sorted(lst)
    mid = len(lst) // 2
    if len(lst) % 2:  # odd length: the middle element
        return lst_sorted[mid]
    else:             # even length: mean of the two middle elements
        return mean([lst_sorted[mid - 1], lst_sorted[mid]])

def mode(lst):
    # count occurrences of each item
    dict_counter = {}
    for item in lst:
        if item in dict_counter:
            dict_counter[item] += 1
        else:
            dict_counter[item] = 1
    max_freq = max(dict_counter.values())
    modes = [item for item, freq in dict_counter.items() if freq == max_freq]

    # if every item is "a mode", the collection has no meaningful mode
    if len(modes) == len(lst):
        return None
    else:
        return modes
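
A quick usage check of the three helpers above (the values here are illustrative):

data = [2, 3, 3, 5, 7, 9, 9, 9, 11]

print(mean(data))          # 6.444...
print(mean(data, trim=1))  # 6.428..., after dropping the lowest and highest values
print(median(data))        # 7
print(mode(data))          # [9]
print(mode([1, 2, 3]))     # None: every value occurs equally often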

  • A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
  • As such, measures of central tendency are sometimes called measures of central location.
  • They are also classed as summary statistics.

The Median is Resistant to Outliers

  • The primary difference between the mean and the median is their level of resistance to outliers.
  • The mean is not very resistant to outliers, especially in a dataset with non-symmetric outliers.

If a collection has extreme outliers, the mean may describe the distribution "center" inaccurately. A classic example of this is when looking at household incomes. Households with far greater incomes skew the mean to the point where it no longer accurately describes the dataset.

Example

Consider the incomes of the following ten households. By calculating both the mean and median, it is possible to make a determination as to which of these two statistics describes the incomes most accurately.

$$ A = [\quad\$30{,}000,\quad\$35{,}000,\quad\$41{,}000,\quad\$45{,}000,\quad\$50{,}000,\quad\$57{,}000,\quad\$57{,}500,\quad\$59{,}000,\quad\$60{,}000,\quad\$457{,}000\quad] $$ $$ \text{mean} = \mu = \$89{,}150 \qquad \text{median} = \tilde x = \$53{,}500 $$

Solution The mean of the household incomes is $89,150 and the median is $53,500. Here the median does a better job of describing a typical household income from the collection. The mean is greatly skewed by a single income that is far greater than the others. The mean implies that a typical household would have over $89,000 of income, despite there being only one household with an income greater than $60,000.
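
Checking this with the mean and median helpers defined above:

incomes = [30000, 35000, 41000, 45000, 50000,
           57000, 57500, 59000, 60000, 457000]

print(mean(incomes))    # 89150.0
print(median(incomes))  # 53500.0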

The Mean is Preferable in Large Datasets with Few Outliers

  • There are some situations where the mean is considered preferable to the median; typically these are situations in which there are a large number of items in the collection and no outliers (or the outliers are symmetric).
  • Also, inferential statistics is largely built upon measurements of the mean, so it is the statistic used most often.

Mode is Preferable When Using Categorical Data

  • In a collection with categorical data that is (generally) not ordinal in nature, the mode is the best measure of center, though the use of the term "center" may be taking a bit of liberty.

  • The mode can also be a useful descriptive statistic when there isn't one single central concentration of values.

A common example of this would be the weights of household pets. If one were to take a sample of house pet weights, there would likely be a concentration of cats weighing between eight and twelve pounds, and a concentration of dogs weighing between twenty and thirty-five pounds. The mean or median may tell us that a typical household pet weighs fifteen pounds, but that doesn't accurately describe the typical weight of either cats or dogs. A distribution such as this is often referred to as bimodal.

| Type of Variable | Best Measure of Central Tendency |
|---|---|
| Nominal | Mode |
| Ordinal | Median |
| Interval/Ratio (not skewed) | Mean |
| Interval/Ratio (skewed) | Median |

Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed.



Five Number Summary

The five number summary gives a more in-depth description of a numerical collection of values. In addition to identifying a measure of center (the median), it gives us more insight into the way the values are distributed. It consists of the five most important sample percentiles:

  • The minimum
  • The lower (first) quartile: $Q_1$
  • The median
  • The upper (third) quartile $Q_3$
  • The maximum

The values are often expressed in a tuple, as follows

$ (\quad \min,\quad Q_1,\quad \text{median},\quad Q_3,\quad \max\quad) $


def five_number_summary(lst):
    sorted_list = sorted(lst)
    # for odd-length lists, the middle value is included in both halves
    lower_half = sorted_list[0: len(lst) // 2 + (len(lst) % 2)]
    upper_half = sorted_list[len(lst) // 2:]

    q1 = median(lower_half)
    q3 = median(upper_half)

    return min(lst), q1, median(lst), q3, max(lst)

def iqr(lst):
    # interquartile range: the spread of the middle 50% of the data
    _, q1, _, q3, _ = five_number_summary(lst)
    return q3 - q1

def detect_outliers(lst, outlier_coef=1.5):
    # flags values more than outlier_coef * IQR outside [q1, q3]
    outliers = []
    _, q1, _, q3, _ = five_number_summary(lst)
    iqr_ = iqr(lst)

    for num in lst:
        if num < q1 - outlier_coef * iqr_ or num > q3 + outlier_coef * iqr_:
            outliers.append(num)

    return outliers
    
a = [-500,12,32,54,45,87,89,61,31,12549] 

print(detect_outliers(a,1.5)) # [-500, 12549]

def remove_outliers(lst, outlier_coef=1.5):
    outliers = detect_outliers(lst, outlier_coef)
    output = lst.copy()
    
    for num in outliers:
        if num in output:
            output.remove(num)
            
    return output

a =  [590, 615, 575, 608, 350, 1285, 408, 540, 555, 679]
print(remove_outliers(a)) # [590, 615, 575, 608, 540, 555, 679]

  • The purpose of both the variance and the standard deviation is to express an easily interpretable measure of spread in a collection.

  • The variance can be interpreted as the average squared deviation of each number from the mean, and it is calculated as such.

  • The reason we square the deviations is so that we deal only with positive values; if we didn't square them, the deviations above and below the mean would cancel and sum to zero for every distribution.

  • Population Variance

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

  • Sample Variance

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline x)^2 $$

  • Recall

    • $\mu$ : population mean
    • $\overline x$ : sample mean
  • You can see the two formulas for variance are very similar; the primary difference is that the population variance is averaged by dividing by $n$.
  • In the computation of a sample variance, we divide by $n-1$ instead. This is known as Bessel's correction.
  • This correction is made because it partially corrects the bias in the estimation of the population variance.

Example 1

Find the variance of the following population $A$ assume all measurements are in inches:
$$ A = [\quad 73,\quad 65,\quad72,\quad74,\quad69,\quad70,\quad72,\quad73\quad] $$

Step 1 : Find the mean of $A$

$$ \mu = \frac{73+65+72+74+69+70+72+73}{8} = 71 $$

Step 2 : Find the sum of the squared differences.

$$ \sum_{i=1}^{8} (x_i - \mu)^2 \quad = \quad (73-71)^2 + (65-71)^2 + \dots + (73-71)^2 \quad = \quad 60 $$

Step 3 : Divide the sum above by $n$ (or multiply by $\frac{1}{n}$)

$$ \sigma^2 = \frac{60}{8} = 7.5 $$

Solution We can see here that our population variance is $7.5 \text{ in}^2$. It is important to note that a variance will always be expressed in terms of the original unit squared. This leaves something to be desired in terms of interpretability; we'll discuss that in the second half of this lesson when dealing with standard deviations.


Example 2

Calculate the variance for the same numerical collection above, this time assuming it is a sample, call the sample dataset $B$.

$$ B = [\quad 73,\quad 65,\quad72,\quad74,\quad69,\quad70,\quad72,\quad73\quad] $$

Step 1 : Find the mean of $B$

$$ \bar x = \frac{73+65+72+74+69+70+72+73}{8} = 71 $$

Step 2 : Find the sum of the squared differences.

$$ \sum_{i=1}^{8} (x_i - \bar x)^2 \quad = \quad (73-71)^2 + (65-71)^2 + \dots + (73-71)^2 \quad = \quad 60 $$

Step 3 : Divide the sum above by $n-1$ (or multiply by $\frac{1}{n-1}$)

$$ s^2 = \frac{60}{7} = 8.571 $$

Solution The variance of the sample dataset $B$ is $8.571$, larger than the population's variance.

A note about the application of Bessel's correction:

The difference in the variances between the sample and the population is a byproduct of applying Bessel's correction. In short, when one finds the variance of a population, all possible outliers are sure to be included. In contrast, when sampling from a population there is a chance that few (or no!) outliers end up in the sample, so the sample's variance will likely be smaller than the true variance of the population. Since the objective is to make inferences about a population from a sample, applying Bessel's correction makes the variance computed from a sample more likely to be representative of the population.
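
A small simulation sketch of this effect, using the variance function defined just below (the population, sample size, and trial count here are arbitrary choices):

from random import seed, choices

seed(42)
population = list(range(100))  # population variance of 0..99 is 833.25
n, trials = 10, 10_000

biased_total = corrected_total = 0.0
for _ in range(trials):
    s = choices(population, k=n)               # sample with replacement
    biased_total += variance(s, sample=False)  # divide by n
    corrected_total += variance(s)             # divide by n-1 (Bessel)

print(biased_total / trials)     # systematically undershoots 833.25
print(corrected_total / trials)  # close to 833.25 on average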



def variance(lst, sample=True):
    # sample=True applies Bessel's correction: the boolean is treated as 1,
    # so the denominator is n-1 for samples and n for populations
    mean_ = mean(lst)
    total = 0
    for item in lst:
        total += (item - mean_)**2
    return total / (len(lst) - sample)
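
Checking the worked examples above against this function:

A = [73, 65, 72, 74, 69, 70, 72, 73]

print(variance(A, sample=False))  # 7.5, dividing by n (Example 1)
print(variance(A))                # 8.571..., dividing by n-1 (Example 2)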

Standard Deviation

  • As we mentioned above, the variance does a good job of describing the spread of a population or sample.

  • However, a measure of spread expressed in the original units squared can be difficult to interpret.

  • Because of this, we typically take the square root of the variance; this yields the standard deviation.

  • The standard deviation ends up being in the same units as the original data.

  • A standard deviation can be informally interpreted as: "a typical item from this collection can be expected to have the value of the mean plus or minus the standard deviation."

  • This is formally defined by the empirical rule, or the 68–95–99.7 rule; we won't go into great detail about this rule now, but it will be covered later in the statistics block.


Notations :

$\sigma\quad :$ lowercase sigma is used for the standard deviation of a population

$s\quad :$ lowercase $s$ is typically used to represent the standard deviation of a sample

$sd\quad :$ the lowercase pair $sd$ is also commonly used for either standard deviation



  • Population Standard Deviation:

$$ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2} $$

  • Sample Standard Deviation:

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline x)^2} $$

from math import sqrt

def stdev(lst, sample=True):
    return sqrt(variance(lst, sample))
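
For the population in Example 1, the standard deviation comes back in the original units (inches) rather than inches squared:

A = [73, 65, 72, 74, 69, 70, 72, 73]

print(stdev(A, sample=False))  # 2.7386... = sqrt(7.5)
print(stdev(A))                # 2.9277... = sqrt(8.571...)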


  • In mathematics, a set is a well-defined collection of objects.
  • A set is also an object in itself.
  • Sets contain only unique objects: NO DUPLICATES.
  • If the outcome of a random experiment is unknown, but all of the possible outcomes are predictable in nature, the set of those outcomes is known as the Sample Space, notated with a capital $S$, as the "Universal Set" $U$, or as $\Omega$ (capital omega).
def dedupe_in_order(lst):
    deduped = []

    for element in lst:
        if element not in deduped:
            deduped.append(element)

    return deduped

  • The union of two sets is a new set that contains all of the elements that are in at least one of the two sets.

  • Common Notation for the union of events A and B:

  • A ∪ B

  • There is a distinct relationship between the set theory definition of union, and the logical operator OR.

# sets are represented as plain lists here so the mechanics stay explicit
def union(set1, set2):
    set_union = set1.copy()
    for item in set2:
        if item not in set_union:
            set_union.append(item)
    return set_union

The union can be extrapolated to more than two events

  • Common Notation multiple events:

  • A ∪ B ∪ C

  • A ∪ B ∪ C ∪ D

    • NOTE: The order of the union operation does not matter
def union_mult_sets(*mult_sets):
    set_union = []
    for lst in mult_sets:
        for item in lst:
            if item not in set_union:
                set_union.append(item)
    return set_union

  • The intersection of two sets is a new set that contains all of the elements that are members of both sets which comprise the intersection
  • Common Notation for the intersection of events A and B:
  • AB or A ∩ B
  • There is a distinct relationship between the set theory definition of intersection, and the logical operator AND.
def intersection(a,b):
    intersected = []
    for item in a:
        if item in b:
            intersected.append(item)
    return intersected
def intersection_mult(*mult_sets):
    set_intersect = []
    if mult_sets:  # with no sets given, the intersection is empty
        for item in mult_sets[0]:
            is_member = True
            for set_ in mult_sets[1:]:
                if item not in set_:
                    is_member = False
                    break
            if is_member:
                set_intersect.append(item)
    return set_intersect

  • Set Difference is anything in one set that isn't in the other.
    • Syntax: A\B, A-B, A.difference(B)

    • Example: A = {1, 2, 3, 4, 5}, B = {5, 6, 7, 8, 9}
      A - B = {1, 2, 3, 4}
      B - A = {6, 7, 8, 9}

def difference(set1, set2):
    set_difference = []
    for item in set1:
        if item not in set2:
            set_difference.append(item)
    return set_difference

  • The complement of a set is the set of all members of the sample space which are not in the event.
  • Common notation for the complement of event A:
  • $A'$, $A^c$, $\bar A$, $\neg A$, or $\sim A$
  • There is a distinct relationship between the complement and the logical operator NOT.
def complement(sample_space, set1):
    return difference(sample_space, set1)
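
The list-based helpers above mirror Python's built-in set type; a quick side-by-side (the values are illustrative):

S = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # sample space
A = [1, 2, 3, 4, 5]
B = [5, 6, 7, 8, 9]

print(union(A, B))         # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(intersection(A, B))  # [5]
print(difference(A, B))    # [1, 2, 3, 4]
print(complement(S, A))    # [6, 7, 8, 9]

# the same operations with Python's built-in sets
print(set(A) | set(B))  # union
print(set(A) & set(B))  # intersection
print(set(A) - set(B))  # difference
print(set(S) - set(A))  # complement relative to S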


Equating Set Algebra Laws with Boolean Logic

  • Consider the statement "an element is a member of set $A$" as playing the role of the Boolean value True.
  • In this sense, each of the laws below applies to both set operations and Boolean operations.

| Set Operator | Python Boolean Operator |
|---|---|
| Union | or |
| Intersection | and |
| Complement | not |

Commutative


  • A ∪ B = B ∪ A
  • AB = BA

Set Logic

set1 = {'a', 'b', 'c'}
set2 = {'c', 'd', 'e'}

print(set1.union(set2) == set2.union(set1)) # --> True
print(set1.intersection(set2) == set2.intersection(set1)) # --> True

Boolean Logic

a = True
b = False

print( (a or b) == (b or a) ) # --> True
print( (a and b) == (b and a) ) # --> True

Associative


  • (A ∪ B) ∪ C = A ∪ (B ∪ C) = A ∪ B ∪ C
  • (AB)C = A(BC) = ABC

Set Logic

set1 = {'a', 'b', 'c'}
set2 = {'c', 'd', 'e'}
set3 = {'a', 'e', 'f'}

print((set1.union(set2)).union(set3) == (set3.union(set2)).union(set1)) # --> True
print((set1.intersection(set2)).intersection(set3) == (set3.intersection(set2)).intersection(set1)) # --> True

Boolean Logic

a = True
b = False
c = True

print( ((a or b) or c) == (a or (b or c)) ) # --> True
print( ((a and b) and c) == (a and (b and c)) ) # --> True

Distributive


  • A ∪ (BC) = (A ∪ B)(A ∪ C)
  • A(B ∪ C) = (AB) ∪ (AC)

Set Logic

set1 = {'a', 'b', 'c'}
set2 = {'c', 'd', 'e'}
set3 = {'a', 'e', 'f'}

print( (set2.intersection(set3)).union(set1) == (set1.union(set2)).intersection((set1.union(set3))) ) # --> True
print( (set2.union(set3)).intersection(set1) == (set1.intersection(set2)).union((set1.intersection(set3))) ) # --> True

Boolean Logic

a = True
b = False
c = True

print( (a or (b and c)) == ((a or b) and (a or c)) ) # --> True
print( (a and (b or c)) == ((a and b) or (a and c)) ) # --> True   

Idempotent Laws


  • when redundant operations achieve the same result
  • A ∪ A = A
  • AA = A

Set Logic

set1 = {'a', 'b', 'c'}

print( set1.union(set1) == set1 ) # --> True
print( set1.intersection(set1) == set1 ) # --> True

Boolean Logic

a = True

print( (a or a) == a ) # --> True
print( (a and a) == a ) # --> True

Domination Laws


  • Recall:
    • U = Universal Set, the set containing all elements under consideration
    • ∅ = Empty Set = { }
  • A ∪ U = U
  • A ∩ ∅ = ∅
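
Following the pattern of the other laws, a quick check (the universal set here is a small illustrative one):

Set Logic

U = {'a', 'b', 'c', 'd', 'e'}
set1 = {'a', 'b', 'c'}

print( set1.union(U) == U ) # --> True
print( set1.intersection(set()) == set() ) # --> True

Boolean Logic

a = True

print( (a or True) == True ) # --> True
print( (a and False) == False ) # --> True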

Absorption Laws


  • A ∪ (AB) = A
  • A(A ∪ B) = A

Set Logic

set1 = {'a', 'b', 'c'}
set2 = {'c', 'd', 'e'}

print( set1.intersection(set2).union(set1) == set1 ) # --> True
print( set1.intersection(set1.union(set2)) == set1) # --> True

Boolean Logic

a = True
b = False

print( (a or (a and b)) == a) # --> True
print( (a and (a or b)) == a) # --> True


Identity Property

  • A ∪ ∅ = A
  • A ∩ U = A

Complement Laws for Universal and Empty Set

  • ~∅ = U
  • ~U = ∅
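
These two laws can be checked with the complement helper from earlier, given an explicit sample space:

U = ['a', 'b', 'c']

print( complement(U, []) == U ) # ~∅ = U --> True
print( complement(U, U) == [] ) # ~U = ∅ --> True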

Involution Law

  • ~( ~A) = A
a = True
print( (not (not a)) == a) # --> True

A helpful, unnamed law

  • AB ∪ A~B = A
a = True
b = False

print( ((a and b) or (a and not b)) == a) # --> True

DeMorgan’s Laws

  • 1st: ~(A ∪ B) = ~A ~B
  • 2nd: ~(AB) = ~A ∪ ~B

These laws are very helpful for logic and circuit reduction, and they are commonly explored in interview questions.

~(A ∪ B) = ~A ~B

a = True
b = False

print( (not (a or b)) == ((not a) and (not b)) ) # --> True

~(AB) = ~A ∪ ~B

a = True
b = False

print( (not (a and b)) == (not a or not b) ) # --> True


Probability Theory

Inferential Statistics is the practice of using mathematical analysis to make inferences about a population from a sample. The mathematics which underlie inferential statistics are largely based on probability theory.

Calculating probability is attempting to figure out the likelihood of a specific event happening, given some number of attempts. The most fundamental and important probability calculation is defined as:

The probability of some event $A$ occurring is the number of possible outcomes in that event, divided by the total number of possible outcomes in the sample space. That is,

$$ \text{Number of Outcomes in } A = |A| = \text{"The Cardinality of } A\text{"} $$ $$ \text{Number of Outcomes in } S = |S| = \text{"The Cardinality of } S\text{"} $$

$$ P(A) = \frac{|A|}{|S|} $$

Example

Given a fair six-sided die, what is the probability of rolling a 5?

$ Event\quad A = \text{Rolling a five}$

$P(A) = \frac{1}{6} \approx 0.1667$

Solution The total number of possible outcomes is six, in other words the cardinality of the sample space is six. There is only one outcome in which our die will show five pips, so the cardinality of our event $A$ is 1. Hence, our probability is $\frac{1}{6}$.
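
The same calculation, expressed directly in terms of cardinalities:

S = [1, 2, 3, 4, 5, 6]  # sample space for one fair die
A = [5]                 # event: rolling a five

print(len(A) / len(S))  # 0.1666...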

| Notation | Meaning |
|---|---|
| $P(A)$ | Probability of A |
| $P(A^c)$ | Probability of A complement |
| $P(AB)$ | Probability of A intersect B |
| $P(A \cup B)$ | Probability of A union B |
| $P(A \mid B)$ | Probability of A given B |


  • A permutation is one of several possible variations in which a set or number of objects can be ordered or arranged.

  • A permutation can be thought of as an arrangement of a number of items

  • $nPk$

    • where $n$ is the number of possible items
    • $k$ is how many of those items to arrange

Note: ORDER MATTERS

Discovery by Counting

$$ nPk = \frac{n!}{(n-k)!} $$

If we consider $n$ to be the base of a counting system, then we can determine all permutations $k$ by a counting/reduction approach.

  1. Count in base $n$ system
    • ex: $n = 3$

$\text{ 000 010 020 100 110 120 200 210 220 001 011 021 101 111 121 201 211 221 002 012 022 102 112 122 202 212 222 }$

  2. Remove counts that contain duplicate digits, leaving

$\text{ 012 021 102 120 201 210 }$

  3. Consider $k$ items
    • ex: $k = 3$
012 021 102 120 201 210
    • ex: $k = 2$
12 21 02 20 01 10
    • ex: $k = 1$
2 1 0
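
As a check, the standard library's itertools enumerates exactly these arrangements (aliased here to avoid clashing with the permutations function defined below):

from itertools import permutations as iperms

items = [0, 1, 2]  # n = 3

print(list(iperms(items, 3)))  # 6 arrangements: 3!/0! = 6
print(list(iperms(items, 2)))  # 6 arrangements: 3!/1! = 6
print(list(iperms(items, 1)))  # 3 arrangements: 3!/2! = 3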


$$ nPk = \frac{n!}{(n-k)!} $$

def permutations(n, k):
    # integer division keeps the result exact for large n
    return factorial(n) // factorial(n - k)

Slightly more optimized:

def permutations(n, k):
    # multiply n * (n-1) * ... * (n-k+1) directly
    perm = 1
    for i in range(n, n - k, -1):
        perm *= i
    return perm

$$ nCk = \frac{n!}{(n-k)!\,k!} $$

def combinations(n, k):
    return factorial(n) // (factorial(n - k) * factorial(k))

# Slightly more optimal:
def combinations(n, k):
    perm = 1
    for i in range(n, n - k, -1):
        perm *= i
    return perm // factorial(k)  # perm is always divisible by k!
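
Python 3.8+ also ships math.perm and math.comb, which make handy sanity checks for the hand-rolled versions:

from math import comb, perm

print(permutations(5, 2), perm(5, 2))  # 20 20
print(combinations(5, 2), comb(5, 2))  # 10 10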

bernoulli()

from random import random

def bernoulli(p_success=0.5):
    # one Bernoulli trial: True with probability p_success
    draw = random()  # uniform value in [0.0, 1.0)
    return draw < p_success
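
A quick frequency check of the helper (the trial count here is an arbitrary choice):

trials = 100_000
successes = sum(bernoulli(0.3) for _ in range(trials))

print(successes / trials)  # lands near 0.3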

  • 3 parameters
  • $n$ = number of bernoulli trials
  • $p$ = probability of success on any given bernoulli trial
  • $k$ = specific number of successes for which to find the probability

binomial_pmf(n, k, p=0.5)

$$ P(X=k) = {n \choose k} p^k(1-p)^{n-k} $$

def binomial_pmf(n, k, p=0.5):
    return combinations(n, k) * (p**k) * (1-p)**(n-k)

binomial_pmf_dict()

This should take 4 parameters:

  • n the number of trials
  • k_low the low value of $k$ in the dictionary
  • k_high the high value of $k$ in the dictionary
  • p=0.5 the probability of success of a given bernoulli trial
def binomial_pmf_dict(n, k_low, k_high, p=0.5):
    d = dict()

    for k in range(k_low, k_high+1):
        d[k] = binomial_pmf(n, k, p)

    return d

d = binomial_pmf_dict(8, 0, 8, p=0.25)

for k, v in d.items():
    print(f'{k}: {v}')

poisson_pmf()

  • $e \approx 2.71828$
  • Note, both the constant e and the factorial() function are available from the math module.

$$ P(\lambda, k \text{ events}) = \frac{e^{-\lambda}\lambda^k}{k!} $$

from math import e, factorial

# print(e) # 2.718281828459045

def poisson_pmf(lmbda, k):
    return lmbda**k * e**(-lmbda) / factorial(k)

poisson_pmf_dict()

  • your parameters will be
    • lmbda
    • low_k
    • high_k

Holding lmbda constant, write a function that returns a dictionary showing the probabilities for the number of events from low_k to high_k (inclusive)

def poisson_pmf_dict(lmbda, low_k, high_k):
    d = dict()

    for k in range(low_k, high_k+1):
        d[k] = poisson_pmf(lmbda, k)

    return d

d = poisson_pmf_dict(10, 0, 30)

for k, v in d.items():
    print(f'{k}: {round(v, 6)}')

geometric_pmf()

  • p : probability of success
  • k : number of trials up to and including the first success (inclusive), or the number of failures before the first success (exclusive)
  • inclusive=True : whether to use the inclusive or exclusive pmf

PMF calculating the number of trials up to and including the first success

$$ P(X=k) = p (1-p)^{k-1} $$

PMF calculating the number of trials before the first success

$$ P(X=k) = p (1-p)^{k} $$

def geometric_pmf(p, k, inclusive=True):
    # bool arithmetic: inclusive=True subtracts 1 from the exponent,
    # giving p*(1-p)**(k-1); inclusive=False gives p*(1-p)**k
    return p * (1 - p)**(k - inclusive)
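
For example, with a fair coin ($p = 0.5$):

print(geometric_pmf(0.5, 3))                   # first success on exactly the 3rd trial: 0.125
print(geometric_pmf(0.5, 3, inclusive=False))  # exactly 3 failures before the first success: 0.0625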

poisson_cdf()

  • your parameters will be
    • lmbda
    • k_high
def poisson_cdf(lmbda, k_high):
    cdf = 0.0

    for k in range(k_high+1):
        cdf += poisson_pmf(lmbda, k)

    return cdf

binomial_cdf(n, k_high, p=0.5)

$$ P(X \le k) = \sum_{i=0}^k {n \choose i}p^i(1-p)^{n-i} $$

def binomial_cdf(n, k_high, p=0.5):
    cumulative = 0.0

    for k in range(0, k_high+1):
        cumulative += binomial_pmf(n, k, p)

    return cumulative
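
Two quick sanity checks: summing the PMF over every possible $k$ must give 1, and for ten fair coin flips $P(X \le 4)$ comes out just under one half:

print(binomial_cdf(10, 10))  # 1.0 (up to floating-point error)
print(binomial_cdf(10, 4))   # 0.3769...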










$$\sum_{n=1}^\infty \frac{1}{n^2} \to \textstyle \sum_{n=1}^\infty \frac{1}{n^2} \to \displaystyle \sum_{n=1}^\infty \frac{1}{n^2}$$

$\displaystyle \lim_{t \to 0} \int_t^1 f(t)\,dt$ versus $\lim_{t \to 0} \int_t^1 f(t)\,dt$.

$\Biggl(\biggl(\Bigl(\bigl((: x : )\bigr)\Bigr)\biggr)\Biggr)$



  • bash profile location on OSX : ~/.bash_profile

function gitadder(){
    git pull
    git add .
    git commit -m "Auto Updated: $(date '+%a %M:%H %h %d %Y')"
    git push
}





