Implementing Statistics in Python

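I studied Statistics for Business and Economics, a classic textbook used at overseas universities, but my grasp of statistical theory still lives only on paper and fades quickly. Theory on its own is dry.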

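I also worked through Python for Data Analysis, but it reads more like a reference manual: the code examples are fairly isolated and lack an application context. With so many functions and methods, I forget them soon after typing them out. The same goes for the official documentation of some packages.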

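The blogs and tutorials I have browsed that combine the two are either too shallow or detached from real applications. (Crows are black the world over: book titles abroad can be clickbait too.) Good material may well exist for R, a language built for statistics, but after skimming R in Action I cannot spare the effort for now, so I will focus on mastering Python first.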

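Eventually I found a "Statistics in Python" tutorial that is excellent and deserves the name 'Python for Data Analysis' far more. For my own study, I use this space to organize and connect the statistical theory, applications, and code implementations I have learned. Packages involved: NumPy / pandas / SciPy / scikit-learn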

Standardization: removes the effect of units and scale with preprocessing.scale(); it does not change the shape of the distribution
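
The scaling step can be sketched as follows. The data values are made up for illustration; the point is that `scale()` brings each column to zero mean and unit variance while leaving its skewness untouched.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import scale

# Hypothetical data: two features measured in very different units.
X = np.array([[170.0, 30000.0],
              [160.0, 52000.0],
              [180.0, 41000.0],
              [175.0, 87000.0]])

# Column-wise standardization: subtract the mean, divide by the standard deviation.
X_scaled = scale(X)

col_means = X_scaled.mean(axis=0)   # approximately 0 for every column
col_stds = X_scaled.std(axis=0)     # approximately 1 for every column

# Skewness is the same before and after: scaling shifts and stretches
# the values but does not reshape the distribution.
skew_before = skew(X, axis=0)
skew_after = skew(X_scaled, axis=0)
```
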
Handling skewed distributions: transform a skewed distribution toward a normal one
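
A minimal sketch of two common ways to reduce right skew, the log transform and the Box-Cox transform. Both are choices made for this example, not methods prescribed by the notes, and the log-normal sample is synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed data (strictly positive).
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

x_log = np.log(x)             # log transform; requires positive values
x_bc, lam = stats.boxcox(x)   # Box-Cox also requires positive values; lam is the fitted lambda

skew_raw = stats.skew(x)      # strongly positive for this sample
skew_log = stats.skew(x_log)  # close to 0 after the transform
skew_bc = stats.skew(x_bc)    # close to 0 after the transform
```

Data containing zeros or negative values would need a shift or a different transform before either approach applies.
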
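Correlation between variables:
- A measure of the (linear) relationship between two variables: the Pearson (product-moment) correlation coefficient
- pandas.plotting.scatter_matrix for pairwise scatter plots
- Note: the correlation discussed here is between two individual variables;
  - the "relationship" examined later under hypothesis testing is between a sample and a population, or between samples, and is limited to (sample/population) means and proportions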
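
As an illustration of the Pearson coefficient, the sketch below computes it two ways on synthetic data (the variables `x` and `y` are invented for the example) and confirms the results agree.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # linear relationship plus noise
df = pd.DataFrame({"x": x, "y": y})

# Pearson r and the p-value for the null hypothesis of no linear correlation.
r, p_value = stats.pearsonr(df["x"], df["y"])

# The full correlation matrix; DataFrame.corr() uses Pearson by default.
corr_matrix = df.corr()

# Pairwise scatter plots (requires matplotlib, so left commented out here):
# pd.plotting.scatter_matrix(df)
```
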
Missing-value imputation: sklearn.impute.SimpleImputer (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
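
A small sketch of mean imputation, assuming a recent scikit-learn where the imputer lives in `sklearn.impute`. The array is invented, and `strategy="mean"` is just one of the available strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
# Column 0 mean is (1 + 7) / 2 = 4.0; column 1 mean is (2 + 3) / 2 = 2.5
```
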

Hypothesis testing: basic concepts
When to use which t-test:
- One-sample t-test: stats.ttest_1samp() calculates the t-test for the 【mean】 of ONE group of scores.
- Two-sample t-test: stats.ttest_ind() calculates the t-test for the 【means】 of TWO INDEPENDENT samples of scores.
- Paired t-test: stats.ttest_rel() calculates the t-test on TWO RELATED samples of scores.
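
The three variants can be compared side by side in a short sketch. All samples are synthetic, and the group names and the population mean of 5.0 are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
before = rng.normal(loc=70.0, scale=5.0, size=25)
after = before + rng.normal(loc=1.0, scale=2.0, size=25)  # same subjects, measured again

# One sample: is the mean of group_a different from a known value?
one_sample = stats.ttest_1samp(group_a, popmean=5.0)

# Two independent samples: do group_a and group_b share a mean?
# (equal variances are assumed by default; pass equal_var=False for Welch's test)
two_sample = stats.ttest_ind(group_a, group_b)

# Two related samples: did the paired measurements change?
paired = stats.ttest_rel(before, after)

# The one-sample statistic computed by hand: (mean - mu) / (s / sqrt(n))
n = len(group_a)
t_manual = (group_a.mean() - 5.0) / (group_a.std(ddof=1) / np.sqrt(n))

# A paired test is a one-sample test on the differences.
paired_as_one_sample = stats.ttest_1samp(before - after, popmean=0.0)
```

Each result exposes `.statistic` and `.pvalue`; a small p-value is evidence against the null hypothesis of equal means.
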

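Goodness-of-fit test: what is tested is 【two distributions】 (observed vs. expected frequencies); only one variable is involved, a categorical variable with several levels
- stats.chisquare(f_obs=observed, f_exp=expected)
  -- The chi-square test tests the null hypothesis that the categorical data has the given frequencies.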
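
A minimal goodness-of-fit sketch: the counts below are invented, and the expected distribution is assumed to be uniform across five categories.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for one categorical variable with five levels.
observed = np.array([18, 22, 20, 25, 15])

# Expected counts under the null hypothesis of equal frequencies.
# The totals of observed and expected must match.
expected = np.full(5, observed.sum() / 5)

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# The statistic by hand: sum((O - E)^2 / E)
chi2_manual = ((observed - expected) ** 2 / expected).sum()
```
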
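Test of independence: what is tested is 【two variables】 (whether they are associated); within a single sample, the levels of the two variables are cross-tabulated against each other
- stats.chi2_contingency()
  -- Chi-square test of independence of variables in a contingency table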
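
A short sketch of the independence test on a made-up 2x3 contingency table (rows are the levels of one variable, columns the levels of the other).

```python
import numpy as np
from scipy import stats

# Hypothetical counts: 2 levels of variable A (rows) by 3 levels of variable B (columns).
table = np.array([[20, 30, 50],
                  [30, 30, 40]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (table.shape[0] - 1) * (table.shape[1] - 1)

# Expected counts under independence: row total * column total / grand total
expected_manual = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so its statistic would differ slightly from the uncorrected formula.
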

Zorro-Lin-7 / Statistics-and-Data-Analysis-in-Python