Hanhan_Data_Science_Practice

data analysis, big data development, cloud, and any other cool things!


BIG DATA! - Fantastic

  • Why Spark is great?!

  • How to run Spark through terminal command line

    • Download Spark here: https://spark.apache.org/downloads.html
    • Unpack it somewhere you like. Set an environment variable so you can find it easily later (CSH and BASH versions): setenv SPARK_HOME /home/you/spark-1.5.1-bin-hadoop2.6/ or export SPARK_HOME=/home/you/spark-1.5.1-bin-hadoop2.6/
    • Then run ${SPARK_HOME}/bin/spark-submit --master local [your code file path] [your large data file path as input; this argument only exists when your code reads sys.argv[1]] (see the sketch after this list)
  • Automation
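
Referring back to the spark-submit step above, here is a minimal sketch of what such a submitted script could look like, using PySpark's RDD API; the file name word_count.py, the word-count logic, and the sample paths are just illustrative.

```python
# word_count.py -- a minimal sketch (hypothetical file name)
# Run with: ${SPARK_HOME}/bin/spark-submit --master local word_count.py /path/to/large_file.txt
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCountSketch")
    # sys.argv[1] is the large data file path passed on the spark-submit command line
    lines = sc.textFile(sys.argv[1])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    # Print the 10 most frequent words
    for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
        print(word, count)
    sc.stop()
```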


R PRACTICE

Note: The Spark R Notebook I am using is the Community Edition; because its R version may be lower, many packages used in R Basics are not supported.


PYTHON PRACTICE

Note: I'm using the Spark Python Notebook, and some of its features are unique, because my own machine could not install the right numpy version for pandas~

  • Multi-Label Problem

  • Factorization Machines

    • Large datasets can be sparse; with factorization, you can extract important or hidden features
    • With a lower-dimensional dense matrix, factorization can represent a similar relationship between the target and the predictors
    • The drawback of linear regression and logistic regression is that they only learn the effects of all features individually, instead of in combination
    • For example, you have Fields Color, Category, Temperature, and Features Pink, Ice-cream, Cold; each feature can have different values
      • Linear regression: w0 + wPink * xPink + wCold * xCold + wIce-cream * xIce-cream
      • Factorization Machines (FMs): w0 + wPink * xPink + wCold * xCold + wIce-cream * xIce-cream + dot_product(Pink, Cold) + dot_product(Pink, Ice-cream) + dot_product(Cold, Ice-cream)
        • dot product: a.b = |a|*|b|*cosθ; when θ=0, cosθ=1 and the dot product reaches its highest value. In FMs, the dot product is used to measure similarity
        • dot_product(Pink, Cold) = v(Pink1)*v(Cold1) + v(Pink2)*v(Cold2) + v(Pink3)*v(Cold3), here k=3. This formula is the dot product of the two features' latent vectors of size 3
      • Field-aware factorization Machines (FFMs)
        • Not quite sure yet what the "latent effects" mentioned in the tutorial mean, but FFMs are aware of the fields: instead of using dot_product(Pink, Cold) + dot_product(Pink, Ice-cream) + dot_product(Cold, Ice-cream), it uses the fields, dot_product(Color_Pink, Temperature_Cold) + dot_product(Color_Pink, Category_Ice-cream) + dot_product(Temperature_Cold, Category_Ice-cream), i.e. Color & Temperature, Color & Category, Temperature & Category
    • xLearn library
      • Sample input (has to be this format, libsvm format): https://github.com/aksnzhy/xlearn/blob/master/demo/classification/criteo_ctr/small_train.txt
      • Detailed documentation about parameters, functions: http://xlearn-doc.readthedocs.io/en/latest/python_api.html
      • Personally, I think this library is a little bit funny. First of all, you have to do all the work to convert sparse data into dense format (libsvm format), then ffm will do the work, such as extracting important features and making the prediction. Not only is how it works a black box, it also creates many output files during the validation and testing stages. You'd better run everything through the terminal, so that you can see more information during the execution. I was using IPython and totally didn't know what happened.
      • But it's fast! You can also set multi-threading in a very easy way. Check its documentation. (A hedged sketch of the basic workflow follows this section.)
    • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/Factorization_Machines.ipynb
      • My code is better than the reference's
    • Reference: https://www.analyticsvidha.com/blog/2018/01/factorization-machines/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
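
A minimal sketch of the xLearn FFM workflow described above (not the exact code from my notebook); the file names and parameter values here are placeholders, based on my reading of the xLearn Python API docs.

```python
# A sketch of the xLearn FFM workflow (file names are placeholders)
import xlearn as xl

ffm_model = xl.create_ffm()                  # field-aware factorization machine
ffm_model.setTrain("small_train.txt")        # libsvm/libffm-format training data
ffm_model.setValidate("small_test.txt")      # validation data

# task, learning rate, regularization, metric and latent size are example settings
param = {"task": "binary", "lr": 0.2, "lambda": 0.002, "metric": "auc", "k": 4}
ffm_model.fit(param, "./model.out")          # writes the trained model to disk

ffm_model.setTest("small_test.txt")
ffm_model.setSigmoid()                       # map raw scores to (0, 1)
ffm_model.predict("./model.out", "./output.txt")  # predictions are written to a file
```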
  • RGF (Regularized Greedy Forest)

    • My code: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/try_RGF.ipynb
      • Looks like the evaluation result is quite bad, even with grid search cross validation (a hedged sketch of this kind of workflow follows this list)
    • Reference: https://www.analyticsvidhya.com/blog/2018/02/introductory-guide-regularized-greedy-forests-rgf-python/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
      • The reference is missing code and lacks an evaluation step
    • RGF vs. Gradient Boosting
      • In each iteration, boosting adds weights to misclassified observations for the next base learner; in each iteration, RGF changes the forest structure by one step to minimize the log loss, and also adjusts the leaf weights for the entire forest to minimize the log loss
      • RGF searches for optimum structure changes
        • The search is within the newly created k trees (default k=1), otherwise the computation can be expensive
        • Also for computational efficiency, only do 2 types of operations:
          • split an existing leaf node
          • create a new tree
        • With the weights of all leaf nodes fixed, it will try all possible structure changes and find the one with lowest logloss
      • Weights Optimization
        • After every k new leaf nodes are added, the weights of all leaf nodes are adjusted; k=100 by default
        • When k is very large, it's similar to adjusting the weights once at the end; when k is very small, it can be computationally expensive, since it's similar to adjusting all nodes' weights after each new leaf node is added
      • You don't need to set the tree size, since it is determined automatically through the log loss minimization process. What you can set is the max number of leaf nodes and the regularization (L1 or L2)
      • RGF may give a simpler model to train compared with boosting methods, since boosting methods require a small learning rate and a large number of estimators
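
A hedged sketch of the kind of grid search workflow mentioned above, assuming the rgf_python package and scikit-learn; the synthetic data and the parameter grid are just examples, not the setup from my notebook.

```python
# A sketch of tuning RGF with grid search cross validation
# Assumes the rgf_python package (pip install rgf_python) and scikit-learn
from rgf.sklearn import RGFClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "max_leaf": [300, 900],          # max leaf nodes (tree size itself is not set directly)
    "l2": [0.1, 0.01],               # L2 regularization strength
    "algorithm": ["RGF", "RGF_Sib"],
}

grid = GridSearchCV(RGFClassifier(), param_grid, scoring="neg_log_loss", cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```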
  • Regression Spline

    • Still, EXPLORE DATA first. When you want to try regression, check the relationship between the independent variables (features) and the dependent variable (label) first, to see whether there is a linear relationship
    • Linear Regression, a linear formula between X and y, deals with linear relationships; Polynomial Regression converts that linear formula into a polynomial one, and it can deal with non-linear relationships.
    • When we increase the power in polynomial regression, it becomes easier to overfit. Also, with a higher-degree polynomial function, the change of one y value in the training data can affect the fit of data points far away (the non-local problem).
    • Regression Spline (non-linear method)
      • It tries to overcome the problems of polynomial regression: when we apply one polynomial function to the whole dataset, it may impose a global structure on the data, so how about fitting different portions of the data with different functions
      • It divides the dataset into multiple bins, and fits each bin with a different model
      • Points where the division occurs are called "Knots". The functions used for the bins are known as "Piecewise functions". More knots lead to more flexible piecewise functions. When there are k knots, we will have k+1 piecewise functions.
      • Piecewise Step Functions: the function remains constant within each bin
      • Piecewise Polynomials: each bin is fit with a lower-degree polynomial function. You can consider a Piecewise Step Function as a Piecewise Polynomial with degree 0
      • A piecewise polynomial of degree m with m-1 continuous derivatives is a "spline". This means:
        • The plot is continuous at each knot
        • The derivatives at each knot are the same from both sides
        • Cubic and Natural Cubic Splines
          • You can try a Cubic Spline (the polynomial function has degree=3) to add these constraints so that the plot is smoother. A Cubic Spline with k knots has k+4 degrees of freedom (this means there are k+4 parameters free to change)
          • The fit near the boundary knots can be unpredictable; to smooth it out, you can use a Natural Cubic Spline (which is linear beyond the boundary knots)
      • Choose the number and locations of knots
        • Option 1 - Place more knots in places where we feel the function might vary most rapidly, and to place fewer knots where it seems more stable
        • Option 2 - cross validation to help decide the number of knots:
          • remove a portion of data
          • fit a spline with k knots on the rest of the data
          • predict the removed data with the spline, and choose the k with the smallest RMSE
      • Another method to produce splines is called "smoothing splines". It works similarly to Ridge/Lasso regularization: it minimizes a loss function plus a penalty on the roughness of the fit
    • My Code [R Version]: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/learn_splines.R (a Python sketch of a regression spline fit follows below)
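
As a Python counterpart to the R code above, here is a hedged sketch of fitting a cubic regression spline with chosen interior knots using SciPy's LSQUnivariateSpline; the data, knot locations, and RMSE check are illustrative only.

```python
# A sketch of fitting a cubic regression spline with chosen knots, using SciPy
# (the knot locations and synthetic data are arbitrary examples)
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))          # x must be increasing for LSQUnivariateSpline
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = [2.5, 5.0, 7.5]                           # interior knots ("Knots" in the notes above)
spline = LSQUnivariateSpline(x, y, t=knots, k=3)  # k=3 -> piecewise cubic spline

y_hat = spline(x)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))     # RMSE could drive the knot-count choice via CV
print("RMSE:", rmse)
```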

DIMENSION REDUCTION


DATA PREPROCESSING


TREE BASED MODELS


GRAPH THEORY

  1. Closeness Centrality – for a node, it is based on the average length of the shortest paths from that node to all other nodes (commonly defined as the reciprocal of that average, so closer nodes score higher)
  2. Betweenness Centrality – the number of times a node appears on the shortest paths between pairs of other nodes (see the sketch below)
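
A small sketch of computing both measures with NetworkX on a toy graph (the graph itself is made up for illustration):

```python
# Compute closeness and betweenness centrality with NetworkX on a toy graph
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

# Closeness: based on shortest-path distances from each node to all others
print(nx.closeness_centrality(G))

# Betweenness: fraction of shortest paths between other node pairs passing through each node
print(nx.betweenness_centrality(G))
```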

ADVANCED TOOLS


CLOUD for DATA SCIENCE


KAGGLE PRACTICE

-- Notes

  • Dimensional Reduction: I tried the FAMD model first, since it is supposed to handle a mix of categorical and numerical data, but my laptop didn't have enough memory to finish it. Then I changed to PCA, but I needed to convert the categorical data into numerical data myself first. After running PCA, it shows that the first 150-180 columns contain the major info of the data.
  • About FAMD: FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of both continuous and categorical variables in the analysis, so that both types of variables are on an equal footing in determining the dimensions of variability. This method allows one to study the similarities between individuals taking into account mixed variables, and to study the relationships between all the variables. It also provides graphical outputs such as the representation of the individuals, the correlation circle for the continuous variables, representations of the categories of the categorical variables, and specific graphs to visualize the associations between both types of variables. https://cran.r-project.org/web/packages/FactoMineR/FactoMineR.pdf
  • The predictive analysis part in the R code is slow for SVM and NN on my laptop (50 GB of disk space available), even though 150 features have been chosen from the 228 features
  • Spark Python is much faster, but you need to convert the .csv file data into LabeledPoint for the training data and SparseVector for the testing data (see the sketch after this list).
  • In my Spark Python code, I have tried SVM with SGD, Logistic Regression with SGD, and Logistic Regression with LBFGS, but when I tuned the parameters for SVM and Logistic Regression with SGD, they always returned an empty list, which should show those people who will buy insurance. Logistic Regression with LBFGS gives better results.
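
A hedged sketch of the MLlib (RDD API) steps described above: parsing CSV rows into LabeledPoint and training Logistic Regression with LBFGS. The file path and the column layout (label in the last column) are hypothetical, not my Kaggle setup.

```python
# A sketch: CSV rows -> LabeledPoint -> LogisticRegressionWithLBFGS (MLlib RDD API)
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="InsuranceSketch")

def parse_line(line):
    values = [float(v) for v in line.split(",")]
    return LabeledPoint(values[-1], values[:-1])   # label last, features before it (assumed layout)

train_rdd = sc.textFile("train.csv").map(parse_line)   # placeholder path, assumes no header row

model = LogisticRegressionWithLBFGS.train(train_rdd, iterations=100)

# Predict for one row of features; a SparseVector also works here for sparse test data
test_features = train_rdd.first().features
print(model.predict(test_features))
sc.stop()
```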

OTHER
