mainkoon81 / Study-09-MachineLearning-C

**DeepLearning** Intro.. Need to wrap up more projects here...


Study-09-MachineLearning-C-[Deep Learning]

DeepLearning Intro http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/


Each "NODE" in a NN is a linear model (classifier) over many features, whose weights (parameters) are updated by an optimization algorithm such as Gradient Descent that minimizes the cost function ("MSE" for the regression form, "CrossEntropy" for the logistic-regression form). The output of this linear model then becomes the input of an activation function (step for discrete output, sigmoid for continuous output), which returns a classification or probability for each datapoint.
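
A minimal numpy sketch of a single node, with made-up weights W, bias b, and one 2-feature datapoint x (all numbers are just for illustration): the linear combination Wx + b is passed through a sigmoid to produce a probability.

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1), so it can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

# made-up weights and bias for a single node with two input features
W = np.array([0.4, -0.2])
b = 0.1

x = np.array([1.0, 2.0])      # one datapoint
score = np.dot(W, x) + b      # the linear model: Wx + b
prob = sigmoid(score)         # the activation turns the score into a probability
print(prob)                   # ~0.52 for these numbers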

What if the data is not linearly separable ?

How do we combine two linear models into a non-linear model? Each linear model defines a whole probability space: for every point, it gives us the probability of that point being 'positive'. With two linear models, we therefore get two probability values.

  • Example01: We add up the two probabilities and pass the sum into the Sigmoid function, which gives us the final probability value!
  • Example02: But what if we want to weight this sum?
    • We take a [LINEAR-COMBINATION] of the two linear models (think of the new boundary yielded by combining the two models), as sketched below.
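
A quick numeric sketch of this combination (all parameters below are made up): each linear model produces a probability via the sigmoid, and the output node either sums those probabilities or takes a weighted linear combination of them, then applies another sigmoid.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])                      # one datapoint

# two linear models (made-up parameters)
p1 = sigmoid(np.dot([ 5.0, -2.0], x) - 8.0)   # probability from model 1
p2 = sigmoid(np.dot([-4.0,  7.0], x) + 3.0)   # probability from model 2

# Example01: plain sum, then sigmoid
print(sigmoid(p1 + p2))

# Example02: weighted linear combination, then sigmoid
w1, w2, b = 7.0, 5.0, -6.0
print(sigmoid(w1 * p1 + w2 * p2 + b))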

Deep Neural Network(More complex network and multiple layers):

  • Basic 3 Layers
    • Input-layer: input values that constitute each datapoint (from field_A, field_B,..or from the Sigmoid_A, Sigmoid_B..)
    • Hidden-layer: a set of linear models generated by the Input-layer, and probabilities from the Sigmoid().
    • Output-layer: where the two linear models(two Sigmoid-outputs) get combined to obtain a non-linear model, and a single probability from the Sigmoid().
  • If adding more nodes to the input, hidden, and output layers?
    • What happens if the Hidden-layer has more nodes? (more models, each with an activation function)
      • We combine more linear models and obtain a triangular boundary in the Output-layer.
    • What happens if the Input-layer has more nodes?(higher dimension)
      • Our linear models become planes, and the Output-layer bounds a non-linear region in the higher-dimensional space.
    • What if the Output-layer has more nodes?(multi-classes)
      • We get more outputs. This is when we have a multiclass classification model. Each node in the Output-layer outputs the score for one of the classes (one node for A, one for B, ...).
  • If adding more layers?
    • Our linear models combine to create non-linear models, then these combine to create even more non-linear models.
    • In this way, the network will split the n-dimensional space with a highly non-linear boundary as the output.

  • Multiclass Classification: If our neural network needs to model data with more than one output?
    • Simply add more nodes in the Output-layer!
    • We take the scores and apply the softmax() function to obtain well-defined probabilities.
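
A minimal sketch of softmax (with the usual max-subtraction for numerical stability), turning raw class scores into probabilities that sum to 1:

import numpy as np

def softmax(scores):
    # subtract the max score for numerical stability; it doesn't change the result
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # one score per class (made-up numbers)
print(softmax(scores))               # roughly [0.66, 0.24, 0.10] -- sums to 1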

Fitting your model

How do neural networks process the input to obtain an output?

1. Feedforward

What parameters (W, b) should sit on the edges from the inputs (x1, x2) in order to model our data well?

  • The perceptron (the simplest NN) here is defined by a linear model where W1 > W2.
  • The perceptron takes a point (x1, x2) and outputs the probability that the point is positive.

  • Error-Function: how badly is each point being classified? How far is it from the line?

2. Backpropagation (How to update the model parameters?)

    1. Doing a feedforward operation.
    2. Comparing the output of the model with the desired output.
    3. Calculating the error.
    4. Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
    5. Using this to update the weights, and get a better model.
    6. Continuing this until we have a model that is good.

[GradientDescentAlgorithm & Backpropagation]

Single Perceptron

  • We calculate the Gradient of the Error-Function E(W)
    • What the misclassified point wants: the boundary to come closer to it. The boundary gets closer to it by updating (W, b).
    • We continue doing this to minimize the error.
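
A minimal sketch of one such update for a single sigmoid node, assuming the cross-entropy error function; for that combination the gradient works out so that the weights move by learn_rate * (y - prediction) * x (all numbers below are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learn_rate = 0.1
W, b = np.array([0.5, -0.5]), 0.0

x = np.array([1.0, 2.0])    # a (misclassified) point
y = 1.0                     # its true label

prediction = sigmoid(np.dot(W, x) + b)

# for sigmoid + cross-entropy, dE/dW = (prediction - y) * x, so gradient descent gives:
W += learn_rate * (y - prediction) * x   # the boundary moves toward the misclassified point
b += learn_rate * (y - prediction)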

Multi-layer Perceptron

  • The Error-Function is more complicated, thus we do Backpropagation:
    • What the misclassified point wants: the (+) region to come closer to it. Then,
      • looking at the two linear models in the hidden-layer, we can see which one is doing better,
      • so we listen to the better linear model more than the other:
        • reduce the W coming from the loser and increase the W coming from the winner;
      • or we go back to the hidden layer and, for the loser model, move its boundary closer to the point by updating (W, b), and for the winner model, move its boundary farther from the point by updating (W, b).
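
A compact numpy sketch of one backpropagation step through a single hidden layer (sigmoid activations, cross-entropy error, one training point, made-up initial weights): the output error is pushed back so that each hidden model's weights are adjusted according to how much it contributed.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learn_rate = 0.5
x, y = np.array([1.0, 2.0]), 1.0            # one training point and its label

# made-up initial parameters: 2 inputs -> 2 hidden nodes -> 1 output node
W1, b1 = np.array([[0.1, -0.2], [0.3, 0.4]]), np.zeros(2)
W2, b2 = np.array([0.2, -0.1]), 0.0

# feedforward
h = sigmoid(W1 @ x + b1)                    # hidden-layer probabilities
y_hat = sigmoid(W2 @ h + b2)                # final probability

# backpropagation (sigmoid + cross-entropy => the output error is simply y_hat - y)
delta_out = y_hat - y
delta_hidden = delta_out * W2 * h * (1 - h) # each hidden node's share of the blame

# update: the "winner"/"loser" hidden models get their weights adjusted accordingly
W2 -= learn_rate * delta_out * h
b2 -= learn_rate * delta_out
W1 -= learn_rate * np.outer(delta_hidden, x)
b1 -= learn_rate * delta_hidden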

  • Example

Building a Neural Network in Keras

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation

model = Sequential()
  • model: The keras.models.Sequential class is a wrapper for the neural network model that treats the network as a sequence of layers.
    • It implements the Keras model interface with common methods like
      • .compile(), .fit(), .evaluate() and .predict_proba() that are used to train and run the model.
  • layers: The keras.layers module provides a common interface for a variety of standard neural network layers:
    • fully connected layers
    • max pool layers
    • activation layers, etc.
    • We can add a layer to a model using the model's add() method.
    • Keras requires the input shape to be specified in the first layer, then it will automatically infer the shape of all other layers. This means we only have to explicitly set the input dimensions for the first layer.

For example, a simple model with a single hidden layer might look like this:

  • X has shape (num_rows, num_cols), where the training data are stored as row vectors.
  • y must have an output vector for each input vector.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [0], [0], [1]], dtype=np.float32)


# 1st Layer - Add an input layer of 32 nodes with the same input shape as the training samples in X
model.add(Dense(32, input_dim=X.shape[1]))

# Add an activation layer
model.add(Activation('softmax'))

# 2nd Layer - Add a fully connected output layer
model.add(Dense(1))

# Add an activation layer
model.add(Activation('sigmoid'))

The first (hidden) layer, model.add(Dense(32, input_dim=X.shape[1])), creates 32 nodes (or 32 models), each of which expects to receive 2-element vectors (X.shape[1]: two columns) as input. Each layer takes the outputs from the previous layer as its inputs and pipes them through to the next layer. This chain of passing output to the next layer continues until the last layer, which is the output of the model. We can see that the output has dimension 1.

The activation layers are equivalent to specifying an activation function in the Dense layers. For example, model.add(Dense(128)); model.add(Activation('softmax')) is computationally equivalent to model.add(Dense(128, activation="softmax")), but it is common to explicitly separate the activation layers because it allows direct access to the outputs of each layer before the activation is applied (which is useful in some model architectures).

Once we have our model built, we need to compile it before it can be run. Compiling the Keras model calls the backend (tensorflow, theano, etc.) and binds the optimizer, loss function, and other parameters required before the model can be run on any input data.

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics = ["accuracy"])

We'll specify the loss function to be categorical_crossentropy, which is used with one-hot encoded class labels (for a single 0/1 output column, binary_crossentropy is the usual choice), and specify adam as the optimizer (a reasonable default when speed is a priority). Finally, we can specify what metrics we want to evaluate the model with; here we'll use accuracy.

  • Keras loss= :

    • mean_squared_error, mean_absolute_error, mean_squared_logarithmic_error, binary_crossentropy, categorical_crossentropy
  • Keras optimizer= :

    • sgd, rmsprop, adagrad, adadelta, adam
  • Keras metrics= :

    • accuracy

The model is trained with fit(). verbose is the message level (how much information we want displayed on the screen during training).

model.fit(X, y, epochs=1000, verbose=0)
model.evaluate(X, y)

Example_01. (logical operator)

  • A perceptron can be a logical operator: AND (intersection SET), OR (union SET), XOR (symmetric-difference SET), NOT
    • Take two inputs then returns an output.
    • Modify the parameters(W,b)
  • When our data is not linearly separable, a multi-layer NN can still classify it.

weight1? weight2? bias?

import pandas as pd

# Set weight1, weight2, and bias here before running (candidate values are given below)
weight1 = 0.0
weight2 = 0.0
bias = 0.0

test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

for i, co in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * i[0] + weight2 * i[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == co else 'No'
    outputs.append([i[0], i[1], linear_combination, output, is_correct_string])

num_wrong = len([o[4] for o in outputs if o[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', '  Activation Output', '  Is Correct'])
if not num_wrong:
    print('Nice!  You got it all correct.\n')
else:
    print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
  1. AND Perceptron: weights and bias ?
weight1 = 1.0
weight2 = 1.0
bias = -2.0
  2. OR Perceptron: weights and bias ?
  • two ways to go from an AND perceptron to an OR perceptron.
    • increase the weights
    • decrease the magnitude of the bias
weight1 = 2.0
weight2 = 2.0
bias = -2.0

weight1 = 1.0
weight2 = 1.0
bias = -1.0
  3. NOT Perceptron: weights and bias ?
  • the NOT operation only cares about one input.
    • The operation returns a '0' if the input is 1.
    • The operation returns a '1' if it's a 0.
    • The other inputs to the perceptron are ignored. If we ignore the first input, then...
weight1 = 0.0
weight2 = -2.0
bias = 1.0
  4. XOR Multi-Layer Perceptron (exclusive OR)
  • [What if it's impossible to build the linear decision surface ?]
    • Combine perceptrons: "the output of one = the input of another one"...'Neural Network'
    • This is a simple multi-layer feedforward neural network.

  • Set the first layer to a Dense() layer with an output width of 8 to 32 nodes and the input_dim set to the size of the training samples (in this case, input_dim=2).
  • Set the output-layer width to match the labels: 1 node if y is kept as a single 0/1 column (0 for one class and 1 for the other), or 2 nodes if y is one-hot encoded, as in the code below.
  • Use a sigmoid activation function after the output layer.
  • Run the model for 50 epochs.
import numpy as np
from keras.utils import np_utils
import tensorflow as tf
np.random.seed(42)


# Our data
X = np.array([[0,0],[0,1],[1,0],[1,1]]).astype('float32')
y = np.array([[0],[1],[1],[0]]).astype('float32') 

# One-hot encoding the output: convert class vectors to binary class matrices
y = np_utils.to_categorical(y)

# Initial Setup for Keras
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Flatten

# Building the model
xor = Sequential()
xor.add(Dense(32, input_dim=2))
xor.add(Activation("sigmoid"))
xor.add(Dense(2))
xor.add(Activation("sigmoid"))
xor.compile(loss="categorical_crossentropy", optimizer="adam", metrics = ['accuracy'])

# print the model architecture
xor.summary()


# Fitting the model
history = xor.fit(X, y, epochs=50, verbose=0)

# Scoring the model
score = xor.evaluate(X, y)
print("\nAccuracy: ", score[-1])

# Checking the predictions
print("\nPredictions:")
print(xor.predict_proba(X))

The final prediction matrix tells us the probability of each class for each data point. The xor.evaluate(X, y) score gives Accuracy: 0.50 -> 0.75, and 0.75 means that, out of 4 input points, we're correctly classifying only 3 of them. Let's change some parameters to improve: for example, increase the number of epochs or the number of hidden nodes. Can we reach 100%?
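
One possible tweak (reusing X, y, Sequential, Dense, and Activation from above): a tanh hidden layer and more epochs typically push this XOR model toward 100% accuracy, though results vary with the random seed.

# a possible variation: tanh hidden activation and more epochs (results vary by seed)
xor = Sequential()
xor.add(Dense(8, input_dim=2))
xor.add(Activation("tanh"))
xor.add(Dense(2))
xor.add(Activation("sigmoid"))
xor.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
xor.fit(X, y, epochs=500, verbose=0)
print(xor.evaluate(X, y)[-1])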

Example_02. (Student Admissions)

==> this is the project!

[Training Optimization]

There are so many things that can fail...

  • overfitting?
  • poorly chosen architecture?
  • noisy data?
  • model-running time?

http://ruder.io/optimizing-gradient-descent/index.html#rmsprop

1. Overfitting: Early Stopping Method
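
In Keras, one way to do this is the EarlyStopping callback, which halts training once the monitored validation loss stops improving; a minimal sketch assuming some training data X, y and a compiled model (the patience and validation_split values are arbitrary):

from keras.callbacks import EarlyStopping

# stop when the validation loss has not improved for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X, y, epochs=1000, validation_split=0.2, callbacks=[early_stop], verbose=0)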

2. Overfitting: Regularization Method

choose L1:

  • In case we want a "sparse vector" (the smaller weights tend to go to zero), and we want to reduce the number of weights and end up with a small set...
  • Sometimes we have a problem with hundreds of features, and L1 helps us select which features are important. It will turn the rest into zeros.

choose L2:

  • If we want to keep all the weights homogeneously small... and want a smaller error.
  • This normally gives better results for training models, so it is the one people use most.
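
In Keras (version 2 style), L1 or L2 penalties can be attached to a layer's weights through the kernel_regularizer argument; the 0.01 strengths below are arbitrary examples:

from keras import regularizers
from keras.layers import Dense

# L1: pushes small weights all the way to zero (sparse weights / feature selection)
model.add(Dense(32, input_dim=2, kernel_regularizer=regularizers.l1(0.01)))

# L2: keeps all the weights homogeneously small
model.add(Dense(32, kernel_regularizer=regularizers.l2(0.01)))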

3. Unbalanced Edges: Dropout Method

When we train a neural network, sometimes one part of the network ends up with very large weights and dominates all the training. To solve this, we turn the dominating part off and let the rest train. More precisely, as we go through the epochs, we randomly turn off some nodes (hey, you shall not pass through here). The other nodes then have to pick up the slack and take a bigger part in the training. On average, each node gets the same treatment.
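
In Keras this is the Dropout layer; its argument is the fraction of that layer's nodes randomly turned off during each training pass (0.2 below is just an example):

from keras.layers import Dense, Dropout, Activation

model.add(Dense(64, input_dim=2))
model.add(Activation('relu'))
model.add(Dropout(0.2))      # randomly silence 20% of these nodes on every training pass
model.add(Dense(1))
model.add(Activation('sigmoid'))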

4. Vanishing Gradient: Other Activation Function

Problem of "Gradient-Descent": Local Minima & Vanishing Gradient

Solution: other activation functions, such as the Hyperbolic Tangent (tanh) or ReLU.
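
Both are available as Keras activations, so swapping the hidden layer's sigmoid is a one-line change:

model.add(Dense(32, input_dim=2))
model.add(Activation('relu'))    # or Activation('tanh'); both suffer less from vanishing gradients than the sigmoid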

5. Wrong Learning Rate: Optimizers

Problem of wrong learning rate

What "learning-rate" to use?

  • If it's too big, then we take several huge steps, which could be fast at the beginning, but we may overshoot the minimum and keep going. This is chaotic.
  • If it's too small, then we take small, steady steps and have a better chance of arriving at the local minimum, but the model becomes slow. A good rule of thumb: if your model isn't working, decrease the learning rate.
  • Learning Rate optimizers
    • RMSProp (RMS stands for Root Mean Square): it decreases the learning rate by dividing it by an exponentially decaying average of squared gradients.
    • AdaGrad: it adapts the learning rate for each weight individually, using the accumulated history of its squared gradients.
    • Adam: the most popular. it builds on RMSprop and adds momentum!
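
In Keras, these optimizers can be passed to compile() either by name or as configured objects (the learning-rate value below is only illustrative):

from keras.optimizers import SGD, RMSprop, Adam

model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.001), metrics=['accuracy'])
# or simply by name:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])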

6. Being Stuck at Local Minima: Random Restart or Momentum?

We start from a few different random places and do gradient descent from all of them. This increases the odds that we will reach the Global Minimum.

The idea is that you walk a bit fast, with momentum and determination, so that if you get stuck in a local minimum you power through and get over the hump to look for a lower minimum. Momentum is a constant beta between 0 and 1 that attaches to the previous steps. For example,

  • the previous step gets multiplied by 1, the one before it by beta, the one before that by beta-squared, then beta-cubed... (see the small sketch after this list)
  • Once we reach the Global Minimum, momentum will still be pushing us, but not as much.
  • Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.
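
A tiny sketch of the momentum idea on a single made-up weight (beta = 0.9 is a typical choice): each step is the current gradient plus a beta-decayed sum of all the previous steps.

learn_rate, beta = 0.1, 0.9     # beta: the momentum constant between 0 and 1
w, velocity = 5.0, 0.0          # one weight and its accumulated step

def gradient(w):
    return 2 * w                # made-up example: minimizing E(w) = w**2

for _ in range(20):
    # new step = current gradient + beta * previous step + beta**2 * the one before + ...
    velocity = beta * velocity + gradient(w)
    w -= learn_rate * velocity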

7. Long Running Time: Batch & Stochastic Gradient Descent (SGD)

Problem of long running time

Well... we don't need to plug in all our data every time we take a step; we can use random subsets of the data instead. Each subset gives a less exact estimate of the gradient, but it's quick, and over many iterations the result is still very good. This is where "Stochastic Gradient Descent" comes into play. Since we still want to use all the data, we split it into several batches. In practice, it's much better to take a bunch of slightly inaccurate steps than to take one good one.

  • [Parameters for SGD]:
    • Learning rate
    • Momentum (This takes the weighted average of the previous steps, in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima).
    • Nesterov Momentum (This slows down the gradient when it's close to the solution)
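
These three parameters map directly onto Keras's SGD optimizer; a sketch with illustrative values (batch_size in fit() controls how the data is split into batches):

from keras.optimizers import SGD

sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)   # learning rate, momentum, Nesterov momentum
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X, y, epochs=100, batch_size=2, verbose=0)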

Another way to deal with non-linear data

A piecewise Linear Function and NN-Regression

  • What if, at the end of the network, we expect it to return any number? Then it's a regression! (Just remove the final activation function.)
  • The final value would be a weighted SUM of the outputs of the previous layer.
  • In order to train this network, we'd use a different Error-Function for calculating the directions:
    • MSE (the AVG of the square of the difference b/w the real labels and the predictions)
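
A minimal Keras sketch of this recipe: the last Dense layer has no activation, so the output can be any real number, and the loss is MSE (layer sizes here are arbitrary):

from keras.models import Sequential
from keras.layers.core import Dense, Activation

regressor = Sequential()
regressor.add(Dense(32, input_dim=2))
regressor.add(Activation('relu'))
regressor.add(Dense(1))                     # no final activation -> any real-valued output
regressor.compile(loss='mean_squared_error', optimizer='adam')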
