cpmpercussion / keras-mdn-layer

An MDN Layer for Keras using TensorFlow's distributions module


A couple of questions concerning your 1D sine example

pomorigio opened this issue · comments

Hello @cpmpercussion,

Thank you so much for your contribution; it is very valuable for those of us who are still not very familiar with Keras and TensorFlow, such as me! 🥇

Glancing through your 1D sine prediction example, I was pretty surprised at how accurate it is, given that it uses 10 Gaussians and only 15 units in each hidden layer! Isn't it usually more advisable to have more hidden nodes than output nodes, to prevent information loss?

I am trying to use your code to reproduce Bishop's inverted-sine example; however, I am still not able to achieve very good prediction results, as you may see...
I seem to obtain much better results in Matlab with the very same set of hyperparameters:

  • Adam optimizer, step size = 1e-3, beta_1 = 0.9, beta_2 = 0.999
  • NSAMPLE = 1000, validation split = 0.3
  • Batch size = NSAMPLE (batch gradient descent)
  • N_HIDDEN = 20, N_MIXES = 3
  • 1000 test samples
  • Nepochs = 3000
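
For reference, the optimizer settings above correspond to something like the following in Keras (just a sketch; the argument is `learning_rate` rather than `lr` in newer versions):

from keras.optimizers import Adam

# Explicit Adam configuration matching the hyperparameters listed above.
# These happen to be Keras' defaults, so Adam() with no arguments is equivalent.
opt = Adam(lr=1e-3, beta_1=0.9, beta_2=0.999)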

Please also find my Matlab plot attached here (note that there 'validation' refers to test samples, and vice versa). The code I used (only the modifications listed above, applied to yours) is below.

import keras
import mdn
import numpy as np
import matplotlib.pyplot as plt

## Generating some data:
NSAMPLE = 1000

x_data = np.random.uniform(0, 1, NSAMPLE)  # predictor variable
y_data = x_data + 0.3*np.sin(2*np.pi*x_data) + np.random.uniform(-0.1, 0.1, NSAMPLE)  # noisy sine
x_data, y_data = y_data, x_data  # swap x and y to obtain the inverted (one-to-many) problem

plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', alpha=0.3)
plt.show()

N_HIDDEN = 20
N_MIXES = 3

model = keras.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 1), activation='tanh'))
model.add(mdn.MDN(1, N_MIXES))
model.compile(loss=mdn.get_mixture_loss_func(1,N_MIXES), optimizer=keras.optimizers.Adam()) #, metrics=[mdn.get_mixture_mse_accuracy(1,N_MIXES)])
model.summary()

history = model.fit(x=x_data, y=y_data, verbose=0, batch_size=NSAMPLE, epochs=3000, validation_split=0.3)

plt.figure(figsize=(10, 5))
plt.ylim([-3,3])
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.show()

## Sample on some test data:
x_test = np.float32(np.arange(0,1,0.001))
NTEST = x_test.size
print("Testing:", NTEST, "samples.")
x_test = x_test.reshape(NTEST,1) # needs to be a matrix, not a vector

# Make predictions from the model
y_test = model.predict(x_test)
# y_test contains parameters for distributions, not actual points on the graph.
# To find points on the graph, we need to sample from each distribution.

# Sample from the predicted distributions
y_samples = np.apply_along_axis(mdn.sample_from_output, 1, y_test, 1, N_MIXES, temp=1.0)

# Split up the mixture parameters (for future fun)
mus = np.apply_along_axis((lambda a: a[:N_MIXES]),1, y_test)
sigs = np.apply_along_axis((lambda a: a[N_MIXES:2*N_MIXES]),1, y_test)
pis = np.apply_along_axis((lambda a: mdn.softmax(a[2*N_MIXES:])),1, y_test)

# Plot the samples
plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', x_test, y_samples[:,:,0], 'bo',alpha=0.3)
plt.show()
# These look pretty good!

# Plot the means - this gives us some insight into how the model learns to produce the mixtures.
plt.figure(figsize=(8, 8))
plt.plot(x_data,y_data,'ro', x_test, mus,'bo',alpha=0.3)
plt.show()
# Cool!

# Let's plot the variances and weightings of the means as well.
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
ax1.scatter(x_data,y_data,marker='o', c='r', alpha=0.3)
for i in range(N_MIXES):
    ax1.scatter(x_test, mus[:,i], marker='o', s=200*sigs[:,i]*pis[:,i],alpha=0.3)
plt.show()

Do you have any idea about why this is happening?

Thank you so much in advance, and may you have a nice day!

Hrm, excellent questions! I'll run your code and see if I get any similar results.

Regarding my 1D example, I guess I was trying to follow Hardmaru's example with the structure there, but reducing the number of Gaussian components.

Actually I find the most satisfying solution is to use 5 mixture components (one for each leg of the output function), but that doesn't always converge, so I've had good experience using 10 (I train that network live in lectures :-D )

I'm not sure that there are any hard rules about the number of hidden units vs. the number of output nodes, e.g., see the (old) comments here: ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu

I'll try out your code and get back to you soon!

Hi, I put your code in a Colab notebook and tried some new plots and a few experiments.

https://colab.research.google.com/drive/1T1NxtKS4TC9nbgZGOZSPQ65YXw-q4lND

It seems to me that the network is training well, but perhaps the variances could be tighter, and the sampling procedure isn't working very well. Maybe the variances aren't being interpreted properly?

When I used the "temperature" arguments in the sampling procedure, it produced more acceptable results.
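
For example, something along these lines in the sampling step (a rough sketch; the sigma_temp value here is arbitrary and would need tuning):

# Lower sigma_temp to narrow the sampled variances, while keeping the
# mixture-weight temperature (temp) near 1.0.
y_samples = np.apply_along_axis(mdn.sample_from_output, 1, y_test,
                                1, N_MIXES, temp=1.0, sigma_temp=0.1)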

I've had similar problems with other MDNs, so this could be something to look into.

Hello again Charles,

Thank you so much for your prompt reply! I have spent these days playing a little bit with your code and was able to get some satisfactory results.

Increasing the number of hidden nodes definitely has an impact on performance in your first example, and I would argue that this is because, as I mentioned, having fewer hidden nodes than output nodes acts as a kind of 'bottleneck' in the approximation.

In addition, I could see the strong influence that temperature has on the results! By keeping the 'temp' value as close to 1 as possible and doing the opposite with 'sigma_temp', I was able to match the results from my Matlab version pretty closely.

Actually, my task deals with a 2D-input / 2D-output problem, and I saw that your examples are 1D-1D and 2D-1D only. I am on the verge of finishing Bishop's 2D-2D example from the paper I mentioned before, using your code; if you want, I can pass it to you so you can upload it to this repository (I am brand new to GitHub and don't know how to do it myself).
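
For reference, a minimal sketch of what I mean by a 2D-2D setup with your layer (the hidden-layer size and number of mixtures are just placeholders):

import keras
import mdn

N_HIDDEN = 20
N_MIXES = 3
OUTPUT_DIMS = 2

# Same structure as the 1D example, but with 2D input and a 2D mixture output.
model_2d = keras.Sequential()
model_2d.add(keras.layers.Dense(N_HIDDEN, batch_input_shape=(None, 2), activation='tanh'))
model_2d.add(mdn.MDN(OUTPUT_DIMS, N_MIXES))
model_2d.compile(loss=mdn.get_mixture_loss_func(OUTPUT_DIMS, N_MIXES),
                 optimizer=keras.optimizers.Adam())
model_2d.summary()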

(And last but not least, I found another issue, but I'd better ask about it separately because it is completely different!)

Thank you again for your kind support! :)

Hey! Great that you're figuring this out. If you get your other example to work, it would be great if you posted the code directly here as a comment.

Actually, some of my other examples use >2D inputs and outputs, but in those cases the ANN is configured as a sequence-predicting RNN.
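
Roughly, the sequence version looks like this (only a sketch, not one of the actual repo examples; the sequence length, dimensionality, and layer sizes are placeholders):

import keras
import mdn

SEQ_LEN = 30   # length of each input sequence (placeholder)
DIMS = 3       # dimensionality of each sequence element (placeholder)
N_MIXES = 10
N_HIDDEN = 256

# Predict the next element of a sequence: RNN layers followed by the MDN layer.
rnn_model = keras.Sequential()
rnn_model.add(keras.layers.LSTM(N_HIDDEN, return_sequences=True,
                                input_shape=(SEQ_LEN, DIMS)))
rnn_model.add(keras.layers.LSTM(N_HIDDEN))
rnn_model.add(mdn.MDN(DIMS, N_MIXES))
rnn_model.compile(loss=mdn.get_mixture_loss_func(DIMS, N_MIXES),
                  optimizer=keras.optimizers.Adam())
rnn_model.summary()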