vlawhern / arl-eegmodels

This is the Army Research Laboratory (ARL) EEGModels Project: A Collection of Convolutional Neural Network (CNN) models for EEG signal classification, using Keras and Tensorflow

Validation accuracy of example ERP.py suddenly dropped drastically, with an additional warning

moonj94 opened this issue

I am getting the following warning:

Epoch 1/300
WARNING:tensorflow:Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f8af8da4830> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause:
WARNING: Entity <function Function._initialize_uninitialized_variables.<locals>.initialize_variables at 0x7f8af8da4830> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause:

Epoch 00001: val_loss improved from inf to 1.38645, saving model to /tmp/checkpoint_original.h5
144/144 - 2s - loss: 1.4062 - accuracy: 0.2986 - val_loss: 1.3864 - val_accuracy: 0.2222
Epoch 2/300

Epoch 00002: val_loss did not improve from 1.38645
144/144 - 0s - loss: 1.3726 - accuracy: 0.3194 - val_loss: 1.3865 - val_accuracy: 0.2361
Epoch 3/300

Epoch 00003: val_loss did not improve from 1.38645
144/144 - 0s - loss: 1.3613 - accuracy: 0.3750 - val_loss: 1.3865 - val_accuracy: 0.2361
Epoch 4/300
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
Traceback (most recent call last):

I should let you know that since yesterday, I have created my own separate script to analyze my own data. I am getting the same warning, and accuracies are low; I can't be sure whether it's because the data is actually not separable.

So just to be clear: you're not modeling the MNE sample data anymore, but you're using your own data.

Well, in my experience it can take a while to start seeing good results depending on the data you're modeling (sometimes needing 50+ epochs before the validation loss starts dropping). There are also other things to consider that could explain the behavior you're seeing, for example:

  1. the EEGNet model configuration might need to change from the default configuration for your particular data
  2. the optimizer batch size might need to change (too low and your gradients are quite noisy, too high and learning might be slow). I've generally found batch size = 64 should work in most cases for traditional BCI tasks (see the paper https://iopscience.iop.org/article/10.1088/1741-2552/aace8c, or its preprint at https://arxiv.org/abs/1611.08024 if you don't have journal access). The paper uses batch size = 64 and we got good performance across 4 different classification tasks with different feature representations.
    2a. your choice of optimizer (Adam with default parameters generally just "works" out of the box); a short sketch covering 1, 2 and 2a is included below this list
  3. the amount of data you have to train on (you generally want larger datasets, say 10K+ trials, for deep learning in general to do better than baseline approaches)
  4. how you split the data into training/validation/test sets (a cross-subject split vs. a 5-fold within-subject split, etc.)
  5. how you preprocessed the EEG data (filtering effects, sampling rate, referencing, artifact removal, etc.)

The issue you're seeing could be due to any combination of the above, or as you say, your data might not be separable.
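
To make points 1, 2 and 2a above concrete, here's a minimal sketch of a training setup. It assumes EEGModels.py from this repo is importable and that X_train/Y_train/X_val/Y_val are placeholder arrays shaped the way ERP.py prepares them (trials, channels, samples, 1) with one-hot labels; double-check the keyword names against your copy of the code:

```python
from tensorflow.keras.optimizers import Adam
from EEGModels import EEGNet

# 1. model configuration: set Chans/Samples (and later kernLength, F1, D, F2)
#    to match your own recordings rather than keeping the defaults
model = EEGNet(nb_classes=4, Chans=64, Samples=128)

# 2a. Adam with default parameters generally works out of the box
model.compile(loss='categorical_crossentropy', optimizer=Adam(),
              metrics=['accuracy'])

# 2. batch size = 64 worked well across the tasks in the EEGNet paper
model.fit(X_train, Y_train, batch_size=64, epochs=300,
          validation_data=(X_val, Y_val), verbose=2)
```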

You'll probably want to start with a more classical (i.e. non-deep-learning) approach first to see if your data is indeed separable (as evidenced by non-random classification performance), then move to more advanced approaches afterwards. The pyriemann package implements many good (in my opinion state-of-the-art) non-DL classifiers, so you could try that first; a minimal sketch is below. Any advice beyond that would be very specific to your data and use case...
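
For reference, such a baseline (similar in spirit to the xDAWN + Riemannian comparison in this repo's ERP.py example) could be sketched as follows; X and y are placeholders for your epoched data (trials x channels x samples) and integer labels, and the specific estimators are just one reasonable starting point:

```python
from pyriemann.estimation import XdawnCovariances
from pyriemann.tangentspace import TangentSpace
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: (trials, channels, samples) epoched EEG; y: integer class labels
clf = make_pipeline(XdawnCovariances(nfilter=2),     # xDAWN spatial filtering + covariance estimation
                    TangentSpace(metric='riemann'),  # project covariances to the tangent space
                    LogisticRegression())            # simple linear classifier on top

# chance-level scores here would suggest the classes are not easily separable
print(cross_val_score(clf, X, y, cv=5))
```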

I tried setting the batch size to 64 and the results I am getting are not terrible when looking at one class. Here are the confusion matrices:

[Confusion matrix: P02_confmat_eegnet_L128]
[Confusion matrix: P02_confmat_pyriemann_L128]

How could one optimize this to produce better results?
For instance, what do the parameters kernLength, F1, D, and F2 represent?

I realize this isn't an issue with the code per se, but your input would be deeply appreciated.

So a couple of things:

  1. xDAWN + Riemannian Geometry (RG) is, I believe, the state-of-the-art reference algorithm if your classification task is an event-related potential (ERP). If the data you're modeling is not an ERP then I would choose something different.

  2. For an EEG sequence at a specific sampling rate (say 128Hz), kernLength is the length of time, in samples, that you want to model. Specifically, kernLength = 64 at a sampling rate of 128Hz means you're modeling 0.5 seconds of data (64/128). A kernel length of 0.5 seconds can capture frequencies at 2Hz and above (1/0.5 seconds). So kernLength in a way sets the minimum frequency you want the model to try and capture; shorter kernLengths mean you want the model to find higher-frequency signals, while longer kernLengths mean you want to find lower-frequency signals. F1 is the number of temporal frequency filters you want to try and capture, and D specifies the number of spatial filters to learn per temporal filter; this is primarily a dimension-reduction strategy. F2 is a bit harder to explain, but conceptually it attempts to model different combinations of your temporal-spatial frequency filters (many EEG phenomena are explained by combinations of frequencies in both time and space across the scalp, so this is a way to try and model that behavior). I've found in general that F2 = F1 * D seems reasonable, but I haven't really tested this much.
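
In other words, the minimum frequency the temporal kernel can represent is roughly the sampling rate divided by kernLength. A quick (hypothetical) helper to sanity-check a choice:

```python
def min_frequency_hz(sampling_rate_hz, kern_length_samples):
    """Lowest frequency a temporal kernel of this length can span (illustrative helper)."""
    return sampling_rate_hz / kern_length_samples

print(min_frequency_hz(128, 64))  # 2.0 Hz -> kernLength 64 @ 128 Hz models 0.5 s of data
print(min_frequency_hz(128, 32))  # 4.0 Hz
print(min_frequency_hz(128, 16))  # 8.0 Hz
```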

One way to improve your results is to use what you know about the frequency content you're looking for (perhaps you don't need to model very slow signals, so you can shorten your kernLength, as is common in, say, motor imagery classification, where the most useful features are generally in the alpha (8-13Hz) and beta (13-30Hz) bands). In that example, setting kernLength = 32 or 16 (representing minimum frequencies of 4Hz and 8Hz at a 128Hz sampling rate) could produce better results. If you don't have a lot of data I'd think about fitting a smaller model (say F1, D, F2 = 4, 2, 8 respectively); a sketch is below.
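
As a sketch of that suggestion (keyword names as in EEGModels.py; nb_classes, Chans and Samples are placeholders you'd set from your own data):

```python
from EEGModels import EEGNet

# shorter kernel -> focus on faster activity; smaller F1/D/F2 -> fewer parameters for small datasets
model = EEGNet(nb_classes=4, Chans=64, Samples=128,
               kernLength=32,    # ~4 Hz and above at a 128 Hz sampling rate
               F1=4, D=2, F2=8)  # F2 = F1 * D
```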

You could also try fitting a ShallowConvNet (also part of the codebase); this is the model used in https://onlinelibrary.wiley.com/doi/full/10.1002/hbm.23730 and is very good at modeling frequency activity in EEG.

I followed your advice and attempted a more frequency-specific model. I changed the kernel length to 7 samples at 256Hz, i.e. roughly 30Hz and up (gamma), and the model performance improved:
[Confusion matrix: P02_confmat_eegnet_L128_KL7_standardize]

Is there a reason why the performance changes each time I run the code? Should it behave this way?

Also, how is validation accuracy calculated each epoch? For me it seems to climb and then start decreasing. Is this okay?

Run-to-run model variability is unfortunately expected, since TensorFlow is in general not deterministic; if you want to do some more reading on this you could check out the repo here: https://github.com/NVIDIA/framework-determinism. The issue is also exacerbated when you're working with small datasets, as EEG datasets tend to be.
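
If you want to at least reduce the run-to-run spread, you can pin the seeds before building the model (a sketch below; note this does not make GPU ops fully deterministic, which is what the linked repo discusses):

```python
import random
import numpy as np
import tensorflow as tf

# Pin the Python, NumPy and TensorFlow seeds before constructing the model.
# This narrows run-to-run variability but does not guarantee determinism on GPU.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
```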

For the validation accuracy question, this is why you often use model checkpointing, saving the best model as determined by either validation accuracy or validation loss. The behavior you're observing is pretty normal; it suggests that you're overfitting to your training data.
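
A minimal checkpointing setup, along the lines of what your training log already shows (continuing the earlier sketch; the filepath and monitored metric are placeholders you'd choose yourself):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# keep only the weights from the best epoch, judged by validation loss
checkpointer = ModelCheckpoint(filepath='/tmp/checkpoint.h5', monitor='val_loss',
                               save_best_only=True, verbose=1)

model.fit(X_train, Y_train, batch_size=64, epochs=300,
          validation_data=(X_val, Y_val),
          callbacks=[checkpointer], verbose=2)
```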

If I perform multiple runs on the same data with the same checkpoint file, does this eventually improve model performance? I am finding my accuracies can reach up to 92% now. Is this normal/expected?

So this question is a bit unclear. I'm assuming you're fitting multiple runs, each from a random EEGNet initialization, meaning:

  1. Initialize the EEGNet model: model = EEGNet(...)
  2. Fit the model and use the checkpoint callback to save the best model weights: model.fit(...)
  3. Then load the model checkpoint weights and make a prediction on your test data: model.predict(...)

Each run repeats steps 1-3. I'm not sure what you mean by "multiple runs on the same data with the same checkpoint file"; each run produces its own checkpoint file, representing the best model fit for that particular run. If you want to fit another run you have to re-initialize the model (so that it starts from random weights).
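
Putting steps 1-3 together, one run might look like this sketch (the checkpoint path and data arrays are hypothetical placeholders):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from EEGModels import EEGNet

# 1. fresh random initialization for this run
model = EEGNet(nb_classes=4, Chans=64, Samples=128)
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])

# 2. fit, checkpointing the best weights for *this* run only
ckpt = ModelCheckpoint('/tmp/run_01.h5', monitor='val_loss', save_best_only=True)
model.fit(X_train, Y_train, batch_size=64, epochs=300,
          validation_data=(X_val, Y_val), callbacks=[ckpt], verbose=2)

# 3. reload this run's best weights and predict on held-out test data
model.load_weights('/tmp/run_01.h5')
probs = model.predict(X_test)
```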

Closing due to inactivity. Please feel free to reopen if needed.