This is the repository connected to the paper "Grouped Multi-Layer Echo State Networks with Self-Normalizing Activations" (pdf available here). Below, we present additional information and experiments that were performed.
- Usage
- Architecture overview
- Grid Search Configuration
- Best Architecture Configuration
- Memory Capacity
Make sure you have torch installed from https://pytorch.org/get-started/locally/
Run
pip install auto-esn
Then navigate to the simple example, read the descriptions, copy the code into your project, and run it. For more complex examples, explore the examples folder. More "formal" documentation is in progress. In the meantime, if you have any questions or requests, don't hesitate to raise an issue.
Four different types of Echo State Networks were tested: shallow ESN, deep ESN (dESN), grouped ESN (gESN), and the generalisation of all of them, the grouped deep Echo State Network (gdESN).
Briefly, a Deep Echo State Network stacks several reservoirs one on top of another. Unlike a classical deep neural network, the outputs of all intermediate layers are concatenated to give the final result.
A Grouped ESN consists of a group of shallow ESNs whose outputs are concatenated to create the final output.
A Grouped Deep ESN combines both approaches by creating a group of Deep ESNs. The output of a gdESN is the concatenated output of all its reservoirs.
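The concatenation scheme described above can be illustrated with a minimal, pure-Python sketch. It is not the auto-esn implementation: the class names (`ToyReservoir`, `ToyDeepESN`, `ToyGroupedDeepESN`), the weight ranges, and the plain tanh update without leaking rate, sparsity, or spectral radius scaling are all assumptions made for illustration.

```python
import math
import random


def rand_matrix(rows, cols, rng, scale):
    """Dense matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ToyReservoir:
    """One reservoir with a simplified update x <- tanh(W_in u + W x)."""

    def __init__(self, n_in, n_units, rng):
        self.w_in = rand_matrix(n_units, n_in, rng, 0.5)   # assumed weight range
        self.w = rand_matrix(n_units, n_units, rng, 0.1)   # assumed weight range
        self.state = [0.0] * n_units

    def step(self, u):
        pre = [a + b for a, b in zip(matvec(self.w_in, u), matvec(self.w, self.state))]
        self.state = [math.tanh(p) for p in pre]
        return self.state


class ToyDeepESN:
    """Stack of reservoirs; each layer reads the state of the layer below it."""

    def __init__(self, n_in, n_layers, n_units, rng):
        input_sizes = [n_in] + [n_units] * (n_layers - 1)
        self.layers = [ToyReservoir(size, n_units, rng) for size in input_sizes]

    def step(self, u):
        features, signal = [], u
        for layer in self.layers:
            signal = layer.step(signal)
            features.extend(signal)  # keep every intermediate layer's state
        return features


class ToyGroupedDeepESN:
    """Independent deep ESNs fed the same input; their outputs are concatenated."""

    def __init__(self, n_in, n_groups, n_layers, n_units, seed=0):
        rng = random.Random(seed)
        self.groups = [ToyDeepESN(n_in, n_layers, n_units, rng) for _ in range(n_groups)]

    def step(self, u):
        features = []
        for group in self.groups:
            features.extend(group.step(u))
        return features


gd = ToyGroupedDeepESN(n_in=1, n_groups=2, n_layers=3, n_units=4)
print(len(gd.step([0.3])))  # 2 groups * 3 layers * 4 units = 24 features
```

In this sketch a shallow ESN is the special case of 1 group and 1 layer, a dESN has 1 group, and a gESN has 1 layer per group. A linear readout would be trained on the concatenated feature vector.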
The tanh activation function depends strongly on input scaling, while for SNA (self-normalizing activation) this effect is reduced by the normalization factor. This is why we used two different hyperparameter setups as input to the grid search. For each configuration, 5 concrete models were generated with different random seeds for weight initialization. For gESN and dESN architectures, Ng, Nl ∈ {2,3,4,5,10}, where Ng is the number of groups and Nl the number of layers. For gdESN architectures, each of the (groups, layers) configurations {(2,2),(2,3),(2,4),(2,5),(3,2),(3,3),(3,4),(4,2),(4,3),(5,2)} was used. Each tested model, shallow or decoupled, had a total of 1000 neurons (with small deviations resulting from integer subreservoir sizes). Grid search optimization was performed over all these hyperparameters and architectures. The best configuration was selected based on the minimal NRMSE score obtained on the validation set. In the main part of the experiment, which covers 1-step ahead prediction of time series, the average and minimal NRMSE on the test set were calculated for each architecture and the target hyperparameter set.
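To make the "small deviations resulting from subreservoir integer sizes" concrete, the sketch below enumerates the architecture configurations listed above and computes a subreservoir size for each. The even split with rounding down is an assumption; the repository may distribute neurons differently.

```python
TOTAL_NEURONS = 1000
SIZES = [2, 3, 4, 5, 10]

# (groups, layers) pairs per architecture family, as listed in the text.
ARCHITECTURES = {
    "ESN": [(1, 1)],
    "gESN": [(g, 1) for g in SIZES],
    "dESN": [(1, n_layers) for n_layers in SIZES],
    "gdESN": [(2, 2), (2, 3), (2, 4), (2, 5), (3, 2),
              (3, 3), (3, 4), (4, 2), (4, 3), (5, 2)],
}


def subreservoir_size(groups, layers, total=TOTAL_NEURONS):
    """Assumed rule: split the neuron budget evenly and round down."""
    return total // (groups * layers)


for name, configs in ARCHITECTURES.items():
    for groups, layers in configs:
        size = subreservoir_size(groups, layers)
        # Last column is the actual neuron count after rounding.
        print(name, groups, layers, size, size * groups * layers)
```

Under this rule a (3,3) gdESN, for example, gets 9 subreservoirs of 111 neurons, 999 in total.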
Hyperparameter | tanh | SNA |
---|---|---|
Input Scaling s | {0.1,0.5,1.0,10} | 1.0 |
Spectral Radius ρ | {0.7,0.8,0.9,1.0} | 1.0 |
Leaking Rate α | {0.7,0.8,0.9,1.0} | 1.0 |
Regularization β | {0.5,1,2} | {0.5,1,2} |
Activation Radius r | - | {50k : k ∈ {1,2,3,...,30}} |
Washout | 100 | 100 |
Total neurons | 1000 | 1000 |
Sparsity | 10% | 10% |
Weight distribution | uniform, centered around 0 | uniform, centered around 0 |
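As a rough illustration of the size of the search space, the two hyperparameter grids from the table can be expanded with `itertools.product`. The dictionary keys below are illustrative names, not auto-esn parameter names.

```python
from itertools import product

# Values copied from the table; key names are hypothetical.
TANH_GRID = {
    "input_scaling": [0.1, 0.5, 1.0, 10],
    "spectral_radius": [0.7, 0.8, 0.9, 1.0],
    "leaking_rate": [0.7, 0.8, 0.9, 1.0],
    "regularization": [0.5, 1, 2],
}

SNA_GRID = {
    "input_scaling": [1.0],
    "spectral_radius": [1.0],
    "leaking_rate": [1.0],
    "regularization": [0.5, 1, 2],
    "activation_radius": [50 * k for k in range(1, 31)],  # 50, 100, ..., 1500
}

N_SEEDS = 5  # models generated per configuration


def expand(grid):
    """Return every combination of grid values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


tanh_configs = expand(TANH_GRID)
sna_configs = expand(SNA_GRID)
print(len(tanh_configs), len(sna_configs))  # 192 90
print(len(tanh_configs) * N_SEEDS, len(sna_configs) * N_SEEDS)  # 960 450
```

These counts are per architecture configuration, so the tanh setup is roughly twice as expensive to search as the SNA setup.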
Hyperparameter sets for best results in different tasks are listed below:
MackeyGlass Series one step ahead prediction:
Architecture | Layers | Groups | Activation Radius | Regularization |
---|---|---|---|---|
ESN tanh | 1 | 1 | None | 0.5 |
dESN tanh | 4 | 1 | None | 0.5 |
gESN tanh | 1 | 2 | None | 0.5 |
gdESN tanh | 2 | 4 | None | 0.5 |
---------------------- | -------- | -------- | ------------------- | ---------------- |
ESN SNA | 1 | 1 | 100 | 0.5 |
dESN SNA | 3 | 1 | 100 | 0.5 |
gESN SNA | 1 | 10 | 200 | 0.5 |
gdESN SNA | 3 | 3 | 50 | 0.5 |
Multiple Superimposed Oscillators one step ahead prediction:
Architecture | Layers | Groups | Activation Radius | Regularization |
---|---|---|---|---|
ESN tanh | 1 | 1 | None | 0.5 |
dESN tanh | 2 | 1 | None | 0.5 |
gESN tanh | 1 | 2 | None | 0.5 |
gdESN tanh | 3 | 3 | None | 0.5 |
---------------------- | -------- | -------- | ------------------- | ---------------- |
ESN SNA | 1 | 1 | 1400 | 0.5 |
dESN SNA | 4 | 1 | 1200 | 1.0 |
gESN SNA | 1 | 3 | 1300 | 1.0 |
gdESN SNA | 3 | 2 | 1400 | 1.0 |
Sunspot Series one step ahead prediction:
Architecture | Layers | Groups | Activation Radius | Regularization |
---|---|---|---|---|
ESN tanh | 1 | 1 | None | 2.0 |
dESN tanh | 2 | 1 | None | 2.0 |
gESN tanh | 1 | 20 | None | 1.0 |
gdESN tanh | 3 | 3 | None | 2.0 |
---------------------- | -------- | -------- | ------------------- | ---------------- |
ESN SNA | 1 | 1 | 850 | 1.0 |
dESN SNA | 3 | 1 | 1400 | 1.0 |
gESN SNA | 1 | 8 | 400 | 2.0 |
gdESN SNA | 3 | 2 | 1450 | 1.0 |
Additionally, several experiments with moving average and LSTM networks were performed.
For LSTM, four architectures were tested:
Architecture 1
Architecture 2
Architecture 3
Architecture 4
Each architecture was trained for 100 epochs with the Adam optimiser. Each model was trained 5 times for each learning rate in [0.001, 0.002, 0.0005, 0.005].
Best results were obtained with:
- Architecture 2 and learning rate 0.002 for MackeyGlass
- Architecture 3 and learning rate 0.002 for Sunspot
- Architecture 4 and learning rate 0.005 for Multiple Superimposed Oscillators
The objective of this experiment was to measure how well ESN models can recall an input seen by the reservoir N steps ago, where N ∈ {1,5,10,15,...,100}. In particular, we compare different decoupled SNA architectures based on the hyperparameter configuration established in the one step ahead prediction for the MG dataset. For each N, the N-steps-behind NRMSE was averaged over 10 trials. Several phenomena observed in the results of the experiment are depicted in the figures below.
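The recall task can be sketched as follows: for a delay N, the model sees the input at time t and must output the input from time t - N. The helper names are illustrative, and the NRMSE here is normalized by the variance of the target, which is one common convention and an assumption about the definition used in the experiments.

```python
import math

# Delays from the text: 1, 5, 10, 15, ..., 100.
DELAYS = [1] + list(range(5, 101, 5))


def delayed_pairs(series, delay):
    """Return (inputs, targets) where targets[i] is the input seen `delay` steps earlier."""
    if delay < 1 or delay >= len(series):
        raise ValueError("delay must be in [1, len(series) - 1]")
    return series[delay:], series[:-delay]


def nrmse(predicted, target):
    """Root mean squared error divided by the target's standard deviation (assumed convention)."""
    mean = sum(target) / len(target)
    variance = sum((t - mean) ** 2 for t in target) / len(target)
    if variance == 0.0:
        raise ValueError("target must not be constant")
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    return math.sqrt(mse / variance)


inputs, targets = delayed_pairs([0, 1, 2, 3, 4], delay=2)
print(inputs, targets)            # [2, 3, 4] [0, 1, 2]
print(nrmse([1, 1, 1], targets))  # predicting the mean gives an NRMSE of about 1
```

With this normalization, a score near 0 means the delayed input is recalled almost perfectly, while a score near 1 is no better than always predicting the mean.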
In the case of the dESN SNA architecture, memory capacity decreases faster for a larger number of layers. It is important to note that the total number of neurons remains the same (1000).
A similar observation can be made for the gESN SNA architecture; here, however, adding more groups has a smaller influence on memory.
These conclusions are further confirmed for gdESN, where the smaller the product of layers and groups, the better the memory capacity. For gdESN SNA, we observe that the two-dimensional configuration of layers and groups (e.g. (2,4) vs. (4,2)) affects memory characteristics and can be used to find a trade-off between memory and representational power.
In general, two clear patterns emerge. First, one bigger reservoir has better Memory Capacity than two smaller ones, either stacked or grouped, as long as the total number of neurons stays the same. Second, grouped reservoirs have better memory capacity than deep ones for the same configuration, e.g. 2 layers of 500 neurons each vs. 2 groups of 500 neurons each.
For SNA architectures, only one hyperparameter from the set {input scaling s, activation radius r, spectral radius ρ} needs to be adjusted. The others can be fixed with no negative influence on the model's memory. This conclusion follows from the normalization effect of the activation radius r and is confirmed by the empirical results presented below, obtained for SNA ESN with divsr ∈ {10,20,30,40}, where divsr = s/r. For each divsr, 20 different values of s and r were selected. For each of these configurations, models with the same initial weights were trained in 10 trials. As shown in the figure below, models with the same divsr exhibit practically the same memory capacity.
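A small numerical check can make the role of the ratio s/r plausible. The sketch assumes that SNA projects the pre-activation onto a hypersphere of radius r, i.e. x ← r · a / ‖a‖ with a = W x + s · W_in · u. That formula, the weight ranges, and the input signal are assumptions for illustration, not taken from this repository. Under this assumption, multiplying both s and r by the same constant only rescales the reservoir states, which a linear readout can absorb.

```python
import math
import random


def sna_step(w, w_in, state, u, s, r):
    """Assumed SNA update: x <- r * a / ||a||, with a = W x + s * W_in * u (scalar input u)."""
    a = [sum(wij * xj for wij, xj in zip(row, state)) + s * wi * u
         for row, wi in zip(w, w_in)]
    norm = math.sqrt(sum(v * v for v in a))
    if norm == 0.0:
        return [0.0] * len(a)
    return [r * v / norm for v in a]


def run(w, w_in, inputs, s, r):
    """Drive the reservoir from a zero state and collect all visited states."""
    state = [0.0] * len(w)
    states = []
    for u in inputs:
        state = sna_step(w, w_in, state, u, s, r)
        states.append(state)
    return states


rng = random.Random(0)
n = 8
w = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
w_in = [rng.uniform(-1, 1) for _ in range(n)]
inputs = [0.5 + 0.4 * math.sin(0.3 * t) for t in range(50)]

base = run(w, w_in, inputs, s=20.0, r=1.0)    # divsr = s / r = 20
scaled = run(w, w_in, inputs, s=60.0, r=3.0)  # divsr = 20 again, both scaled by 3

# Largest deviation between the scaled run and 3x the base run.
max_dev = max(abs(3.0 * b - c)
              for xb, xc in zip(base, scaled)
              for b, c in zip(xb, xc))
print(max_dev)
```

The printed deviation should be at the level of floating-point rounding, meaning the two runs visit the same states up to a factor of 3. Changing only s or only r, so that divsr differs, produces genuinely different trajectories.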
The figure below presents Memory Capacity NRMSE scores for recalling the input seen 30 steps earlier for ESN with a fixed ratio divsr, where divsr ∈ {10,20,30,40}. For each divsr, 20 different values of s and r were chosen. All experiments were conducted on SNA ESN with fixed weights (s and r were changed at the beginning of training, but not the weights). 10 different SNA ESNs were initialized with different weights, and the results were averaged for each configuration.