johannfaouzi / pyts

A Python package for time series classification

Home Page: https://pyts.readthedocs.io


Singular Spectrum Analysis decomposition method

royalii opened this issue

Description

Hello, I am new to Python. I am trying to use the SSA decomposition method for rainfall prediction with a dataset of 21500 rows and 5 columns (21500, 5).
I used the source code below, but I do not know how to adapt it to my dataset: I get an error when changing the values of window_size, n_samples, and n_timestamps. Can anyone help me?
How can I use the main steps of SSA, including embedding, SVD, and reconstruction?

Steps/Code to Reproduce

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyts.decomposition import SingularSpectrumAnalysis

# Parameters
n_samples, n_timestamps = 100, 48
df = pd.read_csv('C:/Users/PC2/Desktop/passenger.csv', index_col=0)

# Toy dataset
rng = np.random.RandomState(41)
X = rng.randn(n_samples, n_timestamps)

# We decompose the time series into three subseries
window_size = 15
groups = [np.arange(i, i + 5) for i in range(0, 11, 5)]

# Singular Spectrum Analysis
ssa = SingularSpectrumAnalysis(window_size=15, groups=groups)
X_ssa = ssa.fit_transform(X)

# Show the results for the first time series and its subseries
plt.figure(figsize=(16, 6))

ax1 = plt.subplot(121)
ax1.plot(X[0], 'o-', label='Original')
ax1.legend(loc='best', fontsize=14)

ax2 = plt.subplot(122)
for i in range(len(groups)):
    ax2.plot(X_ssa[0, i], 'o--', label='SSA {0}'.format(i + 1))
ax2.legend(loc='best', fontsize=14)

plt.suptitle('Singular Spectrum Analysis', fontsize=20)

plt.tight_layout()
plt.subplots_adjust(top=0.88)
plt.show()




Thank you!

You can use the attached file ALLDATA_KWRA.xlsx.

Hi,
maybe you've already made some progress, but if not, here's a quick example of how to apply SSA to your data.

The most important step is to reshape the data to the right shape. pyts expects a format similar to scikit-learn: (n_samples, n_timestamps), so with your data (4, 21500). This adjustment allows the SSA to be computed correctly. There is another important point with your data: because of the daily temporal resolution, a larger window size is advisable. Since that can quickly lead to high memory consumption, you should combine it with simple subsampling (e.g. keeping only every fourth day). You can then easily decompose the temperature time series into a trend, a periodic component, and a residual via the groups parameter. If the computation runs out of memory, do not consider the whole time series, consider fewer days per year, or reduce the window_size.

Here's the code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyts.decomposition import SingularSpectrumAnalysis

df = pd.read_excel("ALLDATA_KWRA.xlsx")

window_size = 400

X = df[["Humidity", "Temp", "Wind", "PRCP"]].values.T[:, ::4]
year = df["date"].dt.year.tolist()[::4]

ssa = SingularSpectrumAnalysis(window_size=window_size, groups=[[0], np.arange(1,20), np.arange(20,400)])
X_ssa = ssa.fit_transform(X)

# -> only plotting
plt.figure(figsize=(15, 8))

plt.subplot(121)
plt.plot(X[1, :756], label="original")
plt.plot(X_ssa[1, 0, -756:], label="trend - 10s")
plt.plot(X_ssa[1, 0, :756], label="trend - 70s")
plt.plot(X_ssa[1, 1, :756], label="periodic")
plt.plot(X_ssa[1, 2, :756], label="rest")
plt.legend()
plt.xlabel("days (every fourth)")
plt.ylabel("temperature")
plt.title(f"Year {min(year[:756])} - {max(year[:756])}")
plt.ylim([-20, 35])

plt.subplot(122)
plt.plot(X[1, -756:], label="original")
plt.plot(X_ssa[1, 0, -756:], label="trend - 10s")
plt.plot(X_ssa[1, 0, :756], label="trend - 70s")
plt.plot(X_ssa[1, 1, -756:], label="periodic")
plt.plot(X_ssa[1, 2, -756:], label="rest")
plt.xlabel("days (every fourth)")
plt.title(f"Year {min(year[-756:])} - {max(year[-756:])}")
plt.legend()
plt.ylim([-20, 35])

This produces the following result:
[figure: temperature series decomposed into trend, periodic, and residual components, early vs. late period]

I hope this helps a little. If nothing else, it shows another nice application of SSA.
As a substitute for simple subsampling, other methods such as pyts' PiecewiseAggregateApproximation are also conceivable.

Best,
Lucas

Hi,

I'm very sorry for the delayed response, I totally missed the notification of your issue. Lucas gave you a complete working example that should help you get started.

On the main branch there are a few new functionalities for singular spectrum analysis.

First, you can use the chunksize parameter to process the time series in chunks in order to decrease memory usage. I ran Lucas's example and it required approximately 10 GB of RAM, which may be too much depending on the resources available to you. You could set chunksize=1 to decompose the time series one by one and decrease memory usage.

Another functionality is the automatic computation of the groups using groups="auto" to get the 3 time series: the trend, the seasonality, and the residuals.
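Combining the two, a sketch of the new API (assuming the development version from the main branch; these parameters are not available in older releases):

```python
from pyts.decomposition import SingularSpectrumAnalysis

# `chunksize` and `groups="auto"` require the development version of pyts.
ssa = SingularSpectrumAnalysis(window_size=400, groups="auto", chunksize=1)
# X_ssa = ssa.fit_transform(X)  # shape (n_samples, 3, n_timestamps):
#                               # trend, seasonality, residuals
```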

Here is the link to the latest version of singular spectrum analysis: https://pyts.readthedocs.io/en/latest/generated/pyts.decomposition.SingularSpectrumAnalysis.html#pyts.decomposition.SingularSpectrumAnalysis. To use it, you need to install the latest version of the package:

git clone https://github.com/johannfaouzi/pyts.git
cd pyts
pip install .
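Regarding your original question about the individual steps of SSA, they can be sketched in plain NumPy as follows. This is an illustrative sketch only, not the pyts implementation:

```python
import numpy as np

def ssa_decompose(x, window_size):
    """Basic SSA: embedding, SVD, and diagonal-averaging reconstruction."""
    n = len(x)
    n_windows = n - window_size + 1

    # 1) Embedding: trajectory (Hankel) matrix, shape (window_size, n_windows).
    trajectory = np.column_stack([x[i:i + window_size]
                                  for i in range(n_windows)])

    # 2) SVD of the trajectory matrix.
    u, s, vt = np.linalg.svd(trajectory, full_matrices=False)

    # 3) Reconstruction: diagonal averaging of each rank-1 matrix
    #    (entry [i, j] contributes to time point i + j).
    rows, cols = np.indices(trajectory.shape)
    components = np.empty((window_size, n))
    for i in range(window_size):
        rank1 = s[i] * np.outer(u[:, i], vt[i])
        sums = np.zeros(n)
        counts = np.zeros(n)
        np.add.at(sums, rows + cols, rank1)
        np.add.at(counts, rows + cols, 1)
        components[i] = sums / counts
    return components

# Sanity check: the elementary components sum back to the original series.
rng = np.random.RandomState(42)
x = rng.randn(48)
components = ssa_decompose(x, window_size=15)
print(np.allclose(components.sum(axis=0), x))  # True
```

Grouping (the groups parameter in pyts) then simply amounts to summing subsets of these elementary components.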

I hope that this is still helpful to you and would be happy to answer any of your questions.

Best,
Johann

Great results above! I was able to run the example with the data given, but when I replaced it with my own dataset I ran into several issues.

I am trying to apply this code to my dataset on dam operations, but I keep getting errors such as "If 'window_size' is an integer, it must be greater than or equal to 2 and lower than or equal to n_timestamps (got 400)."

Can you please help me fix this?

Attached is my file:
bookk.csv

Thanks

Hi,
nice that you tried the example above!

The error you mention is caused by the fact that your data has a maximum length of n_timestamps=360, so window_size=400 exceeds it. As soon as you set window_size <= 360, it should work. I have adapted the code snippet above to your data, using the groups="auto" parameter (mentioned above) to automatically split the components into trend/periodic/residual. To avoid unnecessary memory usage, I set window_size=50.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pyts.decomposition import SingularSpectrumAnalysis


df = pd.read_csv("bookk.csv")
df["Camontdeb(m)"] = df["Camontdeb(m)"].str.replace("..", ".", regex=False).astype(float)

tested_columns = ["V Turbiné (hm3)", "Camontdeb(m)",
                  "E. A. Produite (MWh)", "E. R. Produite", 
                  "Indice Evapo (mm)"]

X, year = df[tested_columns].values.T, pd.to_datetime(df["Date"]).dt.year.tolist()

window_size = 50
ssa = SingularSpectrumAnalysis(window_size=window_size, groups="auto")
X_ssa = ssa.fit_transform(X)

# -> only plotting
plt.figure(figsize=(10, 8))
gs = gridspec.GridSpec(2, 2)

plt.subplot(gs[0, 0])
col = 0
plt.plot(X[col, :], label="original", alpha=.4)
plt.plot(X_ssa[col, 0, :], label="trend")
plt.plot(X_ssa[col, 1, :], label="periodic")
plt.plot(X_ssa[col, 2, :], label="rest")
plt.legend()
plt.xlabel("month")
plt.title(f"{tested_columns[col]}")

plt.subplot(gs[0, 1])
col = 1
plt.plot(X[col, :] - 200, label="original - 200", alpha=.4)
plt.plot(X_ssa[col, 0] - 200, label="trend - 200")
plt.plot(X_ssa[col, 1], label="periodic")
plt.plot(X_ssa[col, 2], label="rest")
plt.xlabel("month")
plt.title(f"{tested_columns[col]}")
plt.legend()

plt.subplot(gs[1, :])
plt.title("Scaled and convolved frequencies per feature")
for i, col in enumerate(tested_columns):
    freq = np.fft.fftfreq(360, 1/12)[:180]
    fft = np.abs(np.fft.fft(X_ssa[i, 1]))[:180]
    fft = np.convolve(fft, np.array([1]*3)/3, "same")
    plt.plot(freq, fft / np.max(fft), label=col)
plt.xlim([0, 4])
plt.legend()
plt.tight_layout()

This gives a first impression of possible periodicities in your data:
[figure: decomposed series per feature and scaled, convolved frequency spectra]

I hope this helps as a starting point for further analysis. At this point I would also be interested to learn more about your data, or rather, the application behind it.

Best,
Lucas

Good evening, how are you doing sir? I hope all is well. I would need your assistance, please, to forecast using SSA. Thanks, Triumph

Bonjour Johann.Greetings Wow, I am so, so so grateful for this beautiful response I am so happy God bless you. It looks like you worked on the data set because I tried implementing and it was giving me this output error 'groups' must be either None, an integer or array-like.” If you can share the data set I will appreciate it. groups=”auto” seems not to work for me. But when I do groups = 5 or any number less than the window size I get results. Attached is the attempt @ groups=5. This are the interested columns for now; Date; Vévap.; Virrig., Vdéversé; Ventrant; Camontdeb(m); H pluie (mm) and Indice Evapo (mm). I still have some data set which I would like to use the SSA for statistical analysis. I would love to work with you on my data analysis, I am working on my masters research on the bagre dam operation the impact of rainfall extremes and geological features on the spill way and embarkment of the dam downstream. So with this data set on the dam operation, I want to look at the statistical analysis using (SSA) to get the trend, periodicity and noise in the data for projection and to help in directing my Modeling. I will communicate with my supervisor on this results and come up with the next line of action. Can we arrange a zoom or google meet so I can discuss with you more about the data set? Seriously you did a great job here and I am so happy for this assistance.