I need help understanding the terminology in the docs

Question

BrannonKing opened this issue 4 years ago

The documentation commonly uses this tuple: (samples, timestamps). That doesn't make any sense in my brain as I've always thought of those being the same thing. If I'm sampling something, I'm reading a sensor value periodically. I could create a timestamp for that sample, but I also have the sensor's value at that time. My input data is (samples, sensor values). It has one row for each time I read the sensors, and a column for the value of each sensor. I think this is called the "wide" data format. Is pyts compatible with the wide data format? Or is there an easy way to transform my data into something compatible with pyts?

Johann Faouzi · Answer 1 · Wed Jan 20 2021 18:36:26 GMT+0800 (China Standard Time)

Hi,

The terminology and the API aim at being compatible with scikit-learn. In pyts, a sample is a time series (univariate or multivariate). As pyts is mainly focused on machine learning for time series, most algorithms require several time series as input. Thus, (almost) all the implementations in this package consider a set of time series (sensors) as input. If you only have one time series, you just have to create an iterable (list, tuple, numpy.array) with a single element.

It's not very clear to me if your sensor outputs univariate or multivariate values. If your sensor outputs univariate data, your data is one-dimensional. In this case, you need to make it two-dimensional (like turning a vector into a matrix with one row). If your sensor outputs multivariate data, your data is two-dimensional (one dimension for each feature, one dimension for the time). In that case, you need to make it three-dimensional (like turning a matrix into a tensor whose first dimension is equal to 1).
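As an illustration of the reshaping described above, here is a minimal NumPy sketch. The array contents and variable names are made up for the example; only the resulting shapes matter.

```python
import numpy as np

# Hypothetical univariate recording: one sensor, 6 readings (1D).
univariate = np.array([0.1, 0.4, 0.3, 0.8, 0.7, 0.9])
# Turn the vector into a matrix with a single row: (1, n_timestamps).
X_uni = univariate.reshape(1, -1)

# Hypothetical multivariate recording: 3 features, 6 time steps (2D),
# laid out as (n_features, n_timestamps).
multivariate = np.arange(18, dtype=float).reshape(3, 6)
# Add a leading axis of length 1: (1, n_features, n_timestamps).
X_multi = multivariate[np.newaxis, ...]

print(X_uni.shape)    # (1, 6)
print(X_multi.shape)  # (1, 3, 6)
```

Neither operation copies or changes the values; both only add an axis of length 1 for the single sample.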

Let me know if my answer is clear or not. If you could give me a toy example (with fake data of course) and which algorithms you plan to use, I could probably give you a better explanation.

Brannon King · Answer 2 · Thu Jan 21 2021 00:07:57 GMT+0800 (China Standard Time)

Yes, my data is 2D. You're saying it should all work if I transform (time, sensor) into (1, time, sensor) ?

Johann Faouzi · Answer 3 · Thu Jan 21 2021 01:34:20 GMT+0800 (China Standard Time)

Well, it depends on which algorithms you want to use, because you cannot perform supervised learning if you only have one label. The last axis is always for the time, so it would be (sensor, time) (if you consider that you have several univariate time series) or (1, sensor, time) (if you consider that you have 1 multivariate time series). But with only one multivariate time series, you will not be able to perform supervised learning (and there are fewer tools implemented in this package for multivariate time series because the literature on multivariate time series classification is much scarcer than that on univariate time series classification).
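For data stored in the "wide" layout from the question (one row per time step, one column per sensor), a transpose is enough to put time on the last axis. This is only a sketch with fake data; the sizes and variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical wide-format data: 5 time steps (rows) x 3 sensors (columns).
n_timestamps, n_sensors = 5, 3
wide = np.arange(n_timestamps * n_sensors, dtype=float).reshape(n_timestamps, n_sensors)

# Option 1: several univariate time series, shape (sensor, time).
X_univariate_set = wide.T

# Option 2: one multivariate time series, shape (1, sensor, time).
X_one_multivariate = wide.T[np.newaxis, :, :]

print(X_univariate_set.shape)    # (3, 5)
print(X_one_multivariate.shape)  # (1, 3, 5)
```

In both options the value read from sensor `s` at time step `t` moves from `wide[t, s]` to position `[s, t]` on the last two axes.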

Brannon King · Answer 4 · Thu Jan 21 2021 04:08:21 GMT+0800 (China Standard Time)

I have a label for every time step (except the present). In post-analysis (using future data), I know whether or not I should have done something at that time. Sorry I didn't clarify that at the start. My goal is forecasting on a multivariate time series.

Johann Faouzi · Answer 5 · Thu Jan 21 2021 16:34:05 GMT+0800 (China Standard Time)

If you have a label for every time step and want to predict the label at a future time step, I think that this task is more commonly referred to as regression for (binary) time series. Here are a few references for binary time series:

Kedem, Benjamin, and Konstantinos Fokianos. Regression Models for Binary Time Series. Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications. https://doi.org/10.1007/0-306-48102-2_9.
Klingenberg, Bernhard. Regression Models for Binary Time Series with Gaps. Computational Statistics & Data Analysis. https://doi.org/10.1016/j.csda.2008.01.019.
Paul, Erina, Arnab Kumar Maity, and Raju Maiti. Bayesian Comparative Study on Binary Time Series. Journal of Statistical Computation and Simulation. https://doi.org/10.1080/00949655.2018.1488256.

Unfortunately, pyts does not provide tools dedicated to this kind of task. You could try to turn your problem into a time series classification task, but it may not be optimal.
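One possible way to recast the problem as classification is to cut the long recording into overlapping windows and treat each window as one sample. The sketch below uses plain NumPy and is not a pyts feature; the helper `make_windows`, the window length, and the choice to label each window with the label of its last time step are all assumptions made for illustration.

```python
import numpy as np

def make_windows(wide, labels, window):
    """Cut a (time, sensor) array into overlapping windows.

    Hypothetical helper, not part of pyts. Returns X with shape
    (n_windows, n_sensors, window) and y with shape (n_windows,).
    Each window is labelled with the label of its last time step
    (an assumption; another alignment may suit the task better).
    """
    n_timestamps = wide.shape[0]
    n_windows = n_timestamps - window + 1
    # Transpose each slice so that time ends up on the last axis.
    X = np.stack([wide[i:i + window].T for i in range(n_windows)])
    y = labels[window - 1:]
    return X, y

# Fake data: 8 time steps, 2 sensors, one binary label per time step.
wide = np.arange(16, dtype=float).reshape(8, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1])

X, y = make_windows(wide, labels, window=4)
print(X.shape)  # (5, 2, 4)
print(y)        # [0 1 1 0 1]
```

Consecutive windows overlap heavily, so the resulting samples are far from independent; any train/test split should be made along time rather than at random.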