starting with "wide" data

Question

BrannonKing opened this issue 3 years ago

If I start with the wide data format, a 2d array of samples (rows) by sensor readings (columns), what is the right way to transform that to fit the requirements of this library?

angus924 · Answer 1 · Wed Jan 20 2021 05:49:13 GMT+0800 (China Standard Time)

Hi @BrannonKing, if I understand correctly, your data is already in the right format: a 2d numpy array (np.float32) in which each row is a time series and, within a row, each column is a reading at a particular point in time. In other words, we could take a row from the array, plot it, and we would be plotting a single time series.
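
To make the layout described above concrete, here is a minimal sketch that builds a toy array in that shape. The sine waves, sizes, and variable names are illustrative assumptions only, not anything from the library.

```python
import numpy as np

n_series, length = 4, 50  # hypothetical sizes for the toy example

# Shared time axis and one frequency per series.
t = np.linspace(0.0, 1.0, length, dtype=np.float32)
freqs = np.arange(1, n_series + 1, dtype=np.float32)

# Row i holds one complete series: a sine wave with frequency freqs[i].
# Columns are successive points in time.
X = np.sin(2.0 * np.pi * freqs[:, None] * t[None, :]).astype(np.float32)

# Taking a single row gives a single time series that could be plotted against t.
one_series = X[0]
```

Here `X` has one row per series and one column per time step, so `X[0]` is a single series of 50 readings.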

Does this make sense? I may be misunderstanding the issue.

Brannon King · Answer 2 · Wed Jan 20 2021 11:39:05 GMT+0800 (China Standard Time)

I'm glad it's in the right format; I'll give it a go. According to Google, a "time series" is "a series of values of a quantity obtained at successive times". Hence, we would never refer to simultaneous sensor readings as a "time series", true? I still think it's a strange and likely incorrect use of that term.

angus924 · Answer 3 · Wed Jan 20 2021 12:46:02 GMT+0800 (China Standard Time)

Hi @BrannonKing, sorry, I think I misunderstood what you were saying, and so what I said before may be wrong. Could you describe your task and data a bit more, and I can think about how to make it work?

E.g., does each column represent a different sensor?

Brannon King · Answer 4 · Thu Jan 21 2021 00:09:03 GMT+0800 (China Standard Time)

Yes, each column represents a different sensor but they're all read and logged at the same time. I have a separate "label" for each row (that can be computed based on future data).

angus924 · Answer 5 · Thu Jan 21 2021 10:17:53 GMT+0800 (China Standard Time)

Ok. Would you mind if I ask you a couple more questions, so that I can understand a bit better?

Each row is a different set of readings for all the sensors? Is there a clear temporal relationship between the rows (e.g., row 2 is a set of readings taken 5 minutes after the set of readings in row 1)?
What is the task? Are you trying to classify new/unseen rows?
Is there some kind of spatial relationship between the sensors which is reflected in the ordering of the columns or, to put it another way, would it make any difference if you shuffled the order of the columns?

MiniRocket is intended to extract features from 1d data where there is some kind of temporal (or spatial) relationship between values. (You don't need labels necessarily, depending on your task.) If your rows represent different sets of readings over time, then each column would be a time series. Alternatively, if there is a spatial relationship between the sensors/columns then, while not a time series as such, there would be the same kind of underlying structure.

Brannon King · Answer 6 · Fri Jan 22 2021 01:37:13 GMT+0800 (China Standard Time)

Yes, there is a temporal relationship between the rows. They come x minutes apart, and I was trying to predict the coming row. There is no relationship between the sensors; I could reorder the columns. If I understand what you're saying, I should create the features for each column separately.

angus924 · Answer 7 · Fri Jan 22 2021 19:22:22 GMT+0800 (China Standard Time)

Sorry for the slow reply.

Alright, so if I understand correctly, you can treat each column as a time series. In order to use MiniRocket, you should transpose your input, so that each row holds the readings of one sensor over time. You'll need to choose an appropriate model to train using the features produced by MiniRocket, e.g., a regressor (such as RidgeCV from scikit-learn) if you want to predict the next reading for each sensor.
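
As a rough sketch of the data preparation this implies, the snippet below transposes a wide array and then cuts each sensor's series into fixed-length windows paired with the reading that follows. The windowing step, the window length of 16, and the helper names `wide_to_series` and `make_windows` are assumptions made for illustration; the thread itself only says to transpose the input.

```python
import numpy as np

def wide_to_series(wide):
    """Turn (n_samples, n_sensors) readings into (n_sensors, n_samples) series."""
    wide = np.asarray(wide)
    if wide.ndim != 2:
        raise ValueError("expected a 2d array of samples by sensors")
    # Transpose so each row is one sensor's readings over time, stored as a
    # contiguous float32 array (the dtype mentioned earlier in the thread).
    return np.ascontiguousarray(wide.T, dtype=np.float32)

def make_windows(series, window):
    """Pair every window of `window` readings with the reading that follows it."""
    n_series, length = series.shape
    if length <= window:
        raise ValueError("need more readings than the window length")
    # Shape: (n_series, length - window + 1, window)
    views = np.lib.stride_tricks.sliding_window_view(series, window, axis=1)
    # Drop the last window of each series: nothing follows it to use as a target.
    X = views[:, :-1, :].reshape(-1, window)
    y = series[:, window:].reshape(-1)
    return np.ascontiguousarray(X), y

# Toy data: 200 samples (rows) from 3 sensors (columns).
rng = np.random.default_rng(0)
wide = rng.normal(size=(200, 3))

series = wide_to_series(wide)          # shape (3, 200)
X, y = make_windows(series, window=16)  # shapes (552, 16) and (552,)
```

Under these assumptions, each row of `X` is a short time series that could be transformed with MiniRocket, and `y` holds the next reading that a regressor such as RidgeCV would be trained to predict from the resulting features.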

A couple of things to keep in mind:

The current implementation of MiniRocket is not optimised for streaming data, so it could be very inefficient if you have to recompute the features every time you acquire new sensor readings (ideally, you would just update the features with minimal computational expense as the next reading comes in).
The features produced by MiniRocket may or may not be effective for your task (which sounds like regression/forecasting, rather than classification).
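
One simple way to bound the cost in the streaming case is to keep only the most recent readings per sensor in a fixed-size buffer and recompute features on that buffer alone. The sketch below is a hypothetical helper (`RollingWindow` is not part of MiniRocket) and does not make the transform itself incremental; it only limits how much data is re-transformed per update.

```python
import numpy as np

class RollingWindow:
    """Keep the most recent `window` readings for each sensor (hypothetical helper)."""

    def __init__(self, n_sensors, window):
        # One row per sensor, one column per time step, oldest reading first.
        self.buffer = np.zeros((n_sensors, window), dtype=np.float32)
        self.count = 0

    def push(self, reading):
        """Append one simultaneous reading from all sensors."""
        reading = np.asarray(reading, dtype=np.float32)
        if reading.shape != (self.buffer.shape[0],):
            raise ValueError("expected one value per sensor")
        # Shift every row left by one step and put the newest reading last.
        self.buffer[:, :-1] = self.buffer[:, 1:].copy()
        self.buffer[:, -1] = reading
        self.count += 1

    def ready(self):
        """True once the buffer has been filled with real readings."""
        return self.count >= self.buffer.shape[1]

    def latest(self):
        """Return a copy of the current window, shape (n_sensors, window)."""
        return self.buffer.copy()

# Toy stream: 2 sensors, 5 readings, window of 3.
rolling = RollingWindow(n_sensors=2, window=3)
for i in range(5):
    rolling.push([i, 10 * i])

current = rolling.latest()
```

After the five pushes, `current` holds the last three readings of each sensor, which is the only data that would need to be transformed again when the next reading arrives.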

In any case, please let me know if you have any further questions, or if I can help in any way. I'd be very interested to hear whether or not MiniRocket is effective for you.