promised-ai / lace

A probabilistic ML tool for science

read in date-time columns

firekg opened this issue

A lot of datasets have date (or date-time) columns. Right now lace treats them as categorical (since they are read in as str). This is not ideal, both in terms of the number of distinct dates/times a categorical column can represent and in terms of the actual semantics of a date (which is more like a continuous variable).

I think we should definitely handle dates and date-times. Internally, we'd have to represent a date-time as some sort of collection of columns that break it down into its components. For example, we'd have

  • day of week: categorical?
  • day of month: categorical?
  • year: count?
  • etc
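
To make the breakdown above concrete, here is a minimal sketch in plain Python (standard library only). It is not lace's API: the function names `decompose` and `recompose` and the column names are hypothetical, chosen only for illustration. It also shows that the components can be converted back into the original datetime exactly, at second resolution.

```python
from datetime import datetime


def decompose(ts: datetime) -> dict:
    """Split a datetime into candidate feature columns (hypothetical names)."""
    return {
        "year": ts.year,            # could be modeled as a count
        "month": ts.month,          # 1..12
        "day_of_month": ts.day,     # 1..31
        # Python's weekday() uses Monday = 0; shift so that Sunday = 0 and
        # Saturday = 6, matching the convention used in this issue.
        "day_of_week": (ts.weekday() + 1) % 7,
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
    }


def recompose(parts: dict) -> datetime:
    """Rebuild the datetime; day_of_week is redundant and is not needed."""
    return datetime(
        parts["year"], parts["month"], parts["day_of_month"],
        parts["hour"], parts["minute"], parts["second"],
    )


ts = datetime(2023, 1, 1, 12, 30, 15)  # a Sunday
parts = decompose(ts)
print(parts["day_of_week"])      # 0
print(recompose(parts) == ts)    # True
```

Sub-second precision and time zones are ignored here; a real implementation would need to decide how to carry them.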

Then there is the cyclic nature of dates and times. Sunday is close to Saturday, but with an integer encoding Sunday = 0 and Saturday = 6, so they end up at opposite ends of the range.

We should think about how we would represent this. We also need to represent it so that it can be exactly converted back into a date(time). Please add suggestions. If we need to add a new model (e.g. cyclic) to make this work, feel free to propose it. We can add another issue for that.

Hi! I came here to create a similar issue around date-time columns but I think this captures it.

I've been experimenting with the Lace package and I'm really enjoying it. Most of the data I work with is time-series sensor data, so I'm looking for a recommendation on how best to prepare that data for Lace.

Is it better to leave it as is (categorical, as noted above), convert it to a sequential integer index, or break it out into several features as augment_timeseries_signature or tsfresh do?

Maybe it all depends on my use case, but I wanted to get your thoughts.

Thanks!

Hi @joshualeond - glad you're enjoying lace!

The rows of the table are modeled as independent observations, so the way we typically handle time series is by keeping a certain amount of history and lookahead in the columns. For example, for sensor data a row might look like this:

time_at_t0, t_minus_n, ..., t_minus_2, t0, t_plus_1, ..., t_plus_m, ...

Here I've used n to represent the number of timesteps back and m for the number of timesteps forward. You can of course use whatever granularity of data you like.
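
As a rough sketch of that layout, the snippet below turns a single sensor series into rows with n steps of history and m steps of lookahead. It uses plain Python lists; the function name `window_rows` is hypothetical, and nothing here is part of lace itself. The resulting list of dicts could then be written to a CSV or loaded into a dataframe.

```python
def window_rows(times, values, n_back, m_forward):
    """Build one row per timestep that has full history and lookahead.

    Column names follow the layout described above: time_at_t0,
    t_minus_n, ..., t_minus_1, t0, t_plus_1, ..., t_plus_m.
    """
    rows = []
    # Skip the first n_back and last m_forward steps, which lack full context.
    for i in range(n_back, len(values) - m_forward):
        row = {"time_at_t0": times[i]}
        for k in range(n_back, 0, -1):
            row[f"t_minus_{k}"] = values[i - k]
        row["t0"] = values[i]
        for k in range(1, m_forward + 1):
            row[f"t_plus_{k}"] = values[i + k]
        rows.append(row)
    return rows


times = [0, 1, 2, 3, 4, 5]
values = [10, 11, 12, 13, 14, 15]
rows = window_rows(times, values, n_back=2, m_forward=1)
print(len(rows))   # 3
print(rows[0])
# {'time_at_t0': 2, 't_minus_2': 10, 't_minus_1': 11, 't0': 12, 't_plus_1': 13}
```

Note that consecutive rows overlap heavily, so they are not truly independent; this sketch only illustrates the column layout.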

The best way to represent a datetime depends on your application. You might represent it as the number of hours since an experiment started, or you can break it into several features, depending on which components of the datetime share information with the things you're interested in. For example, you could use a categorical day of the week or a float proportion of the week. If the cyclic nature of days/weeks/months/years is important, you can apply sin and cos to the proportion multiplied by 2π, which gives two continuous columns per cyclic component.
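
A minimal sketch of the sin/cos idea, using only Python's `math` module. The helper names `encode_cyclic` and `decode_cyclic` are hypothetical. The example uses day of week with Sunday = 0 and Saturday = 6, and shows that the encoded Sunday ends up closer to Saturday than to Wednesday, and that the pair can be mapped back to the original value with `atan2`.

```python
import math


def encode_cyclic(value, period):
    """Map a value in [0, period) to a point on the unit circle."""
    angle = 2.0 * math.pi * (value / period)
    return math.sin(angle), math.cos(angle)


def decode_cyclic(s, c, period):
    """Invert encode_cyclic, up to floating-point error."""
    angle = math.atan2(s, c) % (2.0 * math.pi)
    return angle / (2.0 * math.pi) * period


sunday = encode_cyclic(0, 7)
wednesday = encode_cyclic(3, 7)
saturday = encode_cyclic(6, 7)

# On the circle, Sunday is nearer to Saturday than to Wednesday.
print(math.dist(sunday, saturday) < math.dist(sunday, wednesday))  # True

# Round trip back to the day index (rounded to absorb float error).
print(round(decode_cyclic(*saturday, 7)))  # 6
```

The two encoded columns would be modeled as ordinary continuous features. A model that treats them independently does not know they lie on a circle, so values simulated from it may need to be projected back (which `atan2` does implicitly) before decoding.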