[BUG] `_channels_to_dataframe` assigns wrong timestamps if channels have different start times
ethan-tau opened this issue · comments
When reading a TDMS file whose channels are sampled at different rates and start at different times, synchronization between channels is lost if `_channels_to_dataframe` is called with `absolute_time=False`. Each channel gets a time index relative to its own first sample, rather than relative to the earliest timestamp across all of the specified channels.
A proposed fix is to read every channel with `index = channel.time_track(absolute_time=True) if time_index else None`, and then compute times relative to the group when `_channels_to_dataframe` is called with `absolute_time=False`.
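The idea behind this fix can be sketched without npTDMS. The following is a minimal illustration, assuming each channel's absolute time track is available as a non-empty NumPy `datetime64` array. `relative_to_group_start` is a hypothetical helper, not an npTDMS function, and the two channels are made up.

```python
import numpy as np


def relative_to_group_start(time_tracks):
    """Convert absolute time tracks to seconds since the earliest sample.

    time_tracks maps a channel name to a non-empty datetime64 array.
    (Hypothetical helper for illustration; not an npTDMS function.)
    """
    # The shared origin is the earliest first timestamp across all channels.
    group_start = min(track[0] for track in time_tracks.values())
    one_second = np.timedelta64(1, "s")
    return {
        name: (track - group_start) / one_second
        for name, track in time_tracks.items()
    }


# Two made-up channels with different start times and sample intervals.
slow_start = np.datetime64("2022-02-01T01:30:33.058532")
fast_start = np.datetime64("2022-02-01T01:30:33.268115")
time_tracks = {
    "slow": slow_start + np.arange(3) * np.timedelta64(20, "ms"),
    "fast": fast_start + np.arange(3) * np.timedelta64(1, "us"),
}

relative = relative_to_group_start(time_tracks)
print(relative["slow"][0])  # the earliest channel defines the origin
print(relative["fast"][0])  # roughly 0.21 s after the origin, not 0.0
```

With a shared origin, the later-starting channel keeps its offset instead of being shifted to time zero.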
For example, consider a TDMS file with several channels sampled at three different data rates and with seven different start times:
import nptdms
import pandas as pd

tdms_file_path = "my_local_data_file.tdms"
tdms_file = nptdms.TdmsFile.open(tdms_file_path)
groups = tdms_file.groups()
channels = groups[0].channels()

# Collect the absolute time track of every channel in the first group.
time_tracks = {}
for channel in channels:
    try:
        time_tracks[channel.name] = channel.time_track(absolute_time=True)
    except KeyError:
        # Skip channels for which no time track can be built.
        pass

# Summarise each channel: sample count, first timestamp, last timestamp.
time_info_df = pd.DataFrame(
    [(k, len(v), v[0], v[-1]) for k, v in time_tracks.items()],
    columns=['name', 'samples', 't_zero', 't_final'],
)
This results in:
Row | name | samples | t_zero | t_final |
---|---|---|---|---|
0 | Channel A | 1000000 | 2022-02-01 01:30:33.268115 | 2022-02-01 01:30:34.268113999 |
1 | Channel B | 1000000 | 2022-02-01 01:30:33.268115 | 2022-02-01 01:30:34.268113999 |
2 | Channel C | 62500 | 2022-02-01 01:30:33.268126 | 2022-02-01 01:30:34.268109999 |
3 | Channel X | 62500 | 2022-02-01 01:30:33.268130 | 2022-02-01 01:30:34.268113999 |
4 | Channel X | 62500 | 2022-02-01 01:30:33.268134 | 2022-02-01 01:30:34.268117999 |
5 | Channel X | 50 | 2022-02-01 01:30:33.058532 | 2022-02-01 01:30:34.038532000 |
6 | Channel X | 50 | 2022-02-01 01:30:33.062192 | 2022-02-01 01:30:34.042192000 |
7 | Channel X | 50 | 2022-02-01 01:30:33.065852 | 2022-02-01 01:30:34.045852000 |
With `time_index=True` and the default `absolute_time=False`, `_channels_to_dataframe` left-justifies the data on a channel-by-channel basis. As a result, an event sampled at an absolute time of 01:30:33.058532 on one channel is reported at the same relative time (time = 0) as an event sampled at 01:30:33.268115 on another channel.
Current implementation:
def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd

    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        # Each channel's index is built independently of the other channels.
        index = channel.time_track(absolute_time) if time_index else None
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    return pd.DataFrame.from_dict(dataframe_dict)
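The effect of the per-channel index can be reproduced with pandas alone. This is a small sketch with made-up sample values and two made-up channels; it does not call npTDMS, and it only assumes that each channel's absolute time track can be held in a `DatetimeIndex`.

```python
import pandas as pd

slow_start = pd.Timestamp("2022-02-01 01:30:33.058532")
fast_start = pd.Timestamp("2022-02-01 01:30:33.268115")

# Absolute time tracks for two made-up channels with two samples each.
slow_abs = pd.DatetimeIndex([slow_start, slow_start + pd.Timedelta(milliseconds=20)])
fast_abs = pd.DatetimeIndex([fast_start, fast_start + pd.Timedelta(microseconds=1)])

# Per-channel relative time: every channel starts at 0.0 s.
per_channel = pd.DataFrame({
    "slow": pd.Series([1.0, 2.0], index=(slow_abs - slow_abs[0]).total_seconds()),
    "fast": pd.Series([10.0, 20.0], index=(fast_abs - fast_abs[0]).total_seconds()),
}).sort_index()

# Shared time base: keep absolute times, then subtract the earliest one.
shared = pd.DataFrame({
    "slow": pd.Series([1.0, 2.0], index=slow_abs),
    "fast": pd.Series([10.0, 20.0], index=fast_abs),
}).sort_index()
shared.index = (shared.index - shared.index.min()).total_seconds()

print(per_channel)  # both channels have a value in the row at 0.0 s
print(shared)       # "fast" is NaN at 0.0 s and first appears about 0.21 s in
```

In `per_channel` the first samples of both channels share a row even though they were taken about 0.21 s apart. In `shared` they land on separate rows.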
Proposed implementation:
def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd

    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        # This try/except block deals with a particular group of TDMS files
        # I encountered that don't play nicely. Maybe it's better to allow the
        # code to throw an exception, rather than silently discarding the channels.
        # If that's the case, only keep the `index = channel.....` line
        try:
            # Always read absolute times so that all channels share one time base.
            index = channel.time_track(absolute_time=True) if time_index else None
        except KeyError:
            # time_track is only called when time_index is set, so this channel
            # has no usable time information: skip it.
            continue
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    df = pd.DataFrame.from_dict(dataframe_dict)
    if time_index and not absolute_time and len(df.index) > 0:
        # Make the index relative to the earliest timestamp across all channels.
        # Note: subtracting timestamps yields time deltas, not float seconds.
        df.index = df.index - df.index.min()
    return df
Hi, thanks for the bug report. Yes, I think your proposed behaviour makes more sense, but it would be a breaking change, since other people may be relying on the existing behaviour. Rather than changing the default, it would be more appropriate to add a new parameter that makes the new approach opt-in. For a future 2.0 release we could then consider making the new behaviour the default.
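One possible shape for such an opt-in parameter is sketched below. Everything here is an assumption for illustration: the function `tracks_to_dataframe` and the flag name `common_time_origin` are hypothetical and do not exist in npTDMS, and the sketch works on plain arrays rather than on TDMS channels.

```python
import numpy as np
import pandas as pd


def tracks_to_dataframe(data, time_tracks, absolute_time=False, common_time_origin=False):
    """Build a DataFrame from per-channel values and absolute time tracks.

    `common_time_origin` is a hypothetical opt-in flag: when it is False the
    existing per-channel relative index is kept, so current callers see no
    change. Every time track is assumed to be non-empty.
    """
    columns = {}
    for name, values in data.items():
        track = pd.DatetimeIndex(time_tracks[name])
        if absolute_time or common_time_origin:
            index = track  # keep absolute times so channels stay aligned
        else:
            index = (track - track[0]).total_seconds()  # existing behaviour
        columns[name] = pd.Series(values, index=index)

    df = pd.DataFrame(columns).sort_index()
    if common_time_origin and not absolute_time:
        # Seconds since the earliest sample across all channels.
        df.index = (df.index - df.index.min()).total_seconds()
    return df


# Made-up channels: "fast" starts 209583 microseconds after "slow".
start = np.datetime64("2022-02-01T01:30:33.058532")
time_tracks = {
    "slow": start + np.arange(2) * np.timedelta64(20, "ms"),
    "fast": start + np.timedelta64(209583, "us") + np.arange(2) * np.timedelta64(1, "us"),
}
data = {"slow": [1.0, 2.0], "fast": [10.0, 20.0]}

legacy = tracks_to_dataframe(data, time_tracks)
opt_in = tracks_to_dataframe(data, time_tracks, common_time_origin=True)
```

Because the flag defaults to `False`, `legacy` reproduces the per-channel index, while `opt_in` keeps the offset between the two channels. Flipping the default later would be the breaking change reserved for a major release.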