[BUG] `_channels_to_dataframe` assigns wrong timestamps if channels have different start times

Question

ethan-tau opened this issue 2 years ago · comments

When reading a TDMS file with channels sampled at different rates and starting at different times, synchronization between channels is lost if _channels_to_dataframe is called with absolute_time=False. Each channel gets a time index relative to its own first sample, rather than relative to the earliest sample across all of the specified channels.

A possible fix is to always read each channel with index = channel.time_track(absolute_time=True) if time_index else None, and then, if _channels_to_dataframe was called with absolute_time=False, convert the index to a time relative to the earliest sample in the group.

For example, take a TDMS file whose channels are sampled at three different data rates and start at seven different times:

import nptdms
import pandas as pd

tdms_file_path = "my_local_data_file.tdms"
tdms_file = nptdms.TdmsFile.open(tdms_file_path)
groups = tdms_file.groups()
channels = groups[0].channels()

# Collect the absolute time track of every channel in the first group
time_tracks = {}
for channel in channels:
    try:
        time_tracks[channel.name] = channel.time_track(absolute_time=True)
    except KeyError:
        # time_track raised KeyError for this channel, so skip it
        pass

# Summarize each channel: sample count, first timestamp, last timestamp
time_info_df = pd.DataFrame(
    [(k, len(v), v[0], v[-1]) for k, v in time_tracks.items()],
    columns=['name', 'samples', 't_zero', 't_final'],
)

Which results in

Row	name	samples	t_zero	t_final
0	Channel A	1000000	2022-02-01 01:30:33.268115	2022-02-01 01:30:34.268113999
1	Channel B	1000000	2022-02-01 01:30:33.268115	2022-02-01 01:30:34.268113999
2	Channel C	62500	2022-02-01 01:30:33.268126	2022-02-01 01:30:34.268109999
3	Channel X	62500	2022-02-01 01:30:33.268130	2022-02-01 01:30:34.268113999
4	Channel X	62500	2022-02-01 01:30:33.268134	2022-02-01 01:30:34.268117999
5	Channel X	50	2022-02-01 01:30:33.058532	2022-02-01 01:30:34.038532000
6	Channel X	50	2022-02-01 01:30:33.062192	2022-02-01 01:30:34.042192000
7	Channel X	50	2022-02-01 01:30:33.065852	2022-02-01 01:30:34.045852000

With time_index=True and the default absolute_time=False, _channels_to_dataframe left-justifies the data on a channel-by-channel basis. As a result, an event sampled at an absolute time of 01:30:33.058532 on one channel is reported at the same relative time (time = 0) as an event sampled at 01:30:33.268115 on another channel.
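To put a number on the misalignment, here is a minimal sketch using only the standard library. It takes the t_zero values from rows 0 and 5 of the table above and compares per-channel relative time with time measured from the earliest start in the group. The dictionary labels are made up for illustration, and nothing here calls npTDMS.

```python
from datetime import datetime

# t_zero values copied from rows 0 and 5 of the table above
# (the dictionary keys are illustrative labels only)
t_zero = {
    "row 0": datetime(2022, 2, 1, 1, 30, 33, 268115),
    "row 5": datetime(2022, 2, 1, 1, 30, 33, 58532),
}

# Per-channel relative time: each channel is measured from its own first
# sample, so every channel's first sample lands at 0.0 s
per_channel = {name: (t - t).total_seconds() for name, t in t_zero.items()}

# Group-relative time: every channel is measured from the earliest start
group_start = min(t_zero.values())
group_relative = {
    name: (t - group_start).total_seconds() for name, t in t_zero.items()
}

print(per_channel)     # both first samples reported at 0.0 s
print(group_relative)  # row 0 starts about 0.21 s after row 5
```

With per-channel relative time, the roughly 0.21 s offset between these two channels disappears from the index, which is the loss of synchronization described above.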

Present implementation:

def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd


    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        index = channel.time_track(absolute_time) if time_index else None
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    return pd.DataFrame.from_dict(dataframe_dict)

Proposed implementation:

def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd

    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        # This try/except handles a particular group of TDMS files I encountered
        # that don't play nicely (time_track raises KeyError). It may be better to
        # let the exception propagate rather than silently discard those channels.
        # If that's the case, only keep the `index = channel.....` line
        try:
            index = channel.time_track(absolute_time=True) if time_index else None
        except KeyError:
            # Only reachable when time_index is True: skip this channel
            continue
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    df = pd.DataFrame.from_dict(dataframe_dict)
    if time_index and not absolute_time:
        # Make the index relative to the earliest sample across all channels
        df.index -= df.index.min()
    return df

Adam Reeve · Answer 1 · Sun Feb 20 2022 08:04:49 GMT+0800 (China Standard Time)

Hi, thanks for the bug report. Yes, I think your proposed behaviour makes more sense, although it would be a breaking change, as other people may be relying on the existing behaviour. So rather than changing the default, it would be more appropriate to add a new parameter that makes the new approach opt-in. Then, for a future 2.0 release, we could consider making the new behaviour the default.
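A minimal sketch of what such an opt-in parameter could look like, written against plain pandas Series rather than npTDMS channels. The function name combine_channels and the flag common_time_origin are hypothetical and are not part of the npTDMS API. Each input Series is assumed to already carry an absolute-time DatetimeIndex.

```python
import pandas as pd


def combine_channels(series_by_name, absolute_time=False, common_time_origin=False):
    """Combine per-channel Series (absolute DatetimeIndex) into one DataFrame.

    `common_time_origin` is a hypothetical opt-in flag used for illustration.
    """
    columns = {}
    for name, series in series_by_name.items():
        if not absolute_time and not common_time_origin:
            # Existing behaviour: seconds relative to this channel's own first sample
            relative = (series.index - series.index[0]).total_seconds()
            series = pd.Series(series.to_numpy(), index=relative)
        columns[name] = series
    df = pd.DataFrame(columns)
    if not absolute_time and common_time_origin:
        # Opt-in behaviour: seconds relative to the earliest sample of any channel
        df.index = (df.index - df.index.min()).total_seconds()
    return df


# Two synthetic channels; "fast" starts 2 s after "slow"
fast = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(
        ["2022-02-01 01:30:02", "2022-02-01 01:30:03", "2022-02-01 01:30:04"]
    ),
)
slow = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2022-02-01 01:30:00", "2022-02-01 01:30:02"]),
)

legacy = combine_channels({"fast": fast, "slow": slow})
synced = combine_channels({"fast": fast, "slow": slow}, common_time_origin=True)
print(legacy)  # first samples of both channels share the index value 0.0
print(synced)  # first sample of "fast" appears 2 s after the first of "slow"
```

Leaving the flag off by default keeps the current per-channel behaviour for existing callers, while callers who need synchronized channels can opt in. The sketch converts the relative index to float seconds in both modes so the two results have the same index type.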