adamreeve / npTDMS

NumPy based Python module for reading TDMS files produced by LabView

Home Page: http://nptdms.readthedocs.io


Reading huge amount of data takes a lot of time

TenchiMuyo1984 opened this issue

Hello,
I don't know why, but when I try to read out a huge amount of data, it takes several minutes.

from nptdms import TdmsFile

tdms_file = TdmsFile.open("file.tdms")

That takes a lot of time:

l_data = tdms_file[group_name][channel_name][0:5000000]

This takes a lot of time as well:

import tempfile

with tempfile.TemporaryDirectory() as temp_memmap_dir:
    tdms_file = TdmsFile.read("file.tdms", memmap_dir=temp_memmap_dir)

Is there a way to get the offset (address of a pointer) for every element in the file?
Like:

p_element = tdms_file[group_name][channel_name][0].tell()

Hi

The time it takes to read data will depend a lot on the TDMS file structure and the type of data being read, e.g. timestamp data is more complicated to read than plain floats, and data stored in many small segments will take longer to read. Interleaved data will also take longer than non-interleaved data because the data points for a single channel are not contiguous.

It's hard to say why your specific files take a long time to read without an example file to use for profiling. For example, if I read data from a file with a similar number of data points but only a single channel of floats, it takes < 1 second on my machine. Are you able to provide one of your files for testing?
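For reference, a minimal timing sketch along these lines (the file name and the per-channel loop are only illustrative) can help narrow down whether the time is spent parsing metadata or reading channel data:

import time
from nptdms import TdmsFile

start = time.perf_counter()
with TdmsFile.open("file.tdms") as tdms_file:
    print(f"metadata read in {time.perf_counter() - start:.2f} s")
    for group in tdms_file.groups():
        for channel in group.channels():
            t0 = time.perf_counter()
            data = channel[:]  # reads this channel's data from disk
            print(f"{group.name}/{channel.name}: {len(data)} values "
                  f"({channel.dtype}) in {time.perf_counter() - t0:.2f} s")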

There isn't a way to get the positions of all data points, and I'm doubtful that this would allow you to read the data any faster.

numpy/numpy#13319 indicates that using np.fromfile can be a lot slower than expected when reading data in small chunks in Python 3, so possibly using f.readinto(buffer) or np.frombuffer as suggested there may help improve performance.
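To illustrate the difference (this is only a sketch of the workaround discussed in that numpy issue, not npTDMS internals), the idea is to replace many small np.fromfile calls with either a readinto into a preallocated array or an np.frombuffer over raw bytes:

import numpy as np

def read_chunk_fromfile(f, count):
    # Can be slow on Python 3 when called many times with small counts
    return np.fromfile(f, dtype=np.float64, count=count)

def read_chunk_readinto(f, count):
    # Preallocate the array and let the file object fill its buffer in place
    data = np.empty(count, dtype=np.float64)
    f.readinto(data)
    return data

def read_chunk_frombuffer(f, count):
    # Read the raw bytes and reinterpret them (result is read-only)
    raw = f.read(count * np.dtype(np.float64).itemsize)
    return np.frombuffer(raw, dtype=np.float64)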

I've pushed a change to the fromfile_perf branch that uses file.readinto instead of np.fromfile. Are you able to test with that branch to see if it improves performance?
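If it helps, one way to install that branch for testing (assuming a git-based pip install works in your environment) is:

$ pip install git+https://github.com/adamreeve/npTDMS.git@fromfile_perf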

Hi,
npTDMS is one of two actively maintained TDMS Python packages, and the only one with partial read support. I did some benchmarking and compared this branch against the last release (1.3.1). I used two different files for testing; unfortunately I can't make them available, so I've added some meta info to give you a better idea of how they are structured. Test conditions:

  • Thinkpad T580
  • Windows 10
  • Defender enabled
  • Python 3.9.6
  • numpy 1.21.2

file: larger.tdms
groups: 1106
avg. ch/group: 20.0
filesize: 216.900 MB

npTDMS		1.3.1
$ python -m timeit --verbose -n 2 -r 3 -s "from nptdms import TdmsFile" "file = TdmsFile.read('larger.tdms')"
raw times: 201 sec, 203 sec, 169 sec

2 loops, best of 3: 84.6 sec per loop
npTDMS		090ed793271b6824b60105c198e55ef6be8f67b7
$ python -m timeit --verbose -n 2 -r 3 -s "from nptdms import TdmsFile" "file = TdmsFile.read('larger.tdms')"
raw times: 99.3 sec, 91.7 sec, 118 sec

2 loops, best of 3: 45.9 sec per loop

file: smaller.tdms
groups: 529
avg. ch/group: 20.0
filesize: 26.058 MB

npTDMS		1.3.1
$ python -m timeit --verbose -n 10 -r 5 -s "from nptdms import TdmsFile" "file = TdmsFile.read('smaller.tdms')"
raw times: 10.2 sec, 9.64 sec, 9.94 sec, 9.55 sec, 9.67 sec

10 loops, best of 5: 955 msec per loop
npTDMS		090ed793271b6824b60105c198e55ef6be8f67b7
$ python -m timeit --verbose -n 10 -r 5 -s "from nptdms import TdmsFile" "file = TdmsFile.read('smaller.tdms')"
raw times: 7.52 sec, 6.95 sec, 7.3 sec, 7.27 sec, 6.89 sec

10 loops, best of 5: 689 msec per loop

If you don't see any downside to this feature branch, then the numbers are definitely in its favor, and merging it would be much appreciated.

Hi @axel-kah, thanks for testing this. I also found some reasonable speed ups in my tests and don't see a reason not to make this change so will merge that branch.

A comment that might help the original poster:

I have used 2 GB TDMS files without any problems with npTDMS.
However, I noticed that depending on how the LabVIEW program is written, TDMS files can "get fragmented" (maybe this is interleaved data?) and this slows down reading a lot. An indication that you have such a fragmentation issue is that the TDMS index file is large (MB instead of KB).
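A quick way to check for that indicator (the file paths here are just placeholders) is to compare the size of the .tdms_index file to the data file:

import os

data_size = os.path.getsize("file.tdms")
index_size = os.path.getsize("file.tdms_index")
print(f"data: {data_size / 1e6:.1f} MB, index: {index_size / 1e6:.1f} MB")
# An index file in the MB range rather than KB suggests many small segments.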

I have the same issue with a 260 MB file: one group with 499 float channels, each 86400 rows long. On my HP Elite x2 laptop it takes ~90 sec per channel. It's probably a "bad" file issue.
The only solution I found is to use the multiprocessing Pool().map() method to read the channels in parallel.

from multiprocessing import Pool

def read_TDMS_Data_Parallel(channelList):
    # read_TDMS_Data (defined further down) loads one channel per worker process
    p = Pool()
    result = p.map(read_TDMS_Data, channelList)
    p.close()
    p.join()
    return result

I also don't use the built-in as_dataframe() method. I create the DataFrame from the list returned by the read_TDMS_Data_Parallel() function:

import pandas as pd

def store_DataFrame(Data, channelList, filePath):
    # Each entry of Data becomes one column; channelList must include 'Time'
    df = pd.DataFrame(data=Data).T
    df.columns = channelList
    df.set_index('Time', drop=True, inplace=True)
    df = df.convert_dtypes()
    df.to_pickle(filePath)
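Put together, the two helpers are used roughly like this (the channel names and output path are placeholders):

channelList = ['Time', 'Channel1', 'Channel2']
Data = read_TDMS_Data_Parallel(channelList)
store_DataFrame(Data, channelList, 'measurement.pkl')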

@pashaLyb

I have ~1 GB TDMS files I would like to read in as dataframes. I am currently using the built-in as_dataframe() method like so to read in only certain channels:

TdmsFile(file1).as_dataframe().iloc[:,channel_list]

I would like to read them in parallel using the multiprocessing package. What does the read_TDMS_Data function look like in your read_TDMS_Data_Parallel function?

@spri902 sorry for the late reply.

I had to define tdms_data_path and gr_name as globals because p.map() only passes a single argument, the one being mapped, to the function.

There are 499 channels in my TDMS file and I only read the ones that are relevant. That is why I'm using a predefined channelList:

from multiprocessing import Pool
from nptdms import TdmsFile

# tdms_data_path and gr_name are defined as module-level globals
def read_TDMS_Data(channelName):
    with TdmsFile.open(tdms_data_path) as tdms_file:
        data = tdms_file[gr_name][channelName][:]
        print(f'\n{channelName} is loaded')
        return data

def read_TDMS_Data_Parallel(channelList):
    p = Pool()
    result = p.map(read_TDMS_Data, channelList)
    p.close()
    p.join()
    return result
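A variant that avoids the module-level globals (not from this thread, just a sketch with placeholder names) is to bind the file path and group name with functools.partial, so p.map still receives a one-argument callable:

from functools import partial
from multiprocessing import Pool

from nptdms import TdmsFile

def read_channel(tdms_data_path, gr_name, channelName):
    # Open the file lazily in each worker and read a single channel
    with TdmsFile.open(tdms_data_path) as tdms_file:
        return tdms_file[gr_name][channelName][:]

def read_channels_parallel(tdms_data_path, gr_name, channelList):
    # On Windows, the call to this function must sit under
    # if __name__ == "__main__":
    with Pool() as p:
        return p.map(partial(read_channel, tdms_data_path, gr_name), channelList)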

A comment that might help the original poster:

I have used 2 GB TDMS files without any problems with npTDMS. However, I noticed that depending on how the LabVIEW program is written, TDMS files can "get fragmented" (maybe this is interleaved data?) and this slows down reading a lot. An indication that you have such a fragmentation issue is that the TDMS index file is large (MB instead of KB).

My TDMS files are indeed very fragmented.