Fast way of reading some chunks from a tdms file
Alfaena opened this issue · comments
First of all it is my first post on Github. I will try to be as clear as possible
I have a big tdms file (more than 8 Gbits). To avoid completely filling my memory, I am reading the file and applying a treatment chunk by chunk.
Sometimes I just want to read a part of my file.
For example, I have a file with 82295 chunks of 10000 points. I would like to read only chunks 20000 to 25000 and save the data in a variable.
This is an example of what I have done using the islice iterator tool in order to use only a slice of the iterator generated by the data_chunks function of npTDMS. I cannot attach an example tdms file because it is too big.
# Module
from nptdms import TdmsFile
from itertools import islice  # allows taking a slice from an iterator object
import numpy as np
# Inputs
path_tdms = 'test.tdms'
chunk_first = 20000 #first chunk to read
chunk_last = 25000  # last chunk to read
group_name = 'data'
channel_name = 'V1'
# Script
print('Reading between chunk %i and %i' % (chunk_first, chunk_last - 1))
with TdmsFile.open(path_tdms) as tdms_file:
    part_tdms = islice(tdms_file.data_chunks(), chunk_first, chunk_last)
    data_int = []
    for chunk in part_tdms:
        channel_chunk = chunk[group_name][channel_name]
        data = channel_chunk[:]
        data_int.append(data)
    data = np.hstack(data_int)
So right now it is working: I am only storing the data of the chunks between 20000 and 25000. But with islice I am still iterating over all the chunks before 20000, which can be quite slow sometimes.
Here, finally, is my question: is it possible to skip the first 20000 iterations?
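(To illustrate why this is slow, here is a minimal standalone sketch, not specific to npTDMS: islice does not jump ahead in the underlying iterator, it still consumes every item before the start index.)

```python
from itertools import islice

def counting_gen(n, counter):
    # Generator that records how many items it has actually produced
    for i in range(n):
        counter[0] += 1
        yield i

counter = [0]
sliced = list(islice(counting_gen(100, counter), 90, 95))
print(sliced)      # [90, 91, 92, 93, 94]
print(counter[0])  # 95 -- the first 90 items were still generated, then discarded
```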
I have seen here on Stack Overflow, and in this issue, some ways of doing it using seek, but I do not understand how it is possible to do this with chunks, given the structure of a tdms file.
If this is not clear or you need more information, I can provide it.
Hi @Alfaena, rather than working with the data chunks from the file, it might be simpler to use npTDMS's built-in support for lazily reading a slice of data from a specific channel. Eg. you should be able to do something like this to read only the data you want:
from nptdms import TdmsFile
# Inputs
path_tdms = 'test.tdms'
chunk_first = 20000 # first chunk to read
chunk_last = 25000 # last chunk to read
chunk_size = 10000
group_name = 'data'
channel_name = 'V1'
# Script
print('Reading between chunk %i and %i' % (chunk_first, chunk_last - 1))
with TdmsFile.open(path_tdms) as tdms_file:
    channel = tdms_file[group_name][channel_name]
    data = channel[chunk_first * chunk_size:chunk_last * chunk_size]
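(As a sanity check, using the numbers from this thread: the slice bounds above map back exactly onto whole chunks, since chunk k covers samples k * chunk_size up to but not including (k + 1) * chunk_size.)

```python
chunk_first = 20000
chunk_last = 25000
chunk_size = 10000

start = chunk_first * chunk_size  # first sample of chunk 20000
stop = chunk_last * chunk_size    # first sample of chunk 25000 (exclusive)

print(start, stop)                   # 200000000 250000000
print((stop - start) // chunk_size)  # 5000 chunks covered
```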
That said, if there were really a need to read specific file chunks, that wouldn't be too hard to implement.
Hi @adamreeve,
Yes, you are right, there is that way of doing it. I remember trying it a long time ago and it took ages to open a file, but it turns out I was wrong.
I tried your way and it works; it significantly reduces the time!
Thanks a lot for your answer!
Great, I'm glad there was a simple solution! I made some changes to improve performance in this area a while ago (eg. #203), so it could be that you previously tried this with an older version, before those changes.