Fast way of reading some chunks from a tdms file
Alfaena opened this issue · comments
First of all it is my first post on Github. I will try to be as clear as possible
I have a big tdms file (more than 8 Gbits). To avoid completely filling my memory, I am reading the file and applying a treatment chunk by chunk.
Sometimes I just want to read a part of my file.
For example, I have a file with 82295 chunks of 10000 points. I would like to read only chunks 20000 to 25000 and save the data in a variable.
This is an example of what I have done using the islice iterator tool in order to use only a slice of the iterator generated by the data_chunks function of npTDMS. I cannot attach an example tdms file because it is too big.
# Module
from nptdms import TdmsFile
from itertools import islice  # allows taking a slice from an iterator object
import numpy as np
# Inputs
path_tdms = 'test.tdms'
chunk_first = 20000 #first chunk to read
chunk_last = 25000  # last chunk to read
group_name = 'data'
channel_name = 'V1'
# Script
print('Reading between chunk %i and %i' % (chunk_first, chunk_last - 1))
with TdmsFile.open(path_tdms) as tdms_file:
    part_tdms = islice(tdms_file.data_chunks(), chunk_first, chunk_last)
    data_int = []
    for chunk in part_tdms:
        channel_chunk = chunk[group_name][channel_name]
        data = channel_chunk[:]
        data_int.append(data)
    data = np.hstack(data_int)
So right now it is working: I am only storing the data of the chunks between 20000 and 25000. But with islice I am still iterating over all the chunks before 20000, which can be quite slow sometimes.
Here, finally, is my question: is it possible to skip the first 20000 iterations?
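(To illustrate why this is slow, here is a minimal standalone sketch, not specific to npTDMS: islice does not jump ahead in the underlying iterator, it still consumes every item before the start index.)

```python
from itertools import islice

def counting_gen(n, counter):
    # Generator that records how many items it has actually produced
    for i in range(n):
        counter[0] += 1
        yield i

counter = [0]
sliced = list(islice(counting_gen(100, counter), 90, 95))
print(sliced)      # [90, 91, 92, 93, 94]
print(counter[0])  # 95 -- the first 90 items were still generated, then discarded
```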
I have seen here on Stack Overflow, and in this issue, some ways of doing it using seek, but I do not understand how it is possible to do this with chunks, given the structure of a tdms file.
If this is not clear or you need more information, I can provide it.
Hi @Alfaena, rather than working with the data chunks from the file, it might be simpler to use npTDMS's built-in support for lazily reading a slice of data from a specific channel. Eg. you should be able to do something like this to read only the data you want:
from nptdms import TdmsFile
# Inputs
path_tdms = 'test.tdms'
chunk_first = 20000 # first chunk to read
chunk_last = 25000 # last chunk to read
chunk_size = 10000
group_name = 'data'
channel_name = 'V1'
# Script
print('Reading between chunk %i and %i' % (chunk_first, chunk_last - 1))
with TdmsFile.open(path_tdms) as tdms_file:
    channel = tdms_file[group_name][channel_name]
    data = channel[chunk_first * chunk_size:chunk_last * chunk_size]
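(As a sanity check, using the numbers from this thread: the slice bounds above map back exactly onto whole chunks, since chunk k covers samples k * chunk_size up to but not including (k + 1) * chunk_size.)

```python
chunk_first = 20000
chunk_last = 25000
chunk_size = 10000

start = chunk_first * chunk_size  # first sample of chunk 20000
stop = chunk_last * chunk_size    # first sample of chunk 25000 (exclusive)

print(start, stop)                   # 200000000 250000000
print((stop - start) // chunk_size)  # 5000 chunks covered
```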
That said, if there were really a need to read specific file chunks, that wouldn't be too hard to implement.
Hi @adamreeve,
Yes, you are right, there is that way of doing it. I remember trying it a long time ago and it took ages to open a file, but it turns out I was wrong.
I tried your way and it works; it significantly reduces the time!
Thanks a lot for your answer!
Great, I'm glad there was a simple solution! I made some changes to improve performance in this area a while ago (eg. #203), so it could be that you previously tried this with an older version, before those changes.