sccn / xdf

Feature request: read headers only

dojeda opened this issue · comments

I am sorry for filing so many issues at once. What I really wanted was to implement the feature explained in this issue, but while doing so I ran into the problems/discussion points filed in #34, #35 and #36.

The feature I would like to propose is an extension to load_xdf that is significantly faster because it only reads the headers.

My use case is the following: I am importing all my XDF files into another database/filesystem and I want to save the information present in the headers in a database (hence my discussion in #36). The problem is that I have a lot of files, and some of them are big (about 3 GB, probably recordings that were started and then forgotten, though they could also be long recording sessions).

The way I plan to implement this is to use the tag that identifies the chunk type and the chunk size in bytes (https://github.com/sccn/xdf/wiki/Specifications#chunk). When the tag is not FileHeader (1) or StreamHeader (2), I move the file pointer to the beginning of the next chunk.

I managed to achieve this with the following code:

def load_xdf(filename,
             ...,
             headers_only=False):

           ...

            # read [Tag]
            tag = struct.unpack('<H', f.read(2))[0]
            log_str = ' Read tag: {} at {} bytes, length={}'.format(tag, f.tell(), chunklen)
            if tag in [2, 3, 4, 6]:
                StreamId = struct.unpack('<I', f.read(4))[0]
                log_str += ', StreamId={}'.format(StreamId)

            logger.debug(log_str)
            # ^^^ Keeping this code to show a reference of where the modification goes

            # Quick read of headers only: when the chunk is not a header, move
            # the file pointer to the beginning of the next chunk
            if headers_only and tag not in (1, 2):
                offset = 2   # We already read 2 bytes for the tag
                if tag in (2, 3, 4, 6):
                    # In these cases, we already read 4 bytes!
                    offset += 4
                # Move n=chunklen-offset bytes forward, relative to current position (whence=1)
                f.seek(chunklen - offset, 1)
                continue

With this modification, I can cut the read time of a 3 GB file from about 2½ minutes down to 5 seconds. Considering I have several hundred files (not all that big, though), this saves quite a lot of time.
If you guys agree with this modification, I would gladly make a PR for it.
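
For completeness, the same chunk-walking idea can be written as a self-contained function. The sketch below is only an illustration of the approach (read_xdf_headers is a made-up name, not the actual patch); it assumes a well-formed file and the chunk layout from the spec, where the length field counts the 2-byte tag plus the content:

import struct
import xml.etree.ElementTree as ElementTree

def read_xdf_headers(filename):
    # Collect FileHeader/StreamHeader XML without reading any sample data
    # (illustrative sketch only).
    headers = []
    with open(filename, 'rb') as f:
        if f.read(4) != b'XDF:':
            raise ValueError('not an XDF file')
        while True:
            raw = f.read(1)                          # read [NumLengthBytes]
            if not raw:                              # end of file
                break
            nbytes = struct.unpack('B', raw)[0]
            if nbytes == 1:                          # read [Length]
                chunklen = struct.unpack('B', f.read(1))[0]
            elif nbytes == 4:
                chunklen = struct.unpack('<I', f.read(4))[0]
            else:                                    # nbytes == 8
                chunklen = struct.unpack('<Q', f.read(8))[0]
            tag = struct.unpack('<H', f.read(2))[0]  # read [Tag]
            if tag == 1:                             # FileHeader: content is XML
                headers.append((None, ElementTree.fromstring(f.read(chunklen - 2))))
            elif tag == 2:                           # StreamHeader: StreamId (4 bytes) + XML
                stream_id = struct.unpack('<I', f.read(4))[0]
                headers.append((stream_id, ElementTree.fromstring(f.read(chunklen - 6))))
            else:                                    # anything else: skip the content in one seek
                f.seek(chunklen - 2, 1)
    return headers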

In general I very much like this idea.

However, I've been told (though I've never used this feature myself) that LabRecorder supports adding a new stream after recording has already started, which would put the header for that stream somewhere in the middle of the file. Do you think you can accommodate that?

Thanks for the input @cboulay.

I agree with this point: headers can be in any part of the file. In my very first approach, I used the _scan_forward function to move between chunk boundaries (which is why I filed #34).
It worked well for my case because my headers are always at the beginning of the file. However, if a header appeared anywhere between two boundaries, I would have missed it.

The second approach, the one I am proposing above, accommodates your use case: the file pointer is simply moved to the beginning of the next chunk, so every chunk's tag is still inspected.

Please correct me if I'm wrong: if one adds a new stream after the recording has already started, LabRecorder would write a new chunk with tag 2 (StreamHeader) containing its header info.
However, this does not work if the StreamHeader chunk is embedded inside the data chunk of another stream; I believe that case would lead to a corrupted file.
In other words, are the threads that write to the file protected by some mutex?
I'll try reading the LabRecorder source code to confirm this.

I am not saying it would be useful to have headers embedded in data; I was just commenting on (and probably misunderstanding) Chadwick's remark about having new streams in the middle of the file. As I said before, this case is handled by my proposal since I am following the specification: each chunk is read (at the very minimum its size and tag), but the content is skipped when it's not a header chunk.

I've implemented this functionality in parse_xdf (currently only in this Gist, which might later become part of some package, depending on where it fits best).

Yes, we should definitely consolidate our efforts - it doesn't make sense if everyone develops the same features over and over. I saw that you really only read the headers, without the XML in the stream header. In my version, I do read the XML and parse certain things, but not the raw data samples; I found that reading the samples is the step that takes the time, everything else is very fast.

Furthermore, you introduced an option to select an XML to dict converter - did you have problematic use cases where _xml2dict did not work?

I agree with you for a consolidated effort.

I don't really understand why you think that I only read the headers without the XML in the stream. That is not what's going on in my commit from Feb 26th: dojeda@e209638
When the tag is a header (tag 1 or 2), this quick-skip part is not executed, so the header XML is read and parsed as usual. In other words, only the data chunks are caught here and skipped with a file seek to move forward.

I read your gist: how do you propose to integrate it with pyxdf?
I feel the original pyxdf code may not be the most organized, but I would strongly prefer to keep it as intact as possible for a couple of reasons: there is a lot of know-how embedded in the original code (a complete rewrite may reintroduce old bugs and introduce new ones), and it keeps backward compatibility.

Concerning the XML to dict converter, I don't have a specific problem, but more of a disagreement on how this information is decoded and how that makes the code that uses it less readable and, especially, inconvenient.
I introduced a different approach to this problem as a possible implementation for #36. During my usage and maintenance of code that uses pyxdf, I find a lot of unnecessary list operations (see the original issue for an example) that are the consequence of pyxdf implementing its own XML to dict decoder (which is actually a hard task because there is no bijection between XML and JSON). This is why I propose an option where the XML is decoded following a standard approach (the Parker convention). Unfortunately, some temporary references and bookkeeping fail with this new approach, and again, I want to keep backward compatibility, so this XML decoding approach is optional in my fork (and the original one is still kept for the bookkeeping code).
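
To make the difference concrete, here is a minimal sketch of what a Parker-style conversion looks like (parker_convert is just an illustrative name, not the code in my fork):

import xml.etree.ElementTree as ElementTree

def parker_convert(element):
    # Parker-style XML to dict: child tags become keys, repeated tags become
    # lists, and leaf text stays a plain string (illustrative sketch only).
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = parker_convert(child)
        if child.tag in result:
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]   # promote to a list on repetition
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

info = parker_convert(ElementTree.fromstring(
    '<info><name>EEG</name><channel_count>8</channel_count></info>'))
print(info['name'])   # 'EEG', instead of the list-wrapped ['EEG'] that the current decoder produces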

Finally, I am still not filing a PR for the issues I filed a month ago because only one of them (this one in particular) got a bit of traction, and I still have a new feature to implement concerning keeping the original timestamps even when clock synchronization and dejittering are applied. I really need this feature because I have found some files where the timestamps have nasty jitter, and while the dejitter algorithm does a great job of reconstructing a sensible timeline, we need to apply our own correction in some cases (i.e. we want to use pyxdf's dejittering most of the time, and other times we want to roll our own correction).
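
To illustrate the kind of option I have in mind (the parameter and key names below are hypothetical, nothing like this exists in pyxdf yet):

import pyxdf

# Hypothetical option: keep the raw timestamps next to the corrected ones.
streams, file_header = pyxdf.load_xdf('recording.xdf',
                                      keep_original_timestamps=True)  # made-up parameter
original = streams[0]['time_stamps_original']   # made-up key: timestamps as written to the file
corrected = streams[0]['time_stamps']           # after clock sync and dejittering, as today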

I don't really understand why you think that I only read the headers without the XML in the stream. That is not what's going on in my commit from Feb 26th: dojeda/xdf@e209638
When the tag is a header (tag 1 or 2), this quick-skip part is not executed, so the header XML is read and parsed as usual. In other words, only the data chunks are caught here and skipped with a file seek to move forward.

Alright, I missed the StreamHeader (2) part.

I read your gist: how do you propose to integrate it with pyxdf?
I feel the original pyxdf code may not be the most organized, but I would strongly prefer to keep it as intact as possible for a couple of reasons: there is a lot of know-how embedded in the original code (a complete rewrite may reintroduce old bugs and introduce new ones), and it keeps backward compatibility.

Good question, I don't know yet. We could either modify load_xdf (as per your suggestion) or include parse_xdf. I find the chunk-based way of aggregating the data useful (e.g. for debugging purposes), which I think cannot be done with load_xdf, so it might be worth having both functions.

Regarding the XML parser, I think this should be discussed in a separate issue. Same for the original timestamps.

Since this repo has moved to https://github.com/xdf-modules/xdf-Python, we should also continue our discussion of this issue there.

I didn't know this repo had moved elsewhere... I have been filing issues for the last couple of months and did not find any information about this.
I can move them along with the discussion if needed...
Perhaps a maintainer can point to or explain the roadmap for pyxdf?
@chkothe

@cboulay is the current maintainer AFAIK

It hasn't moved yet. I made a pull request proposing that it be moved, and for that pull request to work I had to copy the code into the component submodules. Please take a look at the PR and let me know what you think. So far I've only received a little bit of feedback, but it's been positive!

Note that sccn/xdf remains and is still the home: any documents related to the XDF file spec will stay here. What has moved is the Python code, the MATLAB code, and the new addition of the C++ libxdf, though the Python and MATLAB code are still linked here via submodules.

This separation is desirable for several reasons:

  • we can collect 'unofficial' tools related to xdf (e.g. libxdf)
  • we can give contributors access so they can update the code faster than it has been updated so far (this is a big problem for xdf)
  • however, changes in the submodules won't be reflected in upstream sccn/xdf until someone updates the submodule pointers (which requires write access to sccn/xdf), so there's no risk of contributors breaking something 'official'.

There are also some conveniences for build systems and continuous integration but those are relatively minor as long as the only 'testable' part is Python.