equinor / segyio

Fast Python library for SEGY files.

TraceHeader duplication is time consuming

seisgo opened this issue · comments

segyio version: 1.8.6

I created a new SEGY file based on an existing one, duplicating the trace headers from the original. However, this process is time-consuming, especially for big data. For example, the existing SEGY file is 5.8 GB; after about 4 hours, the new file had reached only about 1 GB. The time consumed depends on the trace count of the SEGY file, here 641 × 1851 = 1,186,491 traces.

Following is part of my code:

import segyio

with segyio.open(args['input'], 'r') as iFile:
    spec = segyio.tools.metadata(iFile)

    # create output file
    with segyio.create(args['output'], spec) as oFile:
        for i in range(1+spec.ext_headers):
            oFile.text[i] = iFile.text[i]
        oFile.bin = iFile.bin
        oFile.header = iFile.header

This code is almost the same as the 'create' example in the segyio documentation.

So, is this time-consuming process normal?
If not, could anyone suggest a better way to copy trace headers when creating a new SEGY file?

Thanks!

No, it's not normal at all, and your code looks perfectly fine. Does your system normally have such poor write throughput? Do any other segyio versions behave differently?

I've not observed header-copy being as crazy slow as you describe, (un)fortunately. Does writing traces, or even smaller files, go as poorly?

That being said, if you're not planning on modifying the headers as you go, a better way of approaching this problem is to first do a regular file copy, e.g. through shutil.copyfile, and then open the newly-created file with a regular open(mode='r+'). That way, segyio should at least not slow down on header-copy, and you get to see if writing traces is as bad.
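
A minimal sketch of that approach (reading "regular open" as segyio.open in 'r+' mode, and assuming the same args dict as in your snippet):

import shutil
import segyio

# plain byte-for-byte copy first - as fast as the system can go
shutil.copyfile(args['input'], args['output'])

# then open the copy in place; headers and traces are already correct copies
with segyio.open(args['output'], 'r+') as oFile:
    ...  # modify only what actually needs to change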

@jokva Thanks for your quick reply!

I have done trace initialization as follows, and it's not time-consuming.

import numpy as np

# Init trace data of output file object (sizes taken from the spec)
nXline, nSample = len(spec.xlines), len(spec.samples)
for il in spec.ilines:
    oFile.iline[il] = np.zeros((nXline, nSample), dtype=np.single)

I will check some other versions of segyio.

Initially, I used copyfile to duplicate the SEGY file, but at first glance it also seemed time-consuming for big data, so I then tried segyio.create. I haven't compared the time consumed by the two operations, though. Thanks for your advice, I will check that as well.

Initially, I used copyfile to duplicate the SEGY file, but at first glance it also seemed time-consuming for big data

It is, but that's just the performance of your system - copyfile is naively copying bytes, which is the fastest it can be. segyio can only beat that if you need to modify the data, because then things are only written once.

I have done trace initialization as follows, and it's not time-consuming.

Ok, good - batch inline writing like you do here is typically the fastest operation in segyio, so that's at least something.

Do you mind unrolling the header copy like this, and then getting a feeling for when or where it slows down?

from tqdm import tqdm

for i in tqdm(range(src.tracecount)):
    dst.header[i] = src.header[i]

@jokva
Following your advice, I used tqdm to monitor the process. During duplication, at about 3% progress the speed drops from 300 it/s to 60 it/s, and it keeps dropping; at about 5% progress it is down to 30 it/s.

What's more, this phenomenon does not depend on the segyio version; I have checked 1.8.3, 1.8.6 and 1.8.8.

I then switched to shutil.copyfile, and the duplication took only about 4 minutes.


To note, all of the above tests were run on Linux (RHEL 6.5), in an Anaconda environment (conda/4.7.12 requests/2.22.0 CPython/3.6.9 Linux/2.6.32-431.el6.x86_64 rhel/6.5 glibc/2.12).

However, on my Windows 10 PC, the same trace header duplication takes about half an hour, which is acceptable.

In a word, copyfile performs best.

Thanks for your reply.

Hi,

I ran some experiments, but was unable to reproduce the poor performance on my (Debian 9) machine. I'll keep looking into it, but I found no obvious reason (in the code) why writing multiple headers should be slow.

For what it's worth, I used tqdm, and saw a throughput of ~5000 iterations per second.

Hi @jokva, we can confirm this on Ubuntu 18.04 with segyio==1.9.0, running:

def create_empty_segy_(segy, initial_segy_filename):
    spec = segyio.tools.metadata(segy)
    spec.samples = spec.samples[:1]  # keep only one sample per trace
    with segyio.create(initial_segy_filename, spec) as dst:
        dst.text[0] = segy.text[0]
        dst.bin = segy.bin
        dst.bin.update(hns=len(spec.samples))
        dst.header = segy.header
        dst.trace = segy.trace

It takes a long time to copy the headers. For a large SEGY (700 GB+), it had not completed after 2 days, and read throughput drops to a few Mbit/s.
We do not want a copy of the SEGY, just to keep the trace header info for future use.
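
For context, what we're after is closer to this rough sketch - dumping a few header words per trace to a NumPy array instead of writing a full SEGY copy (the field choices and file names here are just examples):

import numpy as np
import segyio

with segyio.open('large.sgy', 'r', ignore_geometry=True) as f:
    fields = [segyio.TraceField.INLINE_3D, segyio.TraceField.CROSSLINE_3D]
    # one row per trace, one column per header word
    headers = np.array([[f.header[i][field] for field in fields]
                        for i in range(f.tracecount)])

np.save('headers.npy', headers)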

Ok, interesting. There's obviously something going on, even though I haven't been able to reproduce it yet. I'll take a new look into it.

Out of curiosity, is it slow if you're just walking the headers too, and not writing them back to disk?
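
Something like this would do, reusing the names from the earlier snippet (a sketch, just reading and discarding each header):

from tqdm import tqdm

for i in tqdm(range(src.tracecount)):
    _ = src.header[i]  # read-only: nothing is written back to disk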

@seisgo So I'm still not able to reproduce this, but I would like you to test something for me. What happens if you change your code to:

with segyio.open(args['input'], 'r') as iFile:
    spec = segyio.tools.metadata(iFile)

    # create output file
    with segyio.create(args['output'], spec) as oFile:
        for i in range(1+spec.ext_headers):
            oFile.text[i] = iFile.text[i]
        oFile.bin = iFile.bin
        oFile.header[-1] = {}  # write the last header first
        oFile.header = iFile.header

Second-to-last line, i.e. write the last header first, then copy them all. Writing the last header first extends the file to its full size up front, so the later writes don't repeatedly grow the file. Is it as slow?

@jokva Sorry for the late reply.
I have tested as you advised.
To my surprise, this time the trace header copying speed is normal whether 'oFile.header[-1] = {}' is there or not. I am really confused.
The segyio version is 1.8.3, still running on my RHEL computer. For a 5.8 GB seismic file, the copying process takes about 4 minutes; with tqdm monitoring, the iteration speed is 4000+ it/s.

@jokva
I found the cause of this anomaly. The copying speed depends on the output directory. If it is a subdirectory of /home, copying works well. If it is on a mounted hard disk, the speed is slow and keeps decreasing.

Please check whether it is the same on your platform.

What kind of mounted hard disk is it? Does it happen to be a network drive (maybe NFS)? Are there any particular mount opts?

What if you do the oFile.header[-1] trick?

@jokva
The mount type is NTFS-3G, with no particular opts.

I have tested the oFile.header[-1] trick; it makes no difference.

Is it mounted with sync, by any chance? Check with mount | grep path, where path is the drive root path, e.g. /media/external

A couple of other experiments would be interesting too - I'm afraid I can't debug it as I don't have an NTFS drive about.

Try writing both headers and traces in tandem:

for i in range(src.tracecount):
    # interleaving header and trace writes keeps the file growing sequentially
    dst.header[i] = src.header[i]
    dst.trace[i] = src.trace[i]

If this makes it fast again, it's probably the seeking past unwritten data that tanks the performance.

edit: also, how is CPU usage? Is it high? Does ntfs-3g use a lot of CPU, or is it mostly idle?

edit2:
https://www.tuxera.com/community/ntfs-3g-faq/#slow

A highly sparse file is being regularly written or updated.

Which is exactly what you're doing.
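
If you want to confirm the file really is sparse, a quick check of allocated blocks versus apparent size should tell (a sketch; the path is just an example, and st_blocks is Unix-only):

import os

st = os.stat('/diska/output.sgy')
allocated = st.st_blocks * 512  # st_blocks counts 512-byte units
print('sparse' if allocated < st.st_size else 'fully allocated')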

@jokva
The slow copying operation writes to the directory /diska. The output of mount | grep diska is
/dev/sda1 on /diska type fuseblk (rw,allow_other,blksize=4096)

In comparison, the output of 'mount | grep home' is
/dev/sdc2 on /home type ext4 (rw)

Yes, writing both headers and traces in tandem makes a difference.

When the output file is in an ntfs-3g mounted directory, the copying speed is about 1000 it/s, and CPU usage is about 50%.

When the output file is in an ext4 mounted directory (/home), the speed is a bit faster, about 2000 it/s with fluctuation, and CPU usage is higher than 50%, sometimes approaching 100%.

It's not a great sign that CPU utilisation is so high even on ext4, but having >50% better performance on ext4 over ntfs-3g is expected.

I don't think this is a problem with segyio, other than maybe documentation; it's a consequence of ntfs-3g. I'll get around to adding a FAQ entry for slow sparse writes, which are not obvious from the dst.header = src.header syntax.

Yes, from the above tests, it is not a problem with segyio.

Maybe I didn't make myself clear: the copying speed on ext4 is about twice that on ntfs-3g.

I've added a proposed fix for this in the readme: #434

I'll consider this resolved, or at least the most likely cause identified, and it is outside of segyio.