GZ compressed input

hannesbecher opened this issue 3 years ago

Hi,

Thanks for making this tool available! Would it be possible to make it run on gz-compressed files? Input via process substitution, à la
respect -i <(zcat reads.fq.gz)
does not seem to work.

Many thanks,

Hannes

Shahab Sarmashghi · Answer 1 · Tue Feb 16 2021 08:38:38 GMT+0800 (China Standard Time)

Hi Hannes,

Thank you for using RESPECT and for your feedback. I added support for gzipped FASTQ and FASTA files. Please try it and let me know if you run into any issues.

Bests, Shahab.

Hannes Becher · Answer 2 · Tue Feb 16 2021 22:42:51 GMT+0800 (China Standard Time)

Hi Shahab,

Thanks for implementing this! I just tried it and I get an error. It looks like you are calling my installed version of gzip, which does not accept the same options as the one on your system. I'm using a server running Scientific Linux 7.8. Perhaps using Python's gzip library would be possible as an alternative?

Error:

(RESPECT) bash-4.2$ time respect -i E001fw.fq.gz -o . --threads 20
2021-02-16 14:29:54,778 INFO:Processing E001fw.fq.gz...
gzip: invalid option -- 'k'
Try `gzip --help' for more information.
2021-02-16 14:30:26,974 INFO:compute_kmer_histogram finished in 0.3333916664123535 seconds
2021-02-16 14:30:26,974 ERROR:Error occurred when processing /disk2/hbecher_tmp/RESPECTanalyses/E001fw.fq.gz; it's skipped
Traceback (most recent call last):
  File "/localdisk/home/hbecher/miniconda2/envs/RESPECT/lib/python3.8/site-packages/respect-0.0.1-py3.8.egg/respect/respect_functions.py", line 245, in run_respect
    parameter_estimator.set_kmer_histogram(args.threads)
  File "/localdisk/home/hbecher/miniconda2/envs/RESPECT/lib/python3.8/site-packages/respect-0.0.1-py3.8.egg/respect/paramter_estimator.py", line 212, in set_kmer_histogram
    self.compute_kmer_histogram(n_threads)
  File "/localdisk/home/hbecher/miniconda2/envs/RESPECT/lib/python3.8/site-packages/respect-0.0.1-py3.8.egg/respect/timer.py", line 68, in wrapper_timer
    return func(*args, **kwargs)
  File "/localdisk/home/hbecher/miniconda2/envs/RESPECT/lib/python3.8/site-packages/respect-0.0.1-py3.8.egg/respect/paramter_estimator.py", line 171, in compute_kmer_histogram
    profiler_output = kmer_profiler(self.input_file, self.sequence_type, self.output_name, self.tmp_dir,
  File "/localdisk/home/hbecher/miniconda2/envs/RESPECT/lib/python3.8/site-packages/respect-0.0.1-py3.8.egg/respect/profiling.py", line 91, in kmer_profiler
    os.remove(input_file.rsplit('.gz', 1)[0])
FileNotFoundError: [Errno 2] No such file or directory: '/disk2/hbecher_tmp/RESPECTanalyses/E001fw.fq'
ValueError: Number of processes must be at least 1

Thanks v much,
Hannes

Hannes Becher · Answer 3 · Tue Feb 16 2021 22:59:35 GMT+0800 (China Standard Time)

A quick fix for me would be to change cmd in profiling.py, line 78, to something that runs zcat [input file.gz] > [input file].

But I don't know if zcat is present on all systems.

So using a python library might still be better.
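For illustration, a minimal sketch of what a pure-Python replacement for the zcat call could look like, using only the standard library. The function name gunzip_to_file and its parameters are hypothetical and are not taken from RESPECT's profiling.py.

```python
import gzip
import shutil


def gunzip_to_file(gz_path, out_path, chunk_size=1024 * 1024):
    """Write the decompressed content of gz_path to out_path.

    Hypothetical helper: roughly equivalent to `zcat gz_path > out_path`,
    but it does not depend on any gzip or zcat binary being installed.
    """
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj moves the data in blocks of chunk_size bytes
        shutil.copyfileobj(src, dst, chunk_size)
    return out_path
```

The original .gz file is left in place, which is the behaviour that gzip -k was presumably meant to provide.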

Shahab Sarmashghi · Answer 4 · Thu Feb 18 2021 02:08:43 GMT+0800 (China Standard Time)

Hi Hannes,

There is a new release in which you can use the --decomp option to specify a Python library (zlib or gzip) for decompression instead of the system's command-line gzip. Both are standard Python libraries, but I prefer the zlib implementation because it reads the input in chunks and does not load the entire file into memory, which is not the case with the gzip option. Still, I imagine that command-line gzip should be the most efficient option. It seems that you need gzip version 1.6 or later to use it with the -k option.
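As a rough sketch of the chunked approach described above, assuming a single-member gzip file: the generator below feeds fixed-size blocks of compressed bytes to zlib so that only one block is held in memory at a time. The name iter_gunzip_chunks is hypothetical, and this is not necessarily how --decomp zlib is implemented in RESPECT.

```python
import zlib


def iter_gunzip_chunks(gz_path, chunk_size=1024 * 1024):
    """Yield decompressed bytes from a gzip file, one block at a time.

    Hypothetical helper. Assumes a single-member gzip file; data after
    the first member ends up in unused_data and is ignored here.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    with open(gz_path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = decomp.decompress(chunk)
            if data:
                yield data
    # Emit whatever is still buffered inside the decompressor
    tail = decomp.flush()
    if tail:
        yield tail
```

Each yielded block can be written to the temporary FASTQ/FASTA file as it arrives, so peak memory depends on chunk_size and the compression ratio, not on the total file size.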

Hannes Becher · Answer 5 · Thu Feb 18 2021 05:17:52 GMT+0800 (China Standard Time)

Hi Shahab,
Thanks, --decomp zlib works for me! I was not aware of the zlib library, good point about the memory.
This is done as far as I am concerned.
Cheers, Hannes