shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Home Page: https://bioinf.shenwei.me/seqkit

Repository from GitHub: https://github.com/shenwei356/seqkit

seqkit grep read name partial match

avilella opened this issue

Can seqkit grep find partial matches of the read id in a fastq.gz input file?

For example, for a fastq.gz file with read id like the one below:

@4b8af4c8-779b-45cd-9bce-7195b765bb06 runid=c4c0924f086b14c61cf8eb6b1607978c6a9990bd read=68104 ch=2069 start_time=2024-07-30T06:12:28.276857+01:00 flow_cell_id=PAW35759 protocol_group_id=LSK114-SB-B1350 sample_id=SB-B1350 parent_read_id=4b8af4c8-779b-45cd-9bce-7195b765bb06 basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.3.0

Could I have a list of read IDs in a file search.txt that look like the string before runid=..., e.g.:

4b8af4c8-779b-45cd-9bce-7195b765bb06

Thanks

4b8af4c8-779b-45cd-9bce-7195b765bb06 is exactly the default sequence ID, so just use the default options.

seqkit grep -f search.txt reads.fq.gz -o out.fq.gz
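
For reference, the default sequence ID is the header text up to the first whitespace, without the leading @. Below is a minimal sketch of that rule in plain shell, with no seqkit involved. The function name default_id is made up for illustration, and the header is shortened from the example above.

```shell
# Hypothetical helper (not part of seqkit): print the default-style ID of a
# FASTQ header line, i.e. the text between the leading '@' and the first
# whitespace.
default_id() {
    printf '%s\n' "$1" | awk '{ sub(/^@/, "", $1); print $1 }'
}

# Shortened copy of the header from the question.
header='@4b8af4c8-779b-45cd-9bce-7195b765bb06 runid=c4c0924f086b14c61cf8eb6b1607978c6a9990bd read=68104 ch=2069'

default_id "$header"
```

If a partial match against the whole header were needed instead, adding -n (match by full name) and -r (treat patterns as regular expressions) to seqkit grep would be the usual route.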

It's very strange. Please provide the seqkit version,

seqkit version

and a few records of the reads file as an attached file.

seqkit head -n 10 PAW35759_pass_83522a2c_c4c0924f_56.fastq.gz -o test.fq.gz

Sorry, I mean FASTQ records, not just the sequence ID.

test.fq.gz

I also tried with seqkit v2.8.2, same result.

I have no idea now. It worked for me.

$ seqkit seq -ni test.fq.gz -o ids.txt
$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz
$ seqkit sum --quiet test.fq.gz out.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028     test.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028     out.fq.gz

fx2tab and awk version

$ seqkit fx2tab test.fq.gz | awk '{print $1}' | head -n 10 > ids2.txt 
$ seqkit grep -f ids2.txt test.fq.gz -o out2.fq.gz 
[INFO] 10 patterns loaded from file

$ seqkit sum --quiet *.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028     out2.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028     out.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028     test.fq.gz

Info

$ seqkit version 
seqkit v2.8.2

$ uname -a
Linux mBio 6.6.41-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Jul 19 14:57:10 UTC 2024 x86_64 GNU/Linux

Try using cat -A to check the id files.
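
As a sketch of what that check can reveal: cat -A (GNU coreutils) marks each line end with $ and shows a carriage return as ^M, so IDs carrying Windows line endings or trailing blanks stand out. The file names below are made up for illustration.

```shell
# Build a throwaway ID file with a Windows-style line ending (CR LF).
tmpdir=$(mktemp -d)
printf '4b8af4c8-779b-45cd-9bce-7195b765bb06\r\n' > "$tmpdir/ids_crlf.txt"

# cat -A makes the hidden carriage return visible as ^M before the $ end marker.
cat -A "$tmpdir/ids_crlf.txt"

# Strip carriage returns to get a clean ID file, then check it again.
tr -d '\r' < "$tmpdir/ids_crlf.txt" > "$tmpdir/ids_clean.txt"
cat -A "$tmpdir/ids_clean.txt"
```

An ID with a hidden trailing character would not equal the sequence ID in the reads file, so it would match nothing.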

What??? How's that possible? awk's problem?
Did you try

$ seqkit seq -ni test.fq.gz -o ids.txt
$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz

Both commands print Terminated to stderr:

(base) petmedix@LS21:/bfx_share1/quick_share$ ~/seqkit seq -ni test.fq.gz -o ids.txt
Terminated
(base) petmedix@LS21:/bfx_share1/quick_share$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz
[WARN] 0 patterns loaded from file
Terminated

This is with a freshly downloaded version of seqkit. I wonder what could cause the binary to fail?
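
One thing that can be checked here: the shell prints Terminated when a process is killed by SIGTERM, which often points to something outside the program stopping it. Below is a minimal sketch of how that shows up in the exit status (128 + 15 = 143). It uses a stand-in child process rather than seqkit.

```shell
# Stand-in for the failing command: a child shell that sends SIGTERM to itself.
status=0
sh -c 'kill -TERM $$' || status=$?

# A status above 128 means "killed by signal (status - 128)"; 15 is SIGTERM.
echo "exit status: $status"
if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
fi
```

Running echo $? right after the failing seqkit command would show whether the same status appears there.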

No idea. Forget it.