seqkit grep read name partial match
avilella opened this issue
Can seqkit grep find partial matches of the read ID in a fastq.gz input file?
For example, for a fastq.gz file with read id like the one below:
@4b8af4c8-779b-45cd-9bce-7195b765bb06 runid=c4c0924f086b14c61cf8eb6b1607978c6a9990bd read=68104 ch=2069 start_time=2024-07-30T06:12:28.276857+01:00 flow_cell_id=PAW35759 protocol_group_id=LSK114-SB-B1350 sample_id=SB-B1350 parent_read_id=4b8af4c8-779b-45cd-9bce-7195b765bb06 basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.3.0
Could I have a list of read IDs in a file search.txt that look like the string before runid=..., e.g.:
4b8af4c8-779b-45cd-9bce-7195b765bb06
Thanks
4b8af4c8-779b-45cd-9bce-7195b765bb06 is exactly the default sequence ID, so just use the default options.
seqkit grep -f search.txt reads.fq.gz -o out.fq.gz
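For reference, the default sequence ID is the header text before the first whitespace, so everything from runid= onward is ignored when matching. A minimal sketch of that rule using only awk, in case it helps to build search.txt by hand. The file demo.fq and its two records are made up for illustration (only the first read ID is taken from the question), and the awk line assumes strict 4-line FASTQ records:

```shell
# Hypothetical two-record FASTQ file, created here only for illustration.
printf '%s\n' \
  '@4b8af4c8-779b-45cd-9bce-7195b765bb06 runid=example read=68104 ch=2069' \
  'ACGT' '+' 'IIII' \
  '@read2 runid=example read=1 ch=1' \
  'TTGA' '+' 'IIII' > demo.fq

# Headers are lines 1, 5, 9, ... in 4-line FASTQ.
# Strip the leading "@" and keep only the first whitespace-delimited field.
awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' demo.fq > search.txt

cat search.txt
```

Each line of search.txt is then a bare read ID of the kind that seqkit grep -f matches against by default.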
It's very strange. Please provide the seqkit version,
seqkit version
and a few records of the reads file as an attached file.
seqkit head -n 10 PAW35759_pass_83522a2c_c4c0924f_56.fastq.gz -o test.fq.gz
Sorry, I mean FASTQ records, not just the sequence ID.
I also tried with seqkit v2.8.2, same result.
I have no idea now. It worked for me.
$ seqkit seq -ni test.fq.gz -o ids.txt
$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz
$ seqkit sum --quiet test.fq.gz out.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028 test.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028 out.fq.gz
fx2tab and awk version
$ seqkit fx2tab test.fq.gz | awk '{print $1}' | head -n 10 > ids2.txt
$ seqkit grep -f ids2.txt test.fq.gz -o out2.fq.gz
[INFO] 10 patterns loaded from file
$ seqkit sum --quiet *.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028 out2.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028 out.fq.gz
seqkit.v0.1_DLS_k0_4ef19986849aad9428ede02bf8ff6028 test.fq.gz
Info
$ seqkit version
seqkit v2.8.2
$ uname -a
Linux mBio 6.6.41-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Jul 19 14:57:10 UTC 2024 x86_64 GNU/Linux
Try using cat -A to check the ID files.
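One thing cat -A (GNU coreutils) would reveal is a trailing carriage return, shown as ^M before the end-of-line $, which can stop an ID from matching. Whether that is the cause here is only a guess. A rough sketch of checking for it and stripping it; ids_crlf.txt and ids_clean.txt are hypothetical file names used for illustration:

```shell
# Hypothetical ID file with Windows-style (CRLF) line endings, for illustration only.
printf 'read1\r\nread2\r\n' > ids_crlf.txt

# Count lines that end in a carriage return; cat -A would show these as "^M$".
cr=$(printf '\r')
bad=$(grep -c "${cr}\$" ids_crlf.txt)
echo "lines with trailing CR: $bad"

# Strip all carriage returns to get a clean pattern file.
tr -d '\r' < ids_crlf.txt > ids_clean.txt
```

The cleaned file can then be passed to seqkit grep -f in place of the original.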
What??? How's that possible? awk's problem?
Did you try
$ seqkit seq -ni test.fq.gz -o ids.txt
$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz
Both commands print Terminated to stderr:
(base) petmedix@LS21:/bfx_share1/quick_share$ ~/seqkit seq -ni test.fq.gz -o ids.txt
Terminated
(base) petmedix@LS21:/bfx_share1/quick_share$ seqkit grep -f ids.txt test.fq.gz -o out.fq.gz
[WARN] 0 patterns loaded from file
Terminated
This is with a freshly downloaded version of seqkit. I wonder what could cause the binary to fail?
No idea. Forget it.