piercelab / tcrmodel2

How to deal with no results from hmmsearch

ZhiYeG opened this issue

Please check the hmmsearch log below. Even with an E-value threshold of 1, no targets are returned, and the run then fails with this error:
Traceback (most recent call last):
File "/tcrmodel2//run_tcrmodel2.py", line 298, in <module>
app.run(main)
File "/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "//tools/tcrmodel2//run_tcrmodel2.py", line 94, in main
mhca_seq=seq_utils.trim_mhc(mhca_seq, "2", ".", out_dir)
File "/tools/tcrmodel2/scripts/seq_utils.py", line 57, in trim_mhc
start=int(line[59:66])
ValueError: invalid literal for int() with base 10: ''

hmmsearch :: search profile(s) against a sequence database
HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.


query HMM file: ./scripts/mhc_hmm/classII_alpha.hmm
target sequence database: /tmp.fa
show alignments in output: no
sequence inclusion threshold: E-value <= 1


Query: classII_mhc_alpha_seqs_rep.mafft [M=84]
Scores for complete sequences (score includes all domains):
--- full sequence --- --- best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Sequence Description
------- ------ ----- ------- ------ ----- ---- -- -------- -----------

[No hits detected that satisfy reporting thresholds]

Domain annotation for each sequence:

[No targets detected that satisfy reporting thresholds]

Internal pipeline statistics summary:

Query model(s): 1 (84 nodes)
Target sequences: 1 (100 residues searched)
Passed MSV filter: 0 (0); expected 0.0 (0.02)
Passed bias filter: 0 (0); expected 0.0 (0.02)
Passed Vit filter: 0 (0); expected 0.0 (0.001)
Passed Fwd filter: 0 (0); expected 0.0 (1e-05)
Initial search space (Z): 1 [actual number of targets]
Domain search space (domZ): 0 [number of targets reported over threshold]
CPU time: 0.00u 0.00s 00:00:00.00 Elapsed: 00:00:00.00
Mc/sec: 17.97

Hi @ZhiYeG ,

Thanks for reaching out about the hmmsearch issue you encountered. To help me investigate further, could you please share the sequence input you used for this run? This will allow me to understand the context of the error better.

Best,
Rui

tcra_seq
KQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVRMDSSYKLIFGSGTRLLVRPDIQNPDPAVYQLRDSKSSDKSVCLFTDFDSQTNVSQSKDSDVYITDKCVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSII
tcrb_seq
TGVSQNPRHKITKRGQNVTFRCDPISEHNRLYWYRQTLGQGPEFLTYFQNEAQLEKSRLLSDRFSAERPKGSFSTLEIQRTEQGDSAMYLCASSSWDTGELFFGEGSRLTVLEDLKNVFPPEVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWVNGKEVHSGVCTDPQPLKEQPALNDSRYALSSRLRVSATFWQNPRNHFRCQVQFYGLSENDEWTQDRAKPVTQIVSAEAWGRAD
pep_seq
RYPLTFGWCF
mhca_seq
MIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERIEKVEHSDLSFSKDWSFYLLYYTEFTPTEKDEYACRVNHVTLSQPKIVKWDRDM
mhcb_seq
GSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDEETGKVKAHSQTDRENLRIALRYYNQSEAGSHTLQMMFGCDVGSDGRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQITKRKWEAAHVAEQQRAYLEGTCVDGLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRW

Hi Rui Yin,
Thanks for the quick reply. I have pasted the whole FASTA file above. For now I just wrap that code in try/except so it is skipped when it fails.

I have also run into another problem. The error log is shown below.
I0822 05:12:05.554947 140554434131776 pipeline_multimer_custom_templates.py:221] Running monomer pipeline on chain C: pMHC
I0822 05:12:05.555079 140554434131776 pipeline_custom_templates.py:251] input_sequence is FEAQKAKANKAVDIEADHVGTYGISVYQSPGDIGQYTFEFDGDELFYVDLDKKETVWMLPEFGQLASFDPQGGLQNIAVVKHNLGVLTKRSNSTPATRHFVYQFMGECYFTNGTQRIRYVTRYIYNREEYVRYDSDVGEHRAVTELGRPDAEYWNSQPEILERTRAELDTVCRHNYEGPETHTSLR
WARNING chainbreak: B 1 2 25.097618392987012 data/templates/pdb/cp1/3c5z.pmhc.pdb
Traceback (most recent call last):
File "run_alphafold_tcrmodel2.3.py", line 712, in <module>
app.run(main)
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "run_alphafold_tcrmodel2.3.py", line 671, in main
predict_structure(
File "run_alphafold_tcrmodel2.3.py", line 269, in predict_structure
feature_dict = data_pipeline.process(
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/alphafold/data/pipeline_multimer_custom_templates.py", line 350, in process
chain_features = self._process_single_chain(
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/alphafold/data/pipeline_multimer_custom_templates.py", line 224, in _process_single_chain
chain_features = self._monomer_data_pipeline.process(
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/alphafold/data/pipeline_custom_templates.py", line 396, in process
template_features = create_single_template_features(
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/alphafold/data/custom_templates.py", line 150, in create_single_template_features
identities = sum(target_sequence[i] == template_full_sequence[j]
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/alphafold/data/custom_templates.py", line 150, in
identities = sum(target_sequence[i] == template_full_sequence[j]
IndexError: string index out of range
Traceback (most recent call last):
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2//run_tcrmodel2.py", line 301, in <module>
app.run(main)
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2/env/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/hpc-cache-pfs/home/monomer/tools/tcrmodel2//run_tcrmodel2.py", line 204, in main
for idx, line in enumerate(fh):
io.UnsupportedOperation: not readable

I have printed target_to_template_alignment, shown below. It starts to misalign at the entry 97: 98 (target index 96 is missing from the mapping). Let me know if this information is sufficient for you to dig into this. Many thanks for your help.
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 97: 98, 98: 99, 99: 100, 100: 101, 101: 102, 102: 103, 103: 104, 104: 105, 105: 106, 106: 107, 107: 108, 108: 109, 109: 110, 110: 111, 111: 112, 112: 113, 113: 114, 114: 115, 115: 116, 116: 117, 117: 118, 118: 119, 119: 120, 120: 121, 121: 122, 122: 123, 123: 124, 124: 125, 125: 126, 126: 127, 127: 128, 128: 129, 129: 130, 130: 131, 131: 132, 132: 133, 133: 134, 134: 135, 135: 136, 136: 137, 137: 138, 138: 139, 139: 140, 140: 141, 141: 142, 142: 143, 143: 144, 144: 145, 145: 146, 146: 147, 147: 148, 148: 149, 149: 150, 150: 151, 151: 152, 152: 153, 153: 154, 154: 155, 155: 156, 156: 157, 157: 158, 158: 159, 159: 160, 160: 161, 161: 162, 162: 163, 163: 164, 164: 165, 165: 166, 166: 167, 167: 168, 168: 169, 169: 170, 170: 171, 171: 172, 172: 173, 173: 174, 174: 175, 175: 176, 176: 177, 177: 178, 178: 179, 179: 180, 180: 181, 181: 182, 182: 183, 183: 184, 184: 185, 185: 186}
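
For readers who hit the same IndexError, a mapping like the one above can be checked before it is used to index the two sequences. The following is only a minimal sketch and not the code in custom_templates.py: the helper names `find_alignment_problems` and `count_identities` are hypothetical, and the short strings in the demo are toy data, not the real target or template sequences.

```python
from typing import Dict, List, Tuple


def find_alignment_problems(
    mapping: Dict[int, int], target_len: int, template_len: int
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Return (shifts, out_of_range) for a target-to-template index mapping.

    shifts: pairs where the offset (template index - target index) changes
            relative to the previous pair, i.e. where a gap starts.
    out_of_range: pairs whose indices fall outside either sequence.
    """
    shifts: List[Tuple[int, int]] = []
    out_of_range: List[Tuple[int, int]] = []
    prev_offset = None
    for i in sorted(mapping):
        j = mapping[i]
        if not (0 <= i < target_len and 0 <= j < template_len):
            out_of_range.append((i, j))
        offset = j - i
        if prev_offset is not None and offset != prev_offset:
            shifts.append((i, j))
        prev_offset = offset
    return shifts, out_of_range


def count_identities(target: str, template: str, mapping: Dict[int, int]) -> int:
    """Count identical aligned residues, skipping pairs that are out of range."""
    return sum(
        1
        for i, j in mapping.items()
        if 0 <= i < len(target) and 0 <= j < len(template) and target[i] == template[j]
    )


# Toy data only: the mapping skips target index 3 and then runs one position
# past the end of the 6-residue template.
toy_mapping = {0: 0, 1: 1, 2: 2, 4: 5, 5: 6}
shifts, out_of_range = find_alignment_problems(toy_mapping, target_len=6, template_len=6)
print("offset changes at:", shifts)
print("out-of-range pairs:", out_of_range)
print("identities:", count_identities("ABCDEF", "ABCXXE", toy_mapping))
```

Skipping out-of-range pairs only hides the symptom; it is a diagnostic aid for locating the shift, not a fix for the underlying input problem.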

Hi @ZhiYeG,

Thank you for providing the input sequences. Your target appears to be a class I TCR-pMHC complex. The mhcb_seq input you provided actually corresponds to the MHC alpha 1 and alpha 2 domain sequences, which should be passed as mhca_seq. In our program, mhcb_seq is intended only for the beta 1 domain sequence of class II MHC. For class I TCR-pMHC complexes, the program expects mhca_seq to hold the MHC class I alpha domain sequence, and mhcb_seq should be left empty (or left out completely). Adjusting your sequence input to the following worked on our machine without a problem:

--tcra_seq=KQEVTQIPAALSVPEGENLVLNCSFTDSAIYNLQWFRQDPGKGLTSLLLIQSSQREQTSGRLNASLDKSSGRSTLYIAASQPGDSATYLCAVRMDSSYKLIFGSGTRLLVRPDIQNPDPAVYQLRDSKSSDKSVCLFTDFDSQTNVSQSKDSDVYITDKCVLDMRSMDFKSNSAVAWSNKSDFACANAFNNSII --tcrb_seq=TGVSQNPRHKITKRGQNVTFRCDPISEHNRLYWYRQTLGQGPEFLTYFQNEAQLEKSRLLSDRFSAERPKGSFSTLEIQRTEQGDSAMYLCASSSWDTGELFFGEGSRLTVLEDLKNVFPPEVAVFEPSEAEISHTQKATLVCLATGFYPDHVELSWWVNGKEVHSGVCTDPQPLKEQPALNDSRYALSSRLRVSATFWQNPRNHFRCQVQFYGLSENDEWTQDRAKPVTQIVSAEAWGRAD --pep_seq=RYPLTFGWCF --mhca_seq=GSHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDEETGKVKAHSQTDRENLRIALRYYNQSEAGSHTLQMMFGCDVGSDGRFLRGYHQYAYDGKDYIALKEDLRSWTAADMAAQITKRKWEAAHVAEQQRAYLEGTCVDGLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRW

Let me know how it goes after adjusting your sequence input. I hope this is helpful!

Best,
Rui
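
The input rule in the reply above (class I: pass only mhca_seq; class II: pass both mhca_seq and mhcb_seq) can be captured in a small helper when runs are scripted. This is a rough sketch under stated assumptions: `build_tcrmodel2_args` is a hypothetical helper, and the `--mhcb_seq` flag name is assumed by analogy with the `--mhca_seq` flag shown in the thread. Other options the script may require (for example an output directory) are deliberately left out.

```python
from typing import List, Optional


def build_tcrmodel2_args(
    tcra_seq: str,
    tcrb_seq: str,
    pep_seq: str,
    mhca_seq: str,
    mhcb_seq: Optional[str] = None,
) -> List[str]:
    """Build sequence flags for run_tcrmodel2.py (hypothetical helper).

    For a class I complex, leave mhcb_seq as None so the flag is omitted.
    For a class II complex, pass the beta 1 domain sequence as mhcb_seq.
    """
    args = [
        f"--tcra_seq={tcra_seq}",
        f"--tcrb_seq={tcrb_seq}",
        f"--pep_seq={pep_seq}",
        f"--mhca_seq={mhca_seq}",
    ]
    if mhcb_seq:  # an empty string or None means "class I, no beta chain flag"
        args.append(f"--mhcb_seq={mhcb_seq}")  # flag name is an assumption
    return args


# Class I example with placeholder sequences (only the peptide is from the thread).
class_i_args = build_tcrmodel2_args(
    tcra_seq="TCRA_PLACEHOLDER",
    tcrb_seq="TCRB_PLACEHOLDER",
    pep_seq="RYPLTFGWCF",
    mhca_seq="MHC_ALPHA_PLACEHOLDER",
)
print(" ".join(class_i_args))
```

The returned list could be appended to a `python run_tcrmodel2.py` invocation; keeping the class I case as "omit the flag entirely" mirrors the advice to leave mhcb_seq out completely.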

OK, got it. Have you had a chance to check the other issue that I raised above? Was it also caused by the misclassification of type I and type II MHC?

Yes, I've checked. The other error you mentioned is related to the misclassification of type I and type II MHC. For class I targets, the accurate identification of the pMHC template is contingent on correctly identifying the alpha 1 and alpha 2 domains of the MHC alpha sequence using hmmsearch. To clarify, if the target is class I MHC, the MHC sequence in the input sequence target_sequence for create_single_template_features should strictly contain the alpha 1 and alpha 2 domains. If it lacks these domains or includes others (like alpha 3), then bypassing the seq_utils.trim_mhc step will result in an error.

The error you encountered raised a good point. To address it and to spare users the same inconvenience in the future, we've modified the program. It now exits with a user-friendly error message (e.g., "Fail to identify beta 1 domain sequence in the 'mhcb_seq' input of your class II MHC target. If your input target is a class I TCR-pMHC complex, then mhcb_seq variable should be left empty or left out completely.") rather than abruptly erroring out. We appreciate your feedback in helping us improve the tool. See commit a3872a6.

Best,
Rui
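
As an illustration of the kind of guard described in the last reply, the sketch below parses a fixed-width integer field and exits with a readable message when the field is blank, which is the situation behind the original `ValueError: invalid literal for int() with base 10: ''`. It is not the code from commit a3872a6: the helper names and the message wording are hypothetical, and only the 59:66 slice is taken from the traceback.

```python
from typing import Optional


def parse_int_field(line: str, start: int, end: int) -> Optional[int]:
    """Return the integer in line[start:end], or None if it is blank or non-numeric."""
    field = line[start:end].strip()
    if not field:
        return None
    try:
        return int(field)
    except ValueError:
        return None


def require_domain_start(line: str, input_name: str) -> int:
    """Parse the start position or exit with a readable message (hypothetical wording)."""
    start = parse_int_field(line, 59, 66)  # slice taken from the traceback
    if start is None:
        raise SystemExit(
            f"Could not identify an MHC domain in the '{input_name}' input. "
            "Check that the sequence matches the expected MHC class and chain."
        )
    return start


# A line with no hit: the slice is empty, so the helper reports None instead of raising.
print(parse_int_field("", 59, 66))
# A line padded so that the field holds a number.
print(parse_int_field(" " * 59 + "   12  ", 59, 66))
```

Returning None from the low-level parser and deciding what to do one level up keeps the "no hmmsearch hit" case distinct from genuine bugs, which a blanket try/except around the whole step would hide.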