No alleles called because of possible alignment issue

Question

No alleles called because of possible alignment issue

toddcreasy opened this issue 6 years ago · comments

Hi,

I'm trying to map and call alleles on fastq files from TCGA. However, out of the 400 samples I've run, only a very small number were called. The failed ones (as indicated from the output below) seem to have a possible read mapping problem and thus cbc can't solve. Is there something I'm missing from the log output that would indicate a problem? If so, how might I fix it?

I'm using the latest versions of razers3 and optitype.

mapping with 4 threads...

0:00:01.50 Mapping C500.TCGA-BT-A42B-10A-01D-A23K-08.3_gdc_realn_R1.fastq to GEN reference...

0:29:46.60 Mapping C500.TCGA-XF-A9T2-10A-01D-A42H-08.2_gdc_realn_R2.fastq to GEN reference...

0:44:43.01 Generating binary hit matrix.
0:44:43.09 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_11_14_41_1.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K...6K...7K...8K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(
dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:44:47.81 8604 reads loaded. Creating dataframe...
0:44:48.28 Dataframes created. Shape: 8604 x 11179, hits: 2717490 (2717490), sparsity: 1 in 35.39
0:44:48.89 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_11_14_41_2.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K...6K...7K...8K...
0:44:52.14 8222 reads loaded. Creating dataframe...
0:44:52.59 Dataframes created. Shape: 8222 x 11179, hits: 1917861 (1917861), sparsity: 1 in 47.93
0:44:53.58 Alignment pairing completed. 0 paired, 16826 unpaired, 0 discordant

WARNING: Less than 10% of reads could be paired. Consider an appropriate unpaired_weight setting in your config file (currently 0.000), because you may need to resort to using u
npaired reads.

0:44:57.43 temporary pruning of identical rows and columns

0:44:57.44 Size of mtx with unique rows and columns: (0, 1)
0:44:57.44 determining minimal set of non-overshadowed alleles

0:44:57.45 Keeping only the minimal number of required alleles (1,)

0:44:57.45 Creating compact model...

starting ilp solver with 4 threads...

0:44:57.45 Initializing OptiType model...
WARNING: Constant objective detected, replacing with a placeholder to prevent
solver failure.
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Oct 13 2018

command line - /home/kgvx167/anaconda3/bin/cbc -threads 4 -printingOptions all -import /home/kgvx167/tmpolFvmh.pyomo.lp -stat=1 -solve -solu /home/kgvx167/tmpolFvmh.pyomo.soln (
default strategy 1)
threads was changed from 0 to 4
Option for printingOptions changed from normal to all
CoinLpIO::readLp(): Maximization problem reformulated as minimization
Presolve 0 (-4) rows, 0 (-3) columns and 0 (-5) elements
Statistics for presolved model
Original problem has 1 integers (1 of which binary)

Problem has 0 rows, 0 columns (0 with objective) and 0 elements
Column breakdown:
0 of type 0.0->inf, 0 of type 0.0->up, 0 of type lo->inf,
0 of type lo->up, 0 of type free, 0 of type fixed,
0 of type -inf->0.0, 0 of type -inf->up, 0 of type 0.0->1.0
Row breakdown:
0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0,
0 of type E other, 0 of type G 0.0, 0 of type G 1.0,
0 of type G other, 0 of type L 0.0, 0 of type L 1.0,
0 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other,
0 of type Free
Continuous objective value is 0 - 0.00 seconds
Cgl0004I processed model has 0 rows, 0 columns (0 integer (0 of which binary)) and 0 elements
Cbc3007W No integer variables - nothing to do
Cuts at root node changed objective from 0 to -1.79769e+308
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value: 0.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.02
Time (Wallclock seconds): 0.00

Total time (CPU seconds): 0.04 (Wallclock seconds): 0.01

/home/kgvx167/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/series.py:851: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self.loc[key]

0:44:57.78 Result dataframe has been constructed...

andras86 · Answer 1 · Sat Dec 15 2018 01:25:03 GMT+0800 (China Standard Time)

As I assume it's not due to massive insert length, it might have to do with read names. Can you put the first few lines of those R1 and R2 fastq files in here? I already don't like the way those filenames look, they don't just differ in R1/R2 but their prefixes are totally different. Could be the case for the read IDs inside them as well.
Since OptiType processes two single-end mapped files and pairs them up later based on the read IDs, if they have different prefixes for R1 and R2 reads, it won't be able to pair them. If that's the case, you'll have to hack something in to make it work. I can help you with that after the weekend.
Btw, have you actually checked if those fastq files belong together? It would be such unusual practice to give corresponding R1 and R2 files different prefixes...

Todd Creasy · Answer 2 · Sat Dec 15 2018 01:27:58 GMT+0800 (China Standard Time)

Oh man. I can't believe I didn't see that. Yes, I don't like the filenames look either :) There is clearly a bug on my end. Thanks for noticing.

Todd Creasy · Answer 3 · Sat Dec 15 2018 02:29:11 GMT+0800 (China Standard Time)

Hi again,

I've fixed the read pair issue but I'm still getting a lot of files with no reads mapping or some other issue. I've put 3 examples below where the results fail. The final .tsv file has only A01:01 A01:01 with 0 reads mapped.

==== ERROR EXAMPLE 1 ====

mapping with 4 threads...

0:00:01.07 Mapping C500.TCGA-DK-AA6M-10A-01D-A394-08.2_gdc_realn_R1.fastq to GEN reference...

0:18:58.54 Mapping C500.TCGA-DK-AA6M-10A-01D-A394-08.2_gdc_realn_R2.fastq to GEN reference...

0:34:20.73 Generating binary hit matrix.
0:34:20.84 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_58_1.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ..
.) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:34:25.09 5380 reads loaded. Creating dataframe...
0:34:25.62 Dataframes created. Shape: 5380 x 11179, hits: 1576631 (1576631), sparsity: 1 in 38.15
0:34:26.66 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_58_2.bam started. Number of HLA reads loaded (updated every thousand):
1K...[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
Traceback (most recent call last):
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 312, in
pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1])
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 209, in pysam_to_hdf
for aln in sam:
File "pysam/libcalignmentfile.pyx", line 1854, in pysam.libcalignmentfile.AlignmentFile.next
IOError: truncated file

==== ERROR EXAMPLE 2 ====
mapping with 4 threads...

0:00:00.68 Mapping C500.TCGA-FD-A5BX-10A-01D-A26K-08.5_gdc_realn_R1.fastq to GEN reference...

0:19:25.13 Mapping C500.TCGA-FD-A5BX-10A-01D-A26K-08.5_gdc_realn_R2.fastq to GEN reference...

0:36:35.57 Generating binary hit matrix.
0:36:35.79 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_59_1.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K...5K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ..
.) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:36:42.89 5909 reads loaded. Creating dataframe...
0:36:43.19 Dataframes created. Shape: 5909 x 11179, hits: 2083469 (2083469), sparsity: 1 in 31.71
Traceback (most recent call last):
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 312, in
pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1])
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 186, in pysam_to_hdf
sam = pysam.AlignmentFile(samfile, sam_or_bam)
File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 983, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

=== ERROR EXAMPLE 3 ===
mapping with 4 threads...

0:00:01.07 Mapping C500.TCGA-4Z-AA7S-10A-01D-A394-08.2_gdc_realn_R1.fastq to GEN reference...

0:18:55.35 Mapping C500.TCGA-4Z-AA7S-10A-01D-A394-08.2_gdc_realn_R2.fastq to GEN reference...

0:39:02.31 Generating binary hit matrix.
0:39:02.35 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_54_1.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...3K...4K.../Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py:239: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) in
stead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
pos_df = pd.DataFrame.from_items(iter(hits.items())).T

0:39:03.89 4359 reads loaded. Creating dataframe...
0:39:04.06 Dataframes created. Shape: 4359 x 11179, hits: 1346154 (1346157), sparsity: 1 in 36.20
0:39:05.59 Loading /ts18/ngs/studies/TCGA_BLCA_Mutect/optitype/2018_12_14_12_42_54_2.bam started. Number of HLA reads loaded (updated every thousand):
1K...2K...
0:39:06.26 2602 reads loaded. Creating dataframe...
0:39:06.36 Dataframes created. Shape: 2602 x 11179, hits: 573528 (573528), sparsity: 1 in 50.72
0:39:06.79 Alignment pairing completed. 0 paired, 6961 unpaired, 0 discordant

WARNING: Less than 10% of reads could be paired. Consider an appropriate unpaired_weight setting in your config file (currently 0.000), because you may need to resort to using u
npaired reads.

0:39:09.49 temporary pruning of identical rows and columns

0:39:09.51 Size of mtx with unique rows and columns: (0, 1)
0:39:09.51 determining minimal set of non-overshadowed alleles

0:39:09.51 Keeping only the minimal number of required alleles (1,)

0:39:09.51 Creating compact model...

starting ilp solver with 4 threads...

0:39:09.51 Initializing OptiType model...
WARNING: Constant objective detected, replacing with a placeholder to prevent
solver failure.
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Oct 13 2018

command line - /home/kgvx167/anaconda3/bin/cbc -threads 4 -printingOptions all -import /home/kgvx167/tmpi3v0gI.pyomo.lp -stat=1 -solve -solu /home/kgvx167/tmpi3v0gI.pyomo.soln (
default strategy 1)
threads was changed from 0 to 4
Option for printingOptions changed from normal to all
CoinLpIO::readLp(): Maximization problem reformulated as minimization
Presolve 0 (-4) rows, 0 (-3) columns and 0 (-5) elements
Statistics for presolved model
Original problem has 1 integers (1 of which binary)

Problem has 0 rows, 0 columns (0 with objective) and 0 elements
Column breakdown:
0 of type 0.0->inf, 0 of type 0.0->up, 0 of type lo->inf,
0 of type lo->up, 0 of type free, 0 of type fixed,
0 of type -inf->0.0, 0 of type -inf->up, 0 of type 0.0->1.0
Row breakdown:
0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0,
0 of type E other, 0 of type G 0.0, 0 of type G 1.0,
0 of type G other, 0 of type L 0.0, 0 of type L 1.0,
0 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other,
0 of type Free
Continuous objective value is 0 - 0.01 seconds
Cgl0004I processed model has 0 rows, 0 columns (0 integer (0 of which binary)) and 0 elements
Cbc3007W No integer variables - nothing to do
Cuts at root node changed objective from 0 to -1.79769e+308
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value: 0.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.04
Time (Wallclock seconds): 0.00

Total time (CPU seconds): 0.09 (Wallclock seconds): 0.01

/home/kgvx167/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/series.py:851: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self.loc[key]

0:39:09.85 Result dataframe has been constructed...

Todd Creasy · Answer 4 · Tue Dec 18 2018 01:05:14 GMT+0800 (China Standard Time)

Here is another error (error example 4):

mapping with 1 threads...

0:00:00.88 Mapping C500.TCGA-CF-A47T-10A-01D-A23U-08.3_gdc_realn_R1.fastq to GEN reference...

0:17:04.69 Mapping C500.TCGA-CF-A47T-10A-01D-A23U-08.3_gdc_realn_R2.fastq to GEN reference...

0:32:02.68 Generating binary hit matrix.
Traceback (most recent call last):
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/OptiTypePipeline.py", line 309, in
pos, read_details = ht.pysam_to_hdf(bam_paths[0])
File "/Biomarker/ngs/software/OptiType/OptiType-1.3.2/hlatyper.py", line 186, in pysam_to_hdf
sam = pysam.AlignmentFile(samfile, sam_or_bam)
File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 944, in pysam.libcalignmentfile.AlignmentFile._open
File "pysam/libchtslib.pyx", line 366, in pysam.libchtslib.HTSFile.check_truncation
IOError: no BGZF EOF marker; file may be truncated

xf78 · Answer 5 · Thu Nov 14 2019 11:33:43 GMT+0800 (China Standard Time)

I have account the same issue like this:

Traceback (most recent call last):
File "/Analysis/bioinfo/xclv/HLA_typing/OptiType/OptiTypePipeline.py", line 312, in
pos2, read_details2 = ht.pysam_to_hdf(bam_paths[1])
File "/Analysis/bioinfo/xclv/HLA_typing/OptiType/hlatyper.py", line 186, in pysam_to_hdf
sam = pysam.AlignmentFile(samfile, sam_or_bam)
File "pysam/libcalignmentfile.pyx", line 736, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 985, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

At first I thought there was something wrong with my fastq data, but I verified that my data was ok.

my datas are so large(>61M,10X RNA)，so i wonder if it is because my data is too large,and i parallel run OptiType at my datas, so i change my way to run OptiType sample by sample,and this error strangly disappear.