weird HLA results
kylec opened this issue · comments
I'm using optitype/1.3.1 with solver=cbc set in config.ini and hla_reference_dna.fasta as the reference. I prefilter my reads with razers3. This is the output I'm getting with my own data:
A1 A2 B1 B2 C1 C2 Reads Objective
0 HLA00001 HLA00037 HLA00344 HLA00180 HLA00433 HLA00401 4305 4072.53
EDIT:
This is the output I get with the test data shipped with the code:
A1 A2 B1 B2 C1 C2 Reads Objective
0 HLA00001 HLA00001 HLA00146 HLA00381 HLA00433 HLA00430 1156 1135.192
The stdout
filtering for hla region reads for R1
convert filtered bam1 to fastq1
filtering for hla region reads for R2
convert filtered bam1 to fastq1
run hla typing
mapping with 16 threads...
0:00:00.51 Mapping EVL35_1.fastq to GEN reference...
0:00:25.57 Mapping EVL35_2.fastq to GEN reference...
0:00:58.46 Generating binary hit matrix.
Warning: PySam not available on the system. Falling back to primitive SAM parsing.
0:00:58.46 Loading alleles and read IDs from 01-filter-hla-read/output/EVL35/2018_01_18_18_39_13/2018_01_18_18_39_13_1.sam...
0:01:03.47 11179 alleles and 6968 reads found.
0:01:03.47 Initializing mapping matrix...
0:01:03.47 6968x11179 mapping matrix initialized. Populating 1549982 hits from SAM file...
10% completed
20% completed
0:04:55.05 1549982 elements filled. Matrix sparsity: 1 in 50.26
Warning: PySam not available on the system. Falling back to primitive SAM parsing.
0:04:55.42 Loading alleles and read IDs from 01-filter-hla-read/output/EVL35/2018_01_18_18_39_13/2018_01_18_18_39_13_2.sam...
0:05:00.29 11179 alleles and 6989 reads found.
0:05:00.29 Initializing mapping matrix...
0:05:00.30 6989x11179 mapping matrix initialized. Populating 1519093 hits from SAM file...
10% completed
0:08:47.45 1519093 elements filled. Matrix sparsity: 1 in 51.43
0:08:48.94 Alignment pairing completed. 6164 paired, 1561 unpaired, 34 discordant
0:08:52.71 temporary pruning of identical rows and columns
0:08:52.96 Size of mtx with unique rows and columns: (983, 890)
0:08:52.96 determining minimal set of non-overshadowed alleles
0:08:55.61 Keeping only the minimal number of required alleles (77,)
0:08:55.61 Creating compact model...
starting ilp solver with 1 threads...
0:08:55.93 Initializing OptiType model...
Welcome to the CBC MILP Solver
Version: 2.8
Build Date: Aug 5 2015
Revision Number: 2210
command line - /risapps/rhel6/cbc/2.8/bin/cbc -printingOptions all -import /tmp/tmp2d8uhj.pyomo.lp -import -stat=1 -solve -solu /tmp/tmp2d8uhj.pyomo.soln (default strategy 1)
Option for printingOptions changed from normal to all
Coin0009I CoinLpIO::readLp(): Maximization problem reformulated as minimization
Current default (if $ as parameter) for import is /tmp/tmp2d8uhj.pyomo.lp
Presolve 845 (-1) rows, 494 (-1) columns and 3059 (-1) elements
Statistics for presolved model
Problem has 845 rows, 494 columns (458 with objective) and 3059 elements
Column breakdown:
208 of type 0.0->inf, 1 of type 0.0->up, 0 of type lo->inf,
0 of type lo->up, 0 of type free, 0 of type fixed,
0 of type -inf->0.0, 0 of type -inf->up, 285 of type 0.0->1.0
Row breakdown:
0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0,
0 of type E other, 0 of type G 0.0, 6 of type G 1.0,
0 of type G other, 624 of type L 0.0, 0 of type L 1.0,
215 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other,
0 of type Free
Continuous objective value is -4072.53 - 0.01 seconds
Cgl0004I processed model has 839 rows, 494 columns (285 integer) and 2982 elements
Cbc0038I Solution found of -4072.53
Cbc0038I Before mini branch and bound, 285 integers at bound fixed and 25 continuous
Cbc0038I Mini branch and bound did not improve solution (0.02 seconds)
Cbc0038I After 0.02 seconds - Feasibility pump exiting with objective of -4072.53 - took 0.00 seconds
Cbc0012I Integer solution of -4072.53 found by feasibility pump after 0 iterations and 0 nodes (0.02 seconds)
Cbc0001I Search completed - best objective -4072.530000000001, took 0 iterations and 0 nodes (0.02 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from -4072.53 to -4072.53
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: -4072.53000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.03
Time (Wallclock seconds): 0.03
Total time (CPU seconds): 0.03 (Wallclock seconds): 0.04
0:08:56.29 Result dataframe has been constructed...
Dear Kyle,
For some reason the IMGT sequence identifiers did not get converted back to the customary HLA allele naming format, which for the first sample should be A*01:01 A*03:01 B*51:01 B*15:17 C*07:01 C*01:02.
The culprit is the get_types function in OptiTypePipeline.py, but I haven't seen this behavior before. Can you tell me what Python version you're using? What happens if you replace the first two lines (lines 151 and 152) of get_types with allele_id = str(allele_id)? Can you just run it on the test file and let me know what happens? Also, do the proper allele names show up on the coverage plot pdf?
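For reference, the conversion that should have happened can be sketched as a plain lookup. This is only an illustration: the dictionary below is a hypothetical stand-in for the allele table that OptiType loads, filled with the six ID/name pairs implied by the first sample in this thread (matched by column order), and to_4digit is a made-up helper name, not a function in OptiTypePipeline.py.

```python
# Hypothetical stand-in for OptiType's allele table: IMGT accession -> 4-digit name.
# The pairs are inferred by column order from the first sample in this thread.
IMGT_TO_4DIGIT = {
    "HLA00001": "A*01:01",
    "HLA00037": "A*03:01",
    "HLA00344": "B*51:01",
    "HLA00180": "B*15:17",
    "HLA00433": "C*07:01",
    "HLA00401": "C*01:02",
}


def to_4digit(imgt_id):
    """Translate one IMGT accession into its 4-digit allele name."""
    return IMGT_TO_4DIGIT[imgt_id]


# The six allele cells of the first result row reported above.
raw_row = ["HLA00001", "HLA00037", "HLA00344", "HLA00180", "HLA00433", "HLA00401"]
converted_row = [to_4digit(cell) for cell in raw_row]
print(" ".join(converted_row))
```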
I'm using Python 2.7.13 :: Continuum Analytics, Inc. The coverage plot shows proper allele names.
I added
def get_types(allele_id):
    allele_id = str(allele_id)
    aa = allele_id.split('_')
    if len(aa) == 1:
        return table.loc[aa[0]]['4digit']
    else:
        return table.loc[aa[0]]['4digit'] #+ '/' + table.loc[aa[1]]['4digit']
output
...
Result - Optimal solution found
Objective value: -1135.19200000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.02
Time (Wallclock seconds): 0.02
Total time (CPU seconds): 0.02 (Wallclock seconds): 0.02
0:00:40.91 Result dataframe has been constructed...
Traceback (most recent call last):
File "/rsrch2/ccp_rsch/kchang3/software/OptiType/OptiTypePipeline.py", line 411, in <module>
result_4digit = result.applymap(get_types)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4453, in applymap
return self.apply(infer)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4451, in infer
return lib.map_infer(x.asobject, func)
File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66645)
File "/rsrch2/ccp_rsch/kchang3/software/OptiType/OptiTypePipeline.py", line 157, in get_types
return table.loc[aa[0]]['4digit']
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1328, in __getitem__
return self._getitem_axis(key, axis=0)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1551, in _getitem_axis
self._has_valid_type(key, axis)
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1442, in _has_valid_type
error()
File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1429, in error
(key, self.obj._get_axis_name(axis)))
KeyError: ('the label [1135.192] is not in the [index]', u'occurred at index obj')
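The traceback is consistent with the conversion being applied to every cell of the result row, including the numeric columns: once str() is applied unconditionally, the objective value becomes the string '1135.192', which is not an allele ID. Below is a minimal sketch of that failure mode. It assumes a plain dict as a stand-in for the pandas lookup table, so it raises an ordinary KeyError rather than the pandas message, and get_types_unconditional is an illustrative name.

```python
# Hypothetical stand-in for the allele table; a dict replaces the pandas DataFrame.
table = {"HLA00001": "A*01:01"}


def get_types_unconditional(cell):
    """Mimics the edited function: every cell is stringified, then looked up."""
    key = str(cell).split("_")[0]
    return table[key]


# One allele cell and the objective value from the test-data run above.
cells = ["HLA00001", 1135.192]

results = []
for cell in cells:
    try:
        results.append(get_types_unconditional(cell))
    except KeyError as err:
        # The float was turned into the string '1135.192', which is no allele ID.
        results.append("KeyError: " + str(err))

print(results)
```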
Attached is my env
optitype_env.txt
Same problem here: "the IMGT sequence identifiers did not get converted back to the customary HLA allele naming format".
Preliminary debugging suggests that the pandas version may be relevant.
A simple change to the get_types function seems to work:
Change line 151 from
if not isinstance(allele_id, str):
to
if isinstance(allele_id, float):
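Put together, the suggested change amounts to passing float cells through untouched and converting everything else. The sketch below assumes that line 152 of the original function returns the cell unchanged (the thread does not show that line) and again uses a dict in place of the pandas table; get_types_float_guard is an illustrative name. Why the original str check misfires is not established in this thread; one possibility is that the IDs arrive as a string type other than str, such as unicode under Python 2.

```python
# Hypothetical stand-in for the allele table; a dict replaces the pandas DataFrame.
# The two pairs are inferred by column order from the first sample in this thread.
table = {
    "HLA00001": "A*01:01",
    "HLA00037": "A*03:01",
}


def get_types_float_guard(cell):
    """Leave float cells alone; translate allele IDs to 4-digit names."""
    if isinstance(cell, float):
        return cell  # assumed behaviour of the unchanged line 152
    ids = cell.split("_")
    # As in the function posted above, only the first ID of a joined pair is used.
    return table[ids[0]]


# Two allele cells plus the numeric columns, here assumed to be stored as floats.
row = ["HLA00001", "HLA00037", 4305.0, 4072.53]
converted = [get_types_float_guard(cell) for cell in row]
print(converted)
```

Note that an integer cell would not be caught by the float check and would fail at .split, so this variant relies on the numeric columns holding floats.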