weird HLA results

kylec opened this issue 6 years ago

I'm using optitype/1.3.1 with solver=cbc set in config.ini and hla_reference_dna.fasta as the reference. I prefilter my reads with razers3. This is the output I get with my own data:

    A1        A2        B1        B2        C1        C2        Reads  Objective
0   HLA00001  HLA00037  HLA00344  HLA00180  HLA00433  HLA00401  4305   4072.53

EDIT: This is the output I get with the test data shipped with the code:

    A1        A2        B1        B2        C1        C2        Reads  Objective
0   HLA00001  HLA00001  HLA00146  HLA00381  HLA00433  HLA00430  1156   1135.192

The stdout
filtering for hla region reads for R1
convert filtered bam1 to fastq1
filtering for hla region reads for R2
convert filtered bam1 to fastq1
run hla typing

mapping with 16 threads...

0:00:00.51 Mapping EVL35_1.fastq to GEN reference...

0:00:25.57 Mapping EVL35_2.fastq to GEN reference...

0:00:58.46 Generating binary hit matrix.
Warning: PySam not available on the system. Falling back to primitive SAM parsing.
0:00:58.46 Loading alleles and read IDs from 01-filter-hla-read/output/EVL35/2018_01_18_18_39_13/2018_01_18_18_39_13_1.sam...
0:01:03.47 11179 alleles and 6968 reads found.
0:01:03.47 Initializing mapping matrix...
0:01:03.47 6968x11179 mapping matrix initialized. Populating 1549982 hits from SAM file...
10% completed
20% completed
0:04:55.05 1549982 elements filled. Matrix sparsity: 1 in 50.26
Warning: PySam not available on the system. Falling back to primitive SAM parsing.
0:04:55.42 Loading alleles and read IDs from 01-filter-hla-read/output/EVL35/2018_01_18_18_39_13/2018_01_18_18_39_13_2.sam...
0:05:00.29 11179 alleles and 6989 reads found.
0:05:00.29 Initializing mapping matrix...
0:05:00.30 6989x11179 mapping matrix initialized. Populating 1519093 hits from SAM file...
10% completed
0:08:47.45 1519093 elements filled. Matrix sparsity: 1 in 51.43
0:08:48.94 Alignment pairing completed. 6164 paired, 1561 unpaired, 34 discordant

0:08:52.71 temporary pruning of identical rows and columns

0:08:52.96 Size of mtx with unique rows and columns: (983, 890)
0:08:52.96 determining minimal set of non-overshadowed alleles

0:08:55.61 Keeping only the minimal number of required alleles (77,)

0:08:55.61 Creating compact model...

starting ilp solver with 1 threads...

0:08:55.93 Initializing OptiType model...
Welcome to the CBC MILP Solver
Version: 2.8
Build Date: Aug 5 2015
Revision Number: 2210

command line - /risapps/rhel6/cbc/2.8/bin/cbc -printingOptions all -import /tmp/tmp2d8uhj.pyomo.lp -import -stat=1 -solve -solu /tmp/tmp2d8uhj.pyomo.soln (default strategy 1)
Option for printingOptions changed from normal to all
Coin0009I CoinLpIO::readLp(): Maximization problem reformulated as minimization
Current default (if $ as parameter) for import is /tmp/tmp2d8uhj.pyomo.lp
Presolve 845 (-1) rows, 494 (-1) columns and 3059 (-1) elements
Statistics for presolved model

Problem has 845 rows, 494 columns (458 with objective) and 3059 elements
Column breakdown:
208 of type 0.0->inf, 1 of type 0.0->up, 0 of type lo->inf,
0 of type lo->up, 0 of type free, 0 of type fixed,
0 of type -inf->0.0, 0 of type -inf->up, 285 of type 0.0->1.0
Row breakdown:
0 of type E 0.0, 0 of type E 1.0, 0 of type E -1.0,
0 of type E other, 0 of type G 0.0, 6 of type G 1.0,
0 of type G other, 624 of type L 0.0, 0 of type L 1.0,
215 of type L other, 0 of type Range 0.0->1.0, 0 of type Range other,
0 of type Free
Continuous objective value is -4072.53 - 0.01 seconds
Cgl0004I processed model has 839 rows, 494 columns (285 integer) and 2982 elements
Cbc0038I Solution found of -4072.53
Cbc0038I Before mini branch and bound, 285 integers at bound fixed and 25 continuous
Cbc0038I Mini branch and bound did not improve solution (0.02 seconds)
Cbc0038I After 0.02 seconds - Feasibility pump exiting with objective of -4072.53 - took 0.00 seconds
Cbc0012I Integer solution of -4072.53 found by feasibility pump after 0 iterations and 0 nodes (0.02 seconds)
Cbc0001I Search completed - best objective -4072.530000000001, took 0 iterations and 0 nodes (0.02 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from -4072.53 to -4072.53
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)

Result - Optimal solution found

Objective value: -4072.53000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.03
Time (Wallclock seconds): 0.03

Total time (CPU seconds): 0.03 (Wallclock seconds): 0.04

0:08:56.29 Result dataframe has been constructed...

andras86 · Answer 1 · Wed Jan 24 2018 17:04:44 GMT+0800 (China Standard Time)

Dear Kyle,

For some reason the IMGT sequence identifiers did not get converted back to the customary HLA allele naming format, which for the first sample should be A*01:01 A*03:01 B*51:01 B*15:17 C*07:01 C*01:02.

The culprit is the get_types function in OptiTypePipeline.py but I haven't seen this behavior before. Can you tell me what Python version you're using? What happens if you replace the first two lines (lines 151 and 152) of get_types with allele_id = str(allele_id)? Can you just run it on the test file and let me know what happens? Also, do the proper allele names show up on the coverage plot pdf?

kylec · Answer 2 · Thu Jan 25 2018 02:11:02 GMT+0800 (China Standard Time)

I'm using Python 2.7.13 :: Continuum Analytics, Inc. The coverage plot shows the proper allele names.
I changed get_types as suggested:

def get_types(allele_id):
    allele_id = str(allele_id)
    aa = allele_id.split('_')
    if len(aa) == 1:
        return table.loc[aa[0]]['4digit']
    else:
        return table.loc[aa[0]]['4digit']  #+ '/' + table.loc[aa[1]]['4digit']

Output on the test file:

...
Result - Optimal solution found

Objective value:                -1135.19200000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.02
Time (Wallclock seconds):       0.02

Total time (CPU seconds):       0.02   (Wallclock seconds):       0.02


 0:00:40.91 Result dataframe has been constructed...
Traceback (most recent call last):
  File "/rsrch2/ccp_rsch/kchang3/software/OptiType/OptiTypePipeline.py", line 411, in <module>
    result_4digit = result.applymap(get_types)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4453, in applymap
    return self.apply(infer)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
    results[i] = func(v)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/frame.py", line 4451, in infer
    return lib.map_infer(x.asobject, func)
  File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66645)
  File "/rsrch2/ccp_rsch/kchang3/software/OptiType/OptiTypePipeline.py", line 157, in get_types
    return table.loc[aa[0]]['4digit']
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1328, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1551, in _getitem_axis
    self._has_valid_type(key, axis)
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1442, in _has_valid_type
    error()
  File "/rsrch2/ccp_rsch/kchang3/.conda/envs/optitype-env/lib/python2.7/site-packages/pandas/core/indexing.py", line 1429, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: ('the label [1135.192] is not in the [index]', u'occurred at index obj')

Attached is my env
optitype_env.txt

Bin Song · Answer 3 · Fri Apr 17 2020 15:51:23 GMT+0800 (China Standard Time)

I have the same problem: "the IMGT sequence identifiers did not get converted back to the customary HLA allele naming format".
Preliminary debugging suggests that the pandas version may be relevant.
A simple change to the get_types function seems to work.
Change line 151 from

if not isinstance(allele_id, str):

to

if isinstance(allele_id, float):
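
A minimal sketch of the idea behind this change, not the actual OptiType code. The traceback earlier in the thread shows that get_types is applied to every cell of the result, including the numeric Reads and Objective columns, so numeric values have to pass through untouched while everything else is looked up. One plausible reason the original check fails (an assumption, not confirmed in this thread) is that under Python 2 the identifiers arrive as unicode rather than str, so they are returned unconverted. The dictionary FOUR_DIGIT and the function get_types_sketch below are hypothetical stand-ins for OptiType's allele table and get_types. The six ID/name pairs are the ones given for the first sample in this thread, matched in column order.

```python
import numbers

# Hypothetical stand-in for OptiType's allele table (not the real data file).
# Pairs follow the first sample in this thread, matched in column order.
FOUR_DIGIT = {
    "HLA00001": "A*01:01",
    "HLA00037": "A*03:01",
    "HLA00344": "B*51:01",
    "HLA00180": "B*15:17",
    "HLA00433": "C*07:01",
    "HLA00401": "C*01:02",
}


def get_types_sketch(value):
    """Map an IMGT identifier to its 4-digit name; pass numbers through."""
    # Numeric cells (Reads, Objective) are returned unchanged, so they are
    # never used as lookup keys.
    if isinstance(value, numbers.Number):
        return value
    # Any other value is treated as text, whatever its exact string type.
    # Only the part before the first '_' is used, as in the code above.
    key = str(value).split("_")[0]
    return FOUR_DIGIT[key]


# One result row in the layout shown at the top of this thread.
row = ["HLA00001", "HLA00037", "HLA00344", "HLA00180",
       "HLA00433", "HLA00401", 4305, 4072.53]
converted = [get_types_sketch(cell) for cell in row]
print(converted)
```

Checking for numbers rather than for one specific string type is a design choice in this sketch: it does not depend on whether pandas hands back str or unicode objects, and it also covers integer cells.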