another sumstat error with sqtl nominal data: Boolean index has wrong length

Question

hsun3163 opened this issue 2 years ago

/home/hs3163//sQTL_data_intergration.21.yml False False
Total number of sumstats:  1
{'/home/hs3163//sqtl_chr21.txt': {'ID': 'GENE,CHR,POS,A0,A1', 'CHR': 'chrom', 'POS': 'pos', 'SNP': 'variant_id', 'A0': 'ref', 'A1': 'alt', 'STAT': 'beta', 'SE': 'se', 'P': 'pvalue', 'TSS_D': 'tss_distance', 'maf': 'maf', 'n': 'n', 'ma_samples': 'ma_samples', 'ac': 'ma_count', 'GENE': 'molecular_trait_id', 'molecular_trait_object_id': 'molecular_trait_object_id'}}
/home/hs3163/miniconda3/lib/python3.9/site-packages/cugg/utils.py:27: UserWarning: There are SNPs 138: REF:ALT = ALT:REF. They will be removed.
  warnings.warn("There are SNPs {}: REF:ALT = ALT:REF. They will be removed.".format(sum(indels)))
Total rows of query:  2120962 Total rows of subject:  2524500
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/3053576.1.high_mem.q/ipykernel_45765/125517929.py in <module>
     67         subject.index = namebyordA0_A1(subject[["GENE","CHR","POS","A0","A1"]],cols=["GENE","CHR","POS","A0","A1"])
     68         subject = subject.sort_index()
---> 69     nq,_ = snps_match(query,subject,keep_ambiguous)
     70     nq = nq.loc[:,~nq.columns.duplicated()] # Remove duplicated columns due to order of columns difference in subject and query
     71     nqs.append(nq)

/tmp/3053576.1.high_mem.q/ipykernel_45765/125517929.py in snps_match(query, subject, keep_ambiguous)
     15         for g in genes_query.unique():
     16             if g in query.keys() and g in subject.keys():
---> 17                 new_q,new_s = snps_match_dup(query[g],subject[g],keep_ambiguous)
     18                 new_query.append(new_q)
     19                 new_subject.append(new_s)

~/miniconda3/lib/python3.9/site-packages/cugg/utils.py in snps_match_dup(query, subject, keep_ambiguous)
    263     #update beta and snp info
    264     new_query = pd.concat([new_subject.iloc[:,:5],query.loc[pm.qidx].iloc[:,5:]],axis=1)
--> 265     new_query.loc[list(pm.flip) , "STAT"] = -new_query.STAT[list(pm.flip)]
    266     return new_query, new_subject
    267 

~/miniconda3/lib/python3.9/site-packages/pandas/core/series.py in __getitem__(self, key)
    979 
    980         if com.is_bool_indexer(key):
--> 981             key = check_bool_indexer(self.index, key)
    982             key = np.asarray(key, dtype=bool)
    983             return self._get_values(key)

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py in check_bool_indexer(index, key)
   2386         # key may contain nan elements, check_array_indexer needs bool array
   2387         result = pd_array(result, dtype=bool)
-> 2388     return check_array_indexer(index, result)
   2389 
   2390 

~/miniconda3/lib/python3.9/site-packages/pandas/core/indexers/utils.py in check_array_indexer(array, indexer)
    577         # GH26658
    578         if len(indexer) != len(array):
--> 579             raise IndexError(
    580                 f"Boolean index has wrong length: "
    581                 f"{len(indexer)} instead of {len(array)}"

IndexError: Boolean index has wrong length: 38749 instead of 49799
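
The last frame can be reproduced in isolation. The sketch below uses made-up toy data (the names `stat` and `flip` are illustrative and not taken from cugg): indexing a pandas Series with a plain list of booleans whose length differs from the Series raises the same kind of IndexError. In the traceback above, the indexer `list(pm.flip)` has 38749 entries while `new_query.STAT` has 49799 rows.

```python
import pandas as pd

# Hypothetical toy data: five statistics but only three flip flags,
# mimicking a boolean mask that does not line up with the Series.
stat = pd.Series([0.1, -0.2, 0.3, 0.4, -0.5])
flip = [True, False, True]

try:
    stat[flip]  # boolean list shorter than the Series
    message = None
except IndexError as err:
    message = str(err)

print(message)
```

The lengths in the message give the size of the mask first and the size of the indexed Series second, so they show which of the two objects is shorter than expected.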

Investigating

hsun3163 commented 2 years ago

fixed

hsun3163 · Answer 1 · Wed Nov 02 2022 03:10:14 GMT+0800 (China Standard Time)

It is not due to the same error as #424

hsun3163 · Answer 2 · Wed Nov 02 2022 03:20:28 GMT+0800 (China Standard Time)

The problem seems to have to do with how the cluster is named:


chr21:5032760:5033408:clu_58152_+:ENSG00000277117

Originally the code took only the first segment after splitting on ":" (i.e. chr21), but it should use all the segments.
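
As a rough illustration of the difference, the sketch below groups a few IDs by key, assuming the ID is used as a per-gene grouping key as the `for g in genes_query.unique()` loop in the traceback suggests. The helpers `first_segment_key`, `full_key` and `group_by`, and the second and third IDs, are made up for this example; this is not the cugg code.

```python
from collections import defaultdict

def first_segment_key(trait_id: str) -> str:
    # Behaviour described above: keep only the text before the first ":".
    return trait_id.split(":")[0]

def full_key(trait_id: str) -> str:
    # Intended behaviour: keep every segment, i.e. the whole ID.
    return ":".join(trait_id.split(":"))

trait_ids = [
    "chr21:5032760:5033408:clu_58152_+:ENSG00000277117",  # from the issue
    "chr21:5032760:5034000:clu_58152_+:ENSG00000277117",  # hypothetical
    "chr21:6000000:6001000:clu_58153_-:ENSG00000000001",  # hypothetical
]

def group_by(ids, key_func):
    # Collect the IDs under whatever key the given function derives.
    groups = defaultdict(list)
    for trait_id in ids:
        groups[key_func(trait_id)].append(trait_id)
    return dict(groups)

by_first = group_by(trait_ids, first_segment_key)
by_full = group_by(trait_ids, full_key)

print(list(by_first))  # every ID falls under the single key "chr21"
print(len(by_full))    # each ID keeps its own key
```

With the first-segment key, every cluster on chr21 lands in one group, so rows belonging to different clusters are matched together. Keeping all segments gives each cluster its own group.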