icbi-lab / infercnvpy

Infer copy number variation (CNV) from scRNA-seq data. Plays nicely with Scanpy.

Home Page:https://infercnvpy.readthedocs.io/en/latest/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ValueError: need at least one array to concatenate

mibo1996 opened this issue · comments

Hi,

I am trying to use infercnvpy on my single-cell dataset, but when I run:
cnv.tl.infercnv(
adata,
reference_key="leiden_1.0",
reference_cat='12',
)

I get the error message: ValueError: need at least one array to concatenate.

When I run:
adata.var.loc[:, ["gene_ids", "chromosome", "start", "end"]].head()
I can see that these columns were successfully added to adata.var

When I run:
adata.obs.head()
I can see that there is a column, 'leiden_1.0' with clusters numbered '0'-'16'.

I'm not sure what this error message means or what I can do differently to fix it.

Thank you

I am also getting the same error, running:
infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_cat=['0','1'], window_size=200)

ValueError: need at least one array to concatenate

seu_epi
AnnData object with n_obs × n_vars = 5136 × 16784
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mito', 'nCount_SCT', 'nFeature_SCT', 'S.Score', 'G2M.Score', 'Phase', 'CC.Difference', 'SCT_snn_res.0.8', 'seurat_clusters', 'tree.ident', 'condition', 'patientID', 'pANN', 'DF', 'SCT_snn_res.1', 'SCT_snn_res.0.6'
var: 'sct.detection_rate', 'sct.gmean', 'sct.variance', 'sct.residual_mean', 'sct.residual_variance', 'sct.variable', 'chromosome', 'start', 'end', 'gene_id', 'gene_name'
uns: 'neighbors'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'distances'

seu_epi.X
array([[-0.12743472, -0.12882212, -0.06847287, ..., -0.03828714,
-0.01745232, -0.03993058],
[-0.10143117, -0.10215754, -0.0520843 , ..., -0.0263906 ,
-0.00942162, -0.02850649],
[-0.1182593 , -0.11941153, -0.06270597, ..., -0.03406237,
-0.01457542, -0.03586897],
...,
[-0.10151398, -0.10224243, -0.05213675, ..., -0.02642813,
-0.00944661, -0.02854246],
[-0.06104027, -0.06079169, -0.02616703, ..., -0.00829963,
0.00237056, -0.01121682],
[-0.07975152, -0.07994411, -0.03827017, ..., -0.01663157,
-0.00299579, -0.01916628]])

seu_epi.obs.loc[:, ['seurat_clusters','condition']].head()
seurat_clusters condition
AACTGGTCATTCACTT-1 17 tumor_appendix
ACGCCGATCTGAAAGA-1 3 tumor_appendix
ACTGATGAGCACCGTC-1 3 tumor_appendix
AGCATACTCTACTCAT-1 10 tumor_appendix
CAACCTCCACAACTGT-1 7 tumor_appendix

Hi,

thanks for reporting, I'll look into it.
Could you please provide the full stacktrace of the error (together with the error message, the line numbers in the code where the error occurs should show)? That would be extremely helpful for debugging!

Hi Gregor - thanks for looking into this! Stacktrace below:

/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/anndata/c                                            ompat/__init__.py:180: FutureWarning: Moving element from .uns['neighbors']['dis                                                     tances'] to .obsp['distances'].

This is where adjacency matrices should go now.
  warn(
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name']
WARNING: GTF file misses annotation for 13 genes in adata.
WARNING: Skipped 13 genes because of duplicate identifiers in GTF file.
... storing 'orig.ident' as categorical
... storing 'Phase' as categorical
... storing 'seurat_clusters' as categorical
... storing 'condition' as categorical
... storing 'patientID' as categorical
... storing 'DF' as categorical
... storing 'chromosome' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical
WARNING: Skipped 26 genes because they don't have a genomic position annotated.
/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/numpy/lib                                                     /arraysetops.py:583: FutureWarning: elementwise comparison failed; returning sca                                                     lar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

Traceback (most recent call last):
  File "run_infercnv.py", line 18, in <module>
    infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_c                                                     at=[0,1], window_size=200)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/i                                                     nfercnvpy/tl/_infercnv.py", line 164, in infercnv
    reference = _get_reference(tmp_adata, reference_key, reference_cat, referenc                                                     e)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/i                                                     nfercnvpy/tl/_infercnv.py", line 305, in _get_reference
    raise ValueError(
ValueError: The following reference categories were not found in adata.obs[refer                                                     ence_key]: [0 1]
(/venvs/anaconda/envs/groupenvs/infercnv-py) sgrimes@ikura:C03_210423_inferCNV_p                                                     yTest$ python run_infercnv.py
/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/anndata/c                                                     ompat/__init__.py:180: FutureWarning: Moving element from .uns['neighbors']['dis                                                     tances'] to .obsp['distances'].

This is where adjacency matrices should go now.
  warn(
INFO:root:Extracted GTF attributes: ['gene_id', 'gene_name']
WARNING: GTF file misses annotation for 13 genes in adata.
WARNING: Skipped 13 genes because of duplicate identifiers in GTF file.
... storing 'orig.ident' as categorical
... storing 'Phase' as categorical
... storing 'seurat_clusters' as categorical
... storing 'condition' as categorical
... storing 'patientID' as categorical
... storing 'DF' as categorical
... storing 'chromosome' as categorical
... storing 'gene_id' as categorical
... storing 'gene_name' as categorical
WARNING: Skipped 26 genes because they don't have a genomic position annotated.
  0%|                                                                                                          | 0/2 [00:11<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 353, in _infercnv_chunk
    chr_pos, x_smoothed = _running_mean_by_chromosome(
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 273, in _running_mean_by_chromosome
    return chr_start_pos, np.hstack(running_means)
  File "<__array_function__ internals>", line 5, in hstack
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/numpy/core/shape_base.py", line 346, in hstack
    return _nx.concatenate(arrs, 1)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_infercnv.py", line 18, in <module>
    infercnvpy.tl.infercnv(seu_epi, reference_key='seurat_clusters', reference_cat=['0','1'], window_size=200)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/infercnvpy/tl/_infercnv.py", line 169, in infercnv
    *process_map(
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 130, in process_map
    return _executor_map(ProcessPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/venvs/anaconda/envs/groupenvs/infercnv-py/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
ValueError: need at least one array to concatenate

In case it factors into the troubleshooting, my AnnData object started out as a Seurat 3.x object, updated to Seurat 4.0, then converted to hd5, then to AnnData.

Can you please post the result of

adata.obs["chromosome"].unique()

right before running tl.infercnv?

I assume you mean adata.var["chromosome"].unique()? Results below. I see that there is a nan entry, probably due to the duplicate genes being removed. Is that the issue?

seu_epi.var["chromosome"].unique()
array(['1', nan, '2', '3', '4', '5', '6', '7', 'X', '8', '9', '11', '10',
'12', '13', '14', '15', '16', '17', '18', '20', '19', 'Y', '22',
'21', 'MT', 'GL000194.1', 'GL000195.1', 'GL000219.1', 'KI270734.1',
'GL000218.1', 'KI270726.1', 'KI270711.1', 'KI270721.1'],
dtype=object)

Thanks! The problem is that I expected all chromosomes (as opposed to scaffolds) to start with chr.

chromosomes = _natural_sort(
[x for x in var["chromosome"].unique() if x.startswith("chr") and x != "chrM"]
)

What GTF file are you using? I only tried with GENCODE I think.

Ah, glad it's an easy explanation! I'm using the gtf file provided for cellranger by 10X genomics. I think it is ensembl. The first few lines below:

#!genome-build GRCh38.p12

#!genome-version GRCh38

#!genome-date 2013-12

#!genome-build-accession NCBI:GCA_000001405.27

#!genebuild-last-updated 2018-01

1 havana gene 29554 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"

1 havana transcript 29554 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"

1 havana exon 29554 30039 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001947070"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana exon 30564 30667 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "2"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001922571"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana exon 30976 31097 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000473358"; transcript_version "1"; exon_number "3"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-202"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001827679"; exon_version "1"; tag "basic"; transcript_support_level "5"

1 havana transcript 30267 31109 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000469289"; transcript_version "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-201"; transcript_source "havana"; transcript_biotype "lincRNA"; tag "basic"; transcript_support_level "5"

1 havana exon 30267 30667 . + . gene_id "ENSG00000243485"; gene_version "5"; transcript_id "ENST00000469289"; transcript_version "1"; exon_number "1"; gene_name "MIR1302-2HG"; gene_source "havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-2HG-201"; transcript_source "havana"; transcript_biotype "lincRNA"; exon_id "ENSE00001841699"; exon_version "1"; tag "basic"; transcript_support_level "5"

Ok, I'll look into improving the genomic_position_from_gtf function at some point that it supports other GTF files as well.
Let's keep the issue open until then.

In the meanwhile, you could try a GENCODE file, or just modify the chromosome column in adata.var such that it has the chr prefix.

ok got it. I will try one of those two things. Thanks for being so responsive looking into this!

In the meantime a small and dirty trick:
adata.var['chromosome'] = ['chr'+str(i) for i in adata.var['chromosome']]

Then it worked for me.