need at least one array to concatenate when doing cnv.tl.infercnv()

Question

hyjforesight opened this issue 3 years ago · comments

Hello infercnvpy,
Thanks for developing this amazing package!
I have two datasets: one contains normal cells (WI) and the other contains query cells (ACI). Both have been cleaned and embedded with UMAP in Scanpy. I merged them without BBKNN integration, then called cnv.tl.infercnv() with WI as the reference, but it fails with the error 'need at least one array to concatenate'.

Could you please help me solve this issue?
Thanks!
Best,
YJ

Here is my code and output:
ACI = sc.read('C:/Users/Park_Lab/Documents/ACI_sub2.h5ad')
ACI
AnnData object with n_obs × n_vars = 14497 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
WI = sc.read('C:/Users/Park_Lab/Documents/WI_sub.h5ad')
WI
AnnData object with n_obs × n_vars = 10681 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
adata = ACI.concatenate(WI, batch_categories=['ACI', 'WI'])
adata
AnnData object with n_obs × n_vars = 25178 × 3362
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps', 'batch'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'mt', 'rpl', 'rps', 'highly_variable', 'n_cells-ACI', 'n_cells_by_counts-ACI', 'mean_counts-ACI', 'pct_dropout_by_counts-ACI', 'total_counts-ACI', 'means-ACI', 'dispersions-ACI', 'dispersions_norm-ACI', 'mean-ACI', 'std-ACI', 'n_cells-WI', 'n_cells_by_counts-WI', 'mean_counts-WI', 'pct_dropout_by_counts-WI', 'total_counts-WI', 'means-WI', 'dispersions-WI', 'dispersions_norm-WI', 'mean-WI', 'std-WI'
obsm: 'X_pca', 'X_umap'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
sc.pl.umap(adata, color=['batch', 'leiden'], legend_loc='right margin', frameon=False, title='', use_raw=False)

adata.var['chromosome']=adata.var['Chromosome']
adata.var['start']=adata.var['Start']
adata.var['end']=adata.var['End']
cnv.tl.infercnv(adata, reference_key='batch', reference_cat='WI', n_jobs=16)

Package info:
scanpy==1.8.2 anndata==0.7.8 umap==0.5.2 numpy==1.20.3 scipy==1.7.3 pandas==1.3.4 scikit-learn==1.0.1 statsmodels==0.13.1 python-igraph==0.9.8 pynndescent==0.5.5
scvelo==0.2.4 scanpy==1.8.2 anndata==0.7.8 loompy==3.0.6 numpy==1.20.3 scipy==1.7.3 matplotlib==3.5.0 sklearn==1.0.1 pandas==1.3.4
cellrank==1.5.0 scanpy==1.8.2 anndata==0.7.8 numpy==1.20.3 numba==0.54.1 scipy==1.7.3 pandas==1.3.4 pygpcca==1.0.2 scikit-learn==1.0.1 statsmodels==0.13.1 python-igraph==0.9.8 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.5.0 seaborn==0.11.2

Yuanjian Huang · Answer 1 · Fri Dec 17 2021 13:49:01 GMT+0800 (China Standard Time)

I realized this issue doesn't depend on whether I use merged data or not.
I tried a single dataset and the same issue occurs.

adata = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad')
sc.pl.umap(adata, color=['leiden'], legend_loc='right margin', frameon=False, title='', use_raw=False)
adata.var['chromosome']=adata.var['Chromosome']
adata.var['start']=adata.var['Start']
adata.var['end']=adata.var['End']
cnv.tl.infercnv(adata, n_jobs=16)

Gregor Sturm · Answer 2 · Fri Dec 17 2021 19:47:27 GMT+0800 (China Standard Time)

Did you annotate the genomic positions with io.genomic_position_from_gtf?

If that's the issue, I should probably implement a better error message.
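A clearer error message could come from a small pre-flight check on adata.var before calling cnv.tl.infercnv(). The sketch below is only an illustration, not infercnvpy's code: the helper name check_genomic_positions is hypothetical, the column names are the ones used in this thread, and the "chr" prefix requirement is an assumption taken from later replies here. It accepts any mapping of column name to values, so a plain dict works for testing, and a pandas DataFrame such as adata.var should behave the same way.

```python
# Hypothetical pre-flight check; not part of infercnvpy.
REQUIRED_COLUMNS = ("chromosome", "start", "end")  # column names used in this thread


def check_genomic_positions(var):
    """Return the sorted usable chromosome names, or raise a descriptive error.

    `var` is any mapping from column name to an iterable of values,
    e.g. a dict or a pandas DataFrame such as adata.var.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in var]
    if missing:
        raise ValueError(
            f"adata.var is missing genomic position columns: {missing}"
        )

    # Assumption (from later replies in this thread): only chromosome
    # names that start with "chr" are picked up by the CNV inference.
    usable = sorted({str(c) for c in var["chromosome"] if str(c).startswith("chr")})
    if not usable:
        raise ValueError(
            "No chromosome name starts with 'chr'; "
            "expected values such as 'chr1', 'chr2', 'chrX'."
        )
    return usable


# Example with a plain dict standing in for adata.var
example_var = {
    "chromosome": ["chr1", "chr2", "chr1"],
    "start": [100, 200, 300],
    "end": [150, 250, 350],
}
usable_chromosomes = check_genomic_positions(example_var)
```

Running such a check first turns the opaque 'need at least one array to concatenate' failure into a message that names the actual problem.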

Yuanjian Huang · Answer 3 · Sat Dec 18 2021 06:44:16 GMT+0800 (China Standard Time)

Hello @grst,
Thanks for the response.
Sorry, I didn't notice that we need to run io.genomic_position_from_gtf beforehand.
Because we're using mouse scRNA-seq data, I made the genomic position file myself following the inferCNV instructions at https://github.com/broadinstitute/inferCNV/wiki/instructions-create-genome-position-file. The output is a txt file. It looks like below.

Then I ran cnv.io.genomic_position_from_gtf(adata, gtf_file='C:/Users/Park_Lab/Documents/mouse_genomic_position2.txt', gtf_gene_id='gene_name', inplace=True), but got errors.

Changing the file extension from txt to gtf gives the same errors.

Gregor Sturm · Answer 4 · Mon Dec 20 2021 15:55:35 GMT+0800 (China Standard Time)

There's no need to run that script from the inferCNV repository. You need to directly use a GTF file, for instance the Mouse M28 GENCODE.

Yuanjian Huang · Answer 5 · Mon Dec 20 2021 23:57:34 GMT+0800 (China Standard Time)

Hello @grst,
Whether I use the zipped or unzipped gencode.vM28.annotation.gtf file, I get the same error: 'genomic_position_from_gtf() got multiple values for argument 'gtf_file''. Please check below.

Gregor Sturm · Answer 6 · Tue Dec 21 2021 15:39:28 GMT+0800 (China Standard Time)

I think you flipped the order of the gtf_file and adata arguments

Yuanjian Huang · Answer 7 · Tue Dec 21 2021 23:52:11 GMT+0800 (China Standard Time)

Hello @grst ,
Please check this. I changed the order of gtf_file and adata, but I still get errors.
Thanks!
Best
YJ

adata = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad')
adata
AnnData object with n_obs × n_vars = 2636 × 5000
    obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'
cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata, gtf_gene_id='gene_name', inplace=True)
File "C:\Users\Park_Lab\AppData\Local\Temp/ipykernel_19252/1901209706.py", line 1
    cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata.var, gtf_gene_id='gene_name', inplace=True)
                                                                                                         ^
SyntaxError: positional argument follows keyword argument

Gregor Sturm · Answer 8 · Wed Dec 22 2021 16:06:52 GMT+0800 (China Standard Time)

Now we are getting at some basic Python language features that are independent of infercnvpy:
https://stackoverflow.com/questions/16932825/why-cant-non-default-arguments-follow-default-arguments
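The two errors seen in this thread come from how Python binds arguments. The sketch below reproduces both with a hypothetical stand-in function, annotate, whose parameter order (gtf_file first, adata second) is inferred from the error messages above and is not copied from infercnvpy. The string placeholders for adata and the GTF path are likewise only for illustration.

```python
def annotate(gtf_file, adata=None, *, gtf_gene_id="gene_name", inplace=True):
    """Hypothetical stand-in: only reports how its arguments were bound."""
    return {
        "gtf_file": gtf_file,
        "adata": adata,
        "gtf_gene_id": gtf_gene_id,
        "inplace": inplace,
    }


adata = "<AnnData placeholder>"
gtf = "gencode.vM28.annotation.gtf"

# Works: positional arguments in the declared order, keywords afterwards.
ok_positional = annotate(gtf, adata, gtf_gene_id="gene_name")

# Works: all keywords, so the order no longer matters.
ok_keywords = annotate(adata=adata, gtf_file=gtf, gtf_gene_id="gene_name")

# Fails at call time: adata is bound positionally to gtf_file,
# and then gtf_file is supplied a second time by keyword.
try:
    annotate(adata, gtf_file=gtf)
except TypeError as err:
    swapped_error = str(err)

# Fails before running: a positional argument may not follow a keyword one.
# The call is compiled from a string so that this example file itself parses.
try:
    compile("annotate(gtf_file=gtf, adata)", "<example>", "eval")
except SyntaxError as err:
    syntax_error = err.msg
```

Passing every argument by keyword, as in ok_keywords, avoids both problems regardless of the declared order.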

Yuanjian Huang · Answer 9 · Fri Dec 24 2021 07:22:43 GMT+0800 (China Standard Time)

Hello @grst,
Thanks for the reminder!
It works now with cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata=adata, gtf_gene_id='gene_name', inplace=True).
I sent you an email to confirm your address. We'll include an acknowledgment when we publish our data.
Thanks!
Best,
YJ

Faming Zhao · Answer 10 · Mon May 02 2022 16:57:06 GMT+0800 (China Standard Time)

I also find that passing adata as a keyword argument ("adata=adata") is needed.

Stefan Peidli · Answer 11 · Wed Mar 15 2023 07:12:56 GMT+0800 (China Standard Time)

I had the same issue. I annotated the gene positions myself since the built-in method was very slow.

The problem arises when no values starting with "chr" exist in the chromosome column of adata.var. The problem was solved after adding the 'chr' prefix to the values.

Maybe one should make infercnvpy allow for chromosome annotations like "1,2,3,...,X,Y" as well.
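The fix described above can be sketched as a small helper that adds the prefix only where it is missing. The name add_chr_prefix is hypothetical and the code is an illustration, not part of infercnvpy. Note that it does not rename mitochondrial labels, so "MT" would become "chrMT" rather than "chrM".

```python
def add_chr_prefix(name):
    """Return the chromosome name with a 'chr' prefix, adding it if absent."""
    # Leave missing values (None or NaN) untouched; NaN is the only
    # value that is not equal to itself.
    if name is None or name != name:
        return name
    name = str(name)
    return name if name.startswith("chr") else "chr" + name


# Example on plain Python values
raw_names = ["1", "2", "X", "chr3", 7]
fixed_names = [add_chr_prefix(n) for n in raw_names]

# With an AnnData object, one way to apply it (assuming the column is
# called "chromosome", as in this thread) would be:
#   adata.var["chromosome"] = adata.var["chromosome"].map(add_chr_prefix)
```

Because already-prefixed names pass through unchanged, applying the helper twice is harmless.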

daniel-ranti · Answer 12 · Mon Apr 24 2023 21:57:43 GMT+0800 (China Standard Time)

Seconding this! It was very easy to fix, but discovering the issue took a minute... It would be super helpful if the docstring specified that 'chr' is a necessary prefix for chromosome names :) Thanks everyone, love the package.