icbi-lab / infercnvpy

Infer copy number variation (CNV) from scRNA-seq data. Plays nicely with Scanpy.

Home Page: https://infercnvpy.readthedocs.io/en/latest/

need at least one array to concatenate when doing cnv.tl.infercnv()

hyjforesight opened this issue

Hello infercnvpy,
Thanks for developing this amazing package!
I have 2 datasets. One is normal cells (WI), the other is query cells (ACI). Both datasets have been cleaned and embedded with UMAP in Scanpy. I merged them without BBKNN integration. Then I called cnv.tl.infercnv() with WI as the reference, but it raised the error 'need at least one array to concatenate'.

Could you please help me to solve this issue?
Thanks!
Best,
YJ

Here are my code and outputs:
ACI = sc.read('C:/Users/Park_Lab/Documents/ACI_sub2.h5ad')
ACI
AnnData object with n_obs × n_vars = 14497 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
WI = sc.read('C:/Users/Park_Lab/Documents/WI_sub.h5ad')
WI
AnnData object with n_obs × n_vars = 10681 × 5000
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
adata = ACI.concatenate(WI, batch_categories=['ACI', 'WI'])
adata
AnnData object with n_obs × n_vars = 25178 × 3362
obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps', 'batch'
var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'mt', 'rpl', 'rps', 'highly_variable', 'n_cells-ACI', 'n_cells_by_counts-ACI', 'mean_counts-ACI', 'pct_dropout_by_counts-ACI', 'total_counts-ACI', 'means-ACI', 'dispersions-ACI', 'dispersions_norm-ACI', 'mean-ACI', 'std-ACI', 'n_cells-WI', 'n_cells_by_counts-WI', 'mean_counts-WI', 'pct_dropout_by_counts-WI', 'total_counts-WI', 'means-WI', 'dispersions-WI', 'dispersions_norm-WI', 'mean-WI', 'std-WI'
obsm: 'X_pca', 'X_umap'
layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
sc.pl.umap(adata, color=['batch', 'leiden'], legend_loc='right margin', frameon=False, title='', use_raw=False)
adata.var['chromosome']=adata.var['Chromosome']
adata.var['start']=adata.var['Start']
adata.var['end']=adata.var['End']
cnv.tl.infercnv(adata, reference_key='batch', reference_cat='WI', n_jobs=16)

Package info:
scanpy==1.8.2 anndata==0.7.8 umap==0.5.2 numpy==1.20.3 scipy==1.7.3 pandas==1.3.4 scikit-learn==1.0.1 statsmodels==0.13.1 python-igraph==0.9.8 pynndescent==0.5.5
scvelo==0.2.4 scanpy==1.8.2 anndata==0.7.8 loompy==3.0.6 numpy==1.20.3 scipy==1.7.3 matplotlib==3.5.0 sklearn==1.0.1 pandas==1.3.4
cellrank==1.5.0 scanpy==1.8.2 anndata==0.7.8 numpy==1.20.3 numba==0.54.1 scipy==1.7.3 pandas==1.3.4 pygpcca==1.0.2 scikit-learn==1.0.1 statsmodels==0.13.1 python-igraph==0.9.8 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.5.0 seaborn==0.11.2

I realized this issue doesn't depend on whether I use merged data or not.
I tried a simple dataset. The same issue occurs.

adata = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad')
sc.pl.umap(adata, color=['leiden'], legend_loc='right margin', frameon=False, title='', use_raw=False)
adata.var['chromosome']=adata.var['Chromosome']
adata.var['start']=adata.var['Start']
adata.var['end']=adata.var['End']
cnv.tl.infercnv(adata, n_jobs=16)

Did you annotate the genomic positions with io.genomic_position_from_gtf?

If that's the issue, I should probably implement a better error message.
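
A clearer failure could come from a pre-flight check along these lines. This is only a sketch: `missing_position_columns` is a hypothetical helper, not part of infercnvpy, and the column names `chromosome`, `start` and `end` are taken from the assignments earlier in this thread.

```python
# Hypothetical helper, not part of infercnvpy.
# Column names are assumed from the adata.var assignments shown in this thread.
REQUIRED_COLUMNS = ("chromosome", "start", "end")


def missing_position_columns(columns):
    """Return the required genomic-position columns absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Column names as printed for adata.var above (note the capitalised spelling).
var_columns = ["Accession", "Chromosome", "End", "Start", "Strand"]
missing = missing_position_columns(var_columns)
if missing:
    print(f"adata.var is missing genomic position columns: {missing}")
```

With a real AnnData object, the call would be `missing_position_columns(adata.var.columns)`. The check only looks at column names; it does not validate the values stored in them.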

Hello @grst,
Thanks for the response.
Sorry, I didn't notice that we need to run io.genomic_position_from_gtf beforehand.
Because we're using mouse scRNA-seq data, I made the genomic position file myself following the inferCNV instructions: https://github.com/broadinstitute/inferCNV/wiki/instructions-create-genome-position-file. The output is a txt file.
[screenshot: genomic position file]
Then I ran cnv.io.genomic_position_from_gtf(adata, gtf_file='C:/Users/Park_Lab/Documents/mouse_genomic_position2.txt', gtf_gene_id='gene_name', inplace=True), but got an error.
[screenshot: error message]
Changing the extension from txt to gtf gave the same error.
[screenshot: error message]

There's no need to run that script from the inferCNV repository. You need to use a GTF file directly, for instance the GENCODE mouse M28 annotation.

Hello @grst,
Whether I use the zipped or the unzipped gencode.vM28.annotation.gtf file, I get the same error: 'genomic_position_from_gtf() got multiple values for argument 'gtf_file''.
[screenshots: error message]

I think you flipped the order of the gtf_file and adata arguments

Hello @grst ,
Please check this. I changed the order of the gtf_file and adata. Still got errors.
Thanks!
Best
YJ

adata = sc.read('C:/Users/Park_Lab/Documents/ACT_sub2.h5ad')
adata
AnnData object with n_obs × n_vars = 2636 × 5000
    obs: 'leiden', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_rpl', 'pct_counts_rpl', 'total_counts_rps', 'pct_counts_rps'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'n_cells', 'mt', 'rpl', 'rps', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'leiden', 'leiden_colors', 'neighbors', 'pca', 'rank_genes_groups', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'
    obsp: 'connectivities', 'distances'
cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata, gtf_gene_id='gene_name', inplace=True)
File "C:\Users\Park_Lab\AppData\Local\Temp/ipykernel_19252/1901209706.py", line 1
    cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata.var, gtf_gene_id='gene_name', inplace=True)
                                                                                                         ^
SyntaxError: positional argument follows keyword argument

Now we are getting at some basic Python language features that are independent of infercnvpy:
https://stackoverflow.com/questions/16932825/why-cant-non-default-arguments-follow-default-arguments
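
Both errors reported in this thread can be reproduced without infercnvpy. The sketch below uses `annotate`, a stand-in function whose signature (with `gtf_file` as the first parameter) is an assumption chosen to match the error messages above; it is not the library's implementation.

```python
def annotate(gtf_file, adata=None, *, gtf_gene_id="gene_name", inplace=True):
    """Stand-in for illustration only; it just echoes its arguments."""
    return {"gtf_file": gtf_file, "adata": adata, "gtf_gene_id": gtf_gene_id}


fake_adata = object()  # placeholder for an AnnData object

# 1) Passing adata first positionally fills the `gtf_file` slot, so the
#    keyword gtf_file=... becomes a second value for the same parameter.
try:
    annotate(fake_adata, gtf_file="annotation.gtf")
except TypeError as err:
    multiple_values_error = str(err)
    print(multiple_values_error)

# 2) A positional argument after a keyword argument is rejected when the
#    source is parsed, before the function is ever called.
try:
    compile('annotate(gtf_file="annotation.gtf", fake_adata)', "<demo>", "eval")
except SyntaxError as err:
    syntax_error = err.msg
    print(syntax_error)

# 3) Naming both arguments avoids any dependence on their order.
result = annotate(gtf_file="annotation.gtf", adata=fake_adata)
```

Passing the arguments positionally in the declared order, `annotate("annotation.gtf", fake_adata)`, works as well; keywords are simply harder to get wrong.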

Hello @grst,
Thanks for the reminder!
It works now with cnv.io.genomic_position_from_gtf(gtf_file='C:/Users/Park_Lab/Documents/gencode.vM28.annotation.gtf', adata=adata, gtf_gene_id='gene_name', inplace=True).
I sent you an email to confirm your address. We'll include an acknowledgment when we publish our data.
Thanks!
Best,
YJ

I found that passing adata as a keyword argument (adata=adata) is needed.

I had the same issue. I annotated the gene positions myself since the built-in method was very slow.

The problem arises when no values starting with "chr" exist in the chromosome column of adata.var. It was solved after adding the 'chr' prefix to the values.

Maybe one should make infercnvpy allow for chromosome annotations like "1,2,3,...,X,Y" as well.
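
The prefix fix described above could look roughly like this. It is a minimal sketch: `add_chr_prefix` is a hypothetical helper, and the sample values are made up for illustration.

```python
def add_chr_prefix(name):
    """Return a chromosome name with a 'chr' prefix, adding it only if absent."""
    name = str(name)
    return name if name.startswith("chr") else f"chr{name}"


# Made-up sample values mixing both naming styles.
chromosomes = ["1", "2", "X", "chrY", 19]
normalized = [add_chr_prefix(c) for c in chromosomes]
print(normalized)  # ['chr1', 'chr2', 'chrX', 'chrY', 'chr19']
```

On a real object this would be applied as `adata.var["chromosome"] = adata.var["chromosome"].map(add_chr_prefix)`, assuming the column holds no missing values (a NaN would otherwise turn into the string 'chrnan'). Mitochondrial naming also differs between annotations ("MT" versus "chrM"), so that entry may need separate handling.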

seconding this! was very easy to fix, but discovering the issue took a minute...would be super helpful if the docstring specified that 'chr' is a necessary prefix to chromosome names :) thanks everyone, love the package