How to integrate multiple single-cell datasets

Question

How to integrate multiple single-cell datasets

ixxmu opened this issue 3 months ago · comments

ixxmu commented 3 months ago

https://mp.weixin.qq.com/s/zt0BzAmqzsxNqhPnDky0gA

ixxmu · Answer 1 · Fri Feb 23 2024 10:43:38 GMT+0800 (China Standard Time)

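How to integrate multiple single-cell datasets, by 生信技能树

A student was working on dataset GSE152938, which contains 5 samples, only one of which is normal tissue. Because the groups are unbalanced, normal-tissue samples from other datasets need to be brought in, but the student did not know how to do this in R.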

Dataset (GSE152938)

The files of dataset GSE152938 are laid out as follows:

[Figure: file layout of GSE152938]

This dataset can be batch-read with the following code:

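library(Seurat)  # the layer accessors below assume Seurat v5

dir = 'GSE152938_RAW'
samples = list.files(dir)  # one entry per sample
samples
sceList = lapply(samples, function(pro){
  # pro = samples[1]  # uncomment to debug a single sample
  print(pro)
  tmp = Read10X(file.path(dir, pro))
  # Read10X() returns a list when the input holds several feature types;
  # in that case keep the first element
  if(is.list(tmp)){
    ct = tmp[[1]]
  }else{ct = tmp}
  sce = CreateSeuratObject(counts = ct,
                           project = pro,
                           min.cells = 5,
                           min.features = 300)
  return(sce)
})
do.call(rbind, lapply(sceList, dim))  # features x cells for each sample
# Merge all samples, prefixing each cell name with its sample name
sce.all = merge(x = sceList[[1]],
                y = sceList[-1],
                add.cell.ids = samples)
names(sce.all@assays$RNA@layers)  # one counts layer per sample before joining
sce.all[["RNA"]]$counts
# Alternate accessor function with the same result
LayerData(sce.all, assay = "RNA", layer = "counts")
sce.all <- JoinLayers(sce.all)  # combine the per-sample layers into one
dim(sce.all[["RNA"]]$counts)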
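
Before running the batch read above, it can help to confirm that every sample folder holds the files Read10X() looks for. The sketch below uses only base R; the helper name `check_10x_layout` is made up for illustration, and the gzipped file names are an assumption (older Cell Ranger output uses uncompressed `barcodes.tsv`, `genes.tsv`, and `matrix.mtx` instead), since the actual layout of GSE152938_RAW is not shown here.

```r
# Hypothetical helper: report which sample folders under `dir` contain all of
# the files listed in `required` (assumed gzipped 10x file names).
check_10x_layout <- function(dir,
                             required = c("barcodes.tsv.gz",
                                          "features.tsv.gz",
                                          "matrix.mtx.gz")) {
  # Immediate sub-directories only, one per sample
  samples <- list.dirs(dir, full.names = FALSE, recursive = FALSE)
  complete <- vapply(samples, function(s) {
    all(file.exists(file.path(dir, s, required)))
  }, logical(1))
  data.frame(sample = samples,
             complete = unname(complete),
             stringsAsFactors = FALSE)
}

# Usage (path as in the text above):
# check_10x_layout("GSE152938_RAW")
```

Folders reported as incomplete would need their files renamed or moved before the lapply() call can read them.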

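The student supplied a dataset of normal human kidney tissue (GSE131685), which we read in as well. Its files are laid out as follows:

[Figure: file layout of GSE131685]

It is read in the same way: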

dir='GSE131685_RAW/outputs/'
samples=list.files( dir )
samples

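Once each of these two datasets has been run independently through our standard code for dimensionality reduction, clustering, and grouping, each project will contain a 2-harmony/ folder holding the file sce.all_int.rds.

The two datasets can then be merged with the following code: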

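library(Seurat)  # the layer accessors below assume Seurat v5

# Load the per-dataset objects produced by the standard workflow
GSE131685 = readRDS('../2020-GSE131685-3个正常人的肾单细胞/2-harmony/sce.all_int.rds')
GSE131685$study = 'GSE131685'
table(GSE131685$orig.ident)
GSE152938 = readRDS('../2021-GSE152938-肾癌/2-harmony/sce.all_int.rds')
GSE152938$study = 'GSE152938'
table(GSE152938$orig.ident)
# Rebuild a fresh object from each raw count matrix. Only the counts are
# carried over, so earlier metadata (including `study`), reductions and
# clusters are left behind.
sceList = list(
  GSE131685 = CreateSeuratObject(
    counts = GSE131685@assays$RNA$counts
  ),
  GSE152938 = CreateSeuratObject(
    counts = GSE152938@assays$RNA$counts
  )
)
# Merge, prefixing every cell name with its study ID
sce.all = merge(x = sceList[[1]],
                y = sceList[-1],
                add.cell.ids = c('GSE131685', 'GSE152938'))
names(sce.all@assays$RNA@layers)  # one counts layer per dataset before joining
sce.all[["RNA"]]$counts
# Alternate accessor function with the same result
LayerData(sce.all, assay = "RNA", layer = "counts")
sce.all <- JoinLayers(sce.all)  # combine the per-dataset layers into one
dim(sce.all[["RNA"]]$counts)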
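
Because the merged object is rebuilt from count matrices alone, a per-cell study label is not part of its metadata. One way to get it back is to read the prefix that `add.cell.ids` puts in front of each cell name. The sketch below is base R; the helper name `study_from_cell_names` and the example barcodes are made up, and it assumes the study IDs themselves contain no underscore.

```r
# Hypothetical helper: return everything before the first underscore of each
# cell name, which is the study prefix when add.cell.ids was used in merge().
study_from_cell_names <- function(cell_names) {
  sub("_.*$", "", cell_names)
}

# Made-up cell names in the assumed "<study>_<sample>_<barcode>" shape
cells <- c("GSE131685_GSM4145204_AAACCTGAGATATGCA-1",
           "GSE152938_GSM4630031_TTTGTCATCCGCATCT-1")
study_labels <- study_from_cell_names(cells)
print(study_labels)

# On the merged object this could be applied as follows
# (assumes `sce.all` exists as built in the text):
# sce.all$study <- study_from_cell_names(colnames(sce.all))
# table(sce.all$study)
```

With the label stored as a metadata column, the study can serve as a grouping variable in later steps such as batch correction or plotting.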

As the output shows, the two datasets contain different numbers of samples, and after merging they form one large object:

> table(GSE131685$orig.ident)

kidney1 kidney2 kidney3 
   6856    5134   10532 
   
> table(GSE152938$orig.ident)

ccRCC1 ccRCC2  chRCC Normal   pRCC 
  9478   9317   5187   1395  11395 
  
> table(sce.all$orig.ident) 

GSM4145204 GSM4145205 GSM4145206 GSM4630027 GSM4630028 GSM4630029 GSM4630030 GSM4630031 
      6856       5134      10532      11395       9478       9317       5187       1395
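
To see how much the merge improves the group balance, the counts printed above can be tallied by tissue group. This is a base-R sketch: the pairing of GSM IDs with sample labels is an assumption inferred by matching the cell counts across the three tables, and should be checked against the GEO records before it is relied on.

```r
# Cell counts per GSM sample, copied from table(sce.all$orig.ident) above
cells_per_sample <- c(GSM4145204 = 6856, GSM4145205 = 5134, GSM4145206 = 10532,
                      GSM4630027 = 11395, GSM4630028 = 9478, GSM4630029 = 9317,
                      GSM4630030 = 5187, GSM4630031 = 1395)

# Assumed labels: matched to the per-dataset tables by identical cell counts
sample_info <- data.frame(
  gsm   = names(cells_per_sample),
  label = c("kidney1", "kidney2", "kidney3",
            "pRCC", "ccRCC1", "ccRCC2", "chRCC", "Normal"),
  group = c("normal", "normal", "normal",
            "tumor", "tumor", "tumor", "tumor", "normal"),
  cells = unname(cells_per_sample),
  stringsAsFactors = FALSE
)

samples_per_group <- table(sample_info$group)
cells_per_group <- tapply(sample_info$cells, sample_info$group, sum)
print(samples_per_group)  # normal: 4 samples, tumor: 4 samples
print(cells_per_group)    # normal: 23917 cells, tumor: 35377 cells
```

Under that assumed mapping, the merged object has four normal and four tumor samples, compared with one normal and four tumor samples in GSE152938 alone.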

The merged object can of course then be run through our standard code for dimensionality reduction, clustering, and grouping (link: https://pan.baidu.com/s/1pKEnPmWXi-pTab0WZUWzgg?pwd=a7s1).

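So in principle this technique can handle any number of single-cell transcriptome datasets from different sources, with no need to worry about their file formats: each dataset is first processed on its own, and the resulting objects are then assembled into a sceList and merged.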
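
The same pattern can be wrapped in a function that accepts any number of datasets. The following is a minimal sketch, not the authors' standard code: the helper name `merge_studies` is hypothetical, it assumes Seurat v5, and it assumes each .rds file holds a Seurat object whose RNA assay has a single joined `counts` layer. Unlike the code above, it also records the study name in a `study` metadata column.

```r
library(Seurat)  # assumes Seurat v5 (layer-based assays)

# Hypothetical helper: merge any number of already-processed datasets.
# `rds_paths` is a named character vector: names are study labels,
# values are paths to Seurat objects saved with saveRDS().
merge_studies <- function(rds_paths) {
  stopifnot(!is.null(names(rds_paths)), length(rds_paths) >= 2)
  obj_list <- lapply(names(rds_paths), function(study) {
    obj <- readRDS(rds_paths[[study]])
    # Rebuild a minimal object from the raw counts so that study-specific
    # reductions and clusters are not carried into the merged object
    counts <- LayerData(obj, assay = "RNA", layer = "counts")
    new_obj <- CreateSeuratObject(counts = counts)
    new_obj$study <- study  # keep track of where each cell came from
    new_obj
  })
  merged <- merge(x = obj_list[[1]],
                  y = obj_list[-1],
                  add.cell.ids = names(rds_paths))
  JoinLayers(merged)  # collapse the per-study count layers into one
}

# Usage with placeholder paths (not real files):
# sce.all <- merge_studies(c(studyA = "studyA/2-harmony/sce.all_int.rds",
#                            studyB = "studyB/2-harmony/sce.all_int.rds"))
```

Keeping the study label in the metadata makes it possible to check afterwards how many cells each source contributed, for example with `table(sce.all$study)`.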

