xiahui625649 / MingPCACluster

EM Gaussian mixture model clustering

MingPCACluster

A simple and efficient program for PCA and clustering of population VCF files or STOmics gem files

1 Introduction

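MingPCACluster is a PCA and clustering program built around VCF input. It also accepts Genotype and similar formats, and includes a beta function for spatiotemporal single-cell expression files (xx.gem.gz). Given a single supported input file, it runs PCA, clustering and group-colored plotting in one step.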

keyword : VCF2PCA ;     VCF2Kinship ;     cluster;     k-means ;     cellbin ;    STOmics


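Highlights:
1 The results match those of tassel, gapit and gcta; the only difference is numerical precision.
2 Functions: (1) several kinds of kinship matrix, (2) PCA results, (3) clustering results, and (4) plots colored by cluster.
3 One VCF in, everything done in one step, which keeps it convenient to use.
4 Sites are processed as they are read, so memory use does not depend on the number of sites (for spatiotemporal data, on the number of genes) and depends only on the number of samples. Runs with more than 100k samples should therefore be feasible. PCA and clustering for spatiotemporal cells are built on this basis, although in spatiotemporal omics the large dimension is mostly the sample count (80K samples need about 60G of memory).
5 Kmean and DBSCAN clustering. The Kmean analysis finds the best K, which plays the same role as K in Structure, and the plots are colored by the resulting clusters.
6 A small plotting script is provided and can be used to tune the figures.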
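The streaming idea in highlight 4 can be illustrated with a minimal Python sketch. This is not the program's C++ code: it assumes genotypes coded as 0/1/2 alternate-allele counts with no missing calls, and it uses a VanRaden-style normalization as one reading of the "BaldingNicolsKinship(VanRaden/Normalized_IBS)" label. The function name `streaming_kinship` and the demo data are hypothetical.

```python
def streaming_kinship(sites, n_samples):
    """Accumulate an n x n VanRaden-style kinship matrix one site at a time.

    Only the n x n accumulator is held in memory; each site is dropped
    after it has been added, so memory does not grow with the site count.
    """
    acc = [[0.0] * n_samples for _ in range(n_samples)]
    denom = 0.0
    for genotypes in sites:                      # one list of 0/1/2 per site
        p = sum(genotypes) / (2.0 * n_samples)   # alternate allele frequency
        if p == 0.0 or p == 1.0:
            continue                             # monomorphic site, skip
        z = [g - 2.0 * p for g in genotypes]     # centered genotypes
        denom += 2.0 * p * (1.0 - p)
        for i in range(n_samples):
            zi, row = z[i], acc[i]
            for j in range(i, n_samples):        # upper triangle only
                row[j] += zi * z[j]
    if denom == 0.0:
        raise ValueError("no polymorphic sites")
    for i in range(n_samples):
        for j in range(i, n_samples):
            acc[i][j] /= denom
            acc[j][i] = acc[i][j]                # mirror to lower triangle
    return acc


def demo_sites():
    # Made-up genotypes for 4 samples at 3 sites, yielded one site at a time.
    yield [0, 1, 2, 1]
    yield [2, 2, 0, 0]
    yield [1, 0, 1, 2]


kinship = streaming_kinship(demo_sites(), n_samples=4)
```

Because `sites` is consumed as a generator, the same function would work on a file reader that yields one VCF record at a time.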



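The program is aimed at users who already have some bioinformatics background; it is not written with complete beginners in mind.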

MingPCACluster (MingPCAC) is PCA analysis software developed around the VCF format. It also accepts Genotype and other formats, and offers a beta function for spatiotemporal cell expression files. As long as a valid input file is supplied, the PCA and the cluster grouping are produced in the same run.

2 Download and Install


New versions are updated and maintained at hewm2008/MingPCACluster. Please download the latest version from the repository below.

hewm2008/MingPCACluster

2.1 Linux/macOS Download


2.2 Pre-install
MingPCACluster is for Linux/Unix/macOS only.
Before installing, please make sure the following prerequisites are available.
1) convert : recommended to be pre-installed, although it is not required
2) g++ : g++ 4.8 or later with --std=c++11 support is recommended
3) zlib : a zlib version later than 1.2.3 is recommended
4) R : R with ggplot is recommended


2.3 Install
Users can install it with the following options:
Option 1:

        git clone https://github.com/hewm2008/MingPCACluster.git
        cd MingPCACluster;	chmod 755 -R bin/*
        ./bin/MingPCACluster  -h 

3 Parameter description



3.1 MingPCACluster
3.1.1 Main parameter

	Usage: MingPCACluster  -InVCF  <in.vcf.gz>  -OutPut <outPrefix>

		-InVCF         <str>      Input SNP VCF Format
		-InGenotype    <str>      InPut Genotype File
		-InKinship     <str>      Input SNP K Kinship File Format
		-OutPut        <str>      OutPut File Prefix(Kinship PCA etc)

		-KinshipMethod <int>      Method of Kinship [1-4],defaut [1]
		                          1:BaldingNicolsKinship(VanRaden/Normalized_IBS)
		                          2:IBSKinshipImpute 3:IBSKinship 4:p_dis
		-ClusterMethod <str>      Method For Cluster[EM/Kmean/DBSCAN] [EM]

		-help                     Show more Parameters and help [hewm2008]
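For readers unfamiliar with the DBSCAN choice of -ClusterMethod, here is a minimal textbook sketch in Python. It is not the program's implementation. The parameters `epsilon` and `min_points` are meant to mirror the -Epsilon and -MinPointNum options from the detailed parameter list, and clustering 2-D points (for example the first two principal components) is an assumption made for illustration.

```python
import math


def dbscan(points, epsilon, min_points=4):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Indices of all points within epsilon of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= epsilon]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = noise                # may later become a border point
            continue
        cluster += 1                         # point i is a core point
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster          # border point reached from a core
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_points:      # j is also a core point: expand
                queue.extend(nbrs)
    return labels


# Made-up points: two tight groups of four and one isolated outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, epsilon=1.5, min_points=4)
```

Unlike Kmean, this approach does not need K in advance; the number of clusters follows from `epsilon` and `min_points`, and isolated points are reported as noise.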


Brief description of the functions:

	# Usage is self-explanatory: at minimum, one input and one output are enough.
	# For the input file formats see the pdf; the main ones are VCF and gem files.
	# More notes will be posted on Zhihu later.

	MingPCACluster	-InSTOgem	Test.gem.gz	-OutPut	Test	-CellBin	100

	### run without pop.info
	#   MingPCACluster	-InVCF	Khuman.vcf.gz	-OutPut	OUT
	### run with  pop.info
	MingPCACluster	-InVCF	Khuman.vcf.gz	-OutPut	OUT	-InSampleGroup	pop.info
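The help text for -InSampleGroup describes the pop.info format only as "sample groupA". The following Python sketch checks such a file before a run, assuming two whitespace-separated columns per line. Skipping blank lines and lines starting with `#` is an extra assumption, and the sample names are made up.

```python
import io
from collections import Counter


def read_sample_groups(handle):
    """Parse 'sample group' lines into a dict mapping sample -> group."""
    groups = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):   # assumption: ignore these lines
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(
                f"line {line_no}: expected 'sample group', got {line!r}")
        groups[fields[0]] = fields[1]
    return groups


# Hypothetical pop.info content with three samples in two groups.
example = io.StringIO("s1 CEU\ns2 CEU\ns3 YRI\n")
groups = read_sample_groups(example)
sizes = Counter(groups.values())               # number of samples per group
```

For a real file, `open("pop.info")` can be passed in place of the `io.StringIO` object.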


3.1.2 Detail parameters

	Usage: MingPCACluster  -InVCF  <in.vcf.gz>  -OutPut <outPrefix>

		-InVCF        <str>      Input SNP VCF Format
		-InGenotype   <str>      InPut Genotype File
		-InKinship    <str>      Input SNP K Kinship File Format
		-OutPut       <str>      OutPut File Prefix(Kinship PCA etc)

		-KinshipMethod <int>      Method of Kinship [1-4],defaut [1]
		                          1:BaldingNicolsKinship(VanRaden/Normalized_IBS)
		                          2:IBSKinshipImpute 3:IBSKinship 4:p_dis
		-ClusterMethod <str>      Method For Cluster[EM/Kmean/DBSCAN] [EM]

		-help                    Show more Parameters and help [hewm2008]

		-InSampleGroup <string>   In File of sampleGroup info,format(sample groupA)


		-MAF           <float>    Min minor allele frequency filter [0.001]
		-Fchr          <str>      Filter the chrX chr[chrX,chrY,X,Y]
		-Miss          <float>    Max ratio of miss allele filter [0.25]
		-Het           <float>    Max ratio of het allele filter [1.00]
		-HWE           <float>    Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
		-SubPop        <str>      Sub Sample File List to PCA[ALLsample]
		-KeepRemainVCF            keep the VCF after filter

		-RandomCenter             Random diff-center to Re-Run Cluster for Kmean
		-BestKRatio    <float>    Get the best K Cluster by deta-SSE Ratio[0.15]
		-MaxCluNum     <int>      Max Cluster Num to find Best K [12]
		-MinPointNum   <int>      Minimum point number of D-cluster[4]
		-Epsilon       <float>    Epsilon for DBSCAN_Distance/EM_convergence (auto)
		-Iterations    <int>      Iterations number for EM clustering[500]


		-InSTOgem      <str>      InPut STOmics gem File of MIDCounts(beta)
		-CellBin       <int>      STOmics cell bin[50]
		-STOName       <string>   STOmics Sample Name STOName

		-PCANum        <int>      Num of PCA eig [10]
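The site filters -MAF, -Miss and -Het can be pictured with a short Python sketch. It is illustrative only: genotypes are assumed to be 0/1/2 alternate-allele counts with `None` for a missing call, the heterozygote ratio is computed over called genotypes, and the comparison directions (keep when MAF is at least the minimum, and when the missing and het ratios do not exceed the maxima) are assumptions about how the program applies its thresholds.

```python
def passes_site_filters(genotypes, min_maf=0.001, max_miss=0.25, max_het=1.0):
    """Return True if a site passes MAF, missing-rate and het-rate filters."""
    n = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if not called:
        return False                           # nothing called at this site
    miss_ratio = (n - len(called)) / n
    het_ratio = sum(1 for g in called if g == 1) / len(called)
    alt_freq = sum(called) / (2.0 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)        # minor allele frequency
    return maf >= min_maf and miss_ratio <= max_miss and het_ratio <= max_het


# Made-up sites for four samples.
ok_site = passes_site_filters([0, 1, 2, 1])
monomorphic = passes_site_filters([0, 0, 0, 0])
too_missing = passes_site_filters([0, 1, None, None])
too_het = passes_site_filters([1, 1, 1, 1], max_het=0.5)
```

The defaults in the function signature follow the bracketed defaults shown in the parameter list above.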


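3.2 Other parameters
The package also provides a perl plotting script. (The script is expected to change considerably in later releases; it has not been optimized yet because the author has been busy recently.) Its parameters in brief: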

ploteig  -h

	Version:1.16         hewm2008@gmail.com

	Usage: ploteig  -InPCA  pca.eigenvec -OutPrefix Fig


		Options

		-InPCA        <s>   : InPut File of PCA
		-OutPrefix    <s>   : OutPut file prefix

		-help               : Show more help [hewm2008]

		-columns      <s>   : the columns to plot a:b [3:4]
		-ColShap            : colour <=> shape for cluster or subpop
		-border       <i>   : how to plot the border (1,2,4,8,3,31) [3]
		-title        <s>   : title (legend) [PCA]
		-keystyle     <s>   : put key at top right  default(in) [outside]box [outside]
		-pointsize    <i>   : point size for plot [3]

		-BinDir       <s>   : The Bin Dir of gnuplot/R/ps2pdf/convert [$PATH]


3.3 Output files

Output file          Description
out.kinship          Kinship matrix: pairwise relationships between all samples
out.eigenvec         Best clustering and PCA results
out.eigenval         Eigenvalues of the PCA
out.PCA1_PCA2.pdf    Plot of PCA 1 vs PCA 2, colored by cluster
out.K.pdf            Cluster K plot (SSE BDi)
out.cluster          Clustering results for each K
Out.cellbin.gz       bin50 cell result, only with -InSTOgem
Out.cluster pdf/png  Cluster plot in spatial coordinates, only with -InSTOgem

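For example figures, see the application-scenario figures above. The figures and file formats should be self-explanatory; related figures are shown in Examples 1 and 2.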

4 Example



See more detailed usage in the Chinese Documentation
See more detailed usage in the English Documentation
See the example directory and Manual.pdf for more detail, including the sample data and scripts. Some tutorials will be released online later.

	../../bin/MingPCACluster -InVCF in.vcf.gz -OutPut outPrefix

The directories Example/example*/ contain the inputs, the outputs and scripts showing the usage.

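  • Example 1) SNP genotypes from 1000 Genomes resequencing VCF
    A total of 1194 sites were randomly picked from the chr22 dbSNP sites of the 1000 Genomes data, and 203 samples from CEU (49), CHB (46), JPT (56) and YRI (52) were analyzed.
    Figure: clustering trend and best K (K_SSE.png)
    Figure: PCA result with EM Gaussian clustering (PCA.png)
    Figure: PCA result with Kmean clustering (PCA.png)
    Figure: PCA result with DBSCAN clustering (PCA.png)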

  • Example 2) PCA and clustering of cellbin spatiotemporal cell expression


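My first contact with spatiotemporal analysis was mainly through seurat, which I know only superficially. That package works on an n×m sparse matrix (n is the number of samples, m the number of sites), and people around me who work on spatiotemporal data often say it needs a lot of memory. I have little experience with spatiotemporal data; here the expression values are log10-transformed. This program also uses a sparse matrix plus an n×n matrix, and because n (the number of samples) is very large in spatiotemporal data, the memory use is probably hard to reduce.
As a first test I used a 177M file (File.gem.gz) with coordinate range XXmin: 4975, XXmax: 23374, YYmin: 2525, YYmax: 20724. With bin 50, n reaches 88507, so memory is dominated by the 88507×88507 matrix of doubles; the run used 60.742G in total (sparse matrix 5G, dense matrix 55G).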
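The binning step and the memory figure can be made concrete with a small Python sketch. It is not the program's code. It assumes each gem record carries a gene ID, x and y coordinates and a MID count, and that a bin is identified by integer division of the coordinates by the bin size; whether the program anchors bins at 0 or at the minimum coordinate is not stated here. The record values are made up.

```python
from collections import defaultdict


def bin_gem_records(records, bin_size=50):
    """Sum MID counts per (bin, gene); a bin is (x // bin_size, y // bin_size)."""
    bins = defaultdict(lambda: defaultdict(int))
    for gene, x, y, count in records:
        key = (x // bin_size, y // bin_size)
        bins[key][gene] += count
    return bins


def dense_matrix_gib(n, bytes_per_value=8):
    """Memory in GiB of a dense n x n matrix of 8-byte doubles."""
    return n * n * bytes_per_value / 2**30


# Hypothetical records: (gene, x, y, MID count).
records = [
    ("GeneA", 4975, 2525, 2),
    ("GeneA", 4990, 2530, 1),   # falls in the same bin as the first record
    ("GeneB", 5030, 2560, 4),
]
bins = bin_gem_records(records)
estimate = dense_matrix_gib(88507)
```

For n = 88507 the estimate is a little over 58 GiB, the same order as the dense-matrix figure reported above, which shows why the number of bins, not the number of genes, drives the memory use.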


Figure: PCA K plot (out1.png)
Figure: PCA plot (out2.png)
Figure: STOmics cluster plot (out3.png)
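The "best K" figures in both examples relate to the -BestKRatio option ("Get the best K Cluster by deta-SSE Ratio", default 0.15). The exact rule is not documented in this README, so the Python sketch below shows only one plausible reading: take the smallest K for which moving to K+1 reduces the SSE by less than the given ratio. The function name and the SSE values are made up for illustration.

```python
def best_k_by_delta_sse(sse_by_k, ratio=0.15):
    """Pick K from a list of SSE values, where sse_by_k[i] is the SSE for K = i + 1.

    Assumed rule: stop at the first K whose relative SSE drop to K + 1
    is below `ratio`; if no drop is that small, return the largest K tried.
    """
    if not sse_by_k:
        raise ValueError("need at least one SSE value")
    for i in range(len(sse_by_k) - 1):
        current, nxt = sse_by_k[i], sse_by_k[i + 1]
        if current <= 0:
            return i + 1                       # already a perfect fit
        if (current - nxt) / current < ratio:
            return i + 1
    return len(sse_by_k)


# Made-up SSE curve for K = 1..6 with a clear elbow.
sse = [1000.0, 520.0, 260.0, 110.0, 100.0, 95.0]
best_k = best_k_by_delta_sse(sse)
```

In this made-up curve the relative drop from K = 4 to K = 5 is about 9%, which is below 0.15, so the sketch stops at K = 4.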

5 Advantages


Fast speed, low memory
Simple and easy to use
No installation required

6 Discussion


######################swimming in the sky and flying in the sea #############################
