SolidBin is a genome binning method for metagenomic contigs, based on a semi-supervised spectral clustering method.
We recommend using conda to run SolidBin.
After installing Anaconda (or Miniconda), first obtain SolidBin:
git clone https://github.com/sufforest/SolidBin
Then simply create the solidbin environment:
conda env create -f environment.yml
source activate solidbin
We also provide a Docker image. If you are more familiar with Docker, you can get our image with:
docker pull sufforest/solidbin
A simple way to use our image is to mount your data directory and run:
docker run -it -v /data/StrainMock/input:/input -v /data/StrainMock/output:/output solidbin python SolidBin.py --contig_file /input/StrainMock_Contigs_cutup_10K_nodup_filter_1K.fasta --composition_profiles /input/kmer_4.csv --coverage_profiles /input/cov_inputtableR.tsv --output /output/result.tsv --log /output/log.txt
Assuming /data/StrainMock contains the data on your machine, this command mounts two directories into the container so that SolidBin can use them.
If you do not have composition or coverage profiles, you can enter the container and generate them yourself:
docker run -it -v /data/StrainMock/input:/input -v /data/StrainMock/output:/output solidbin sh
The preprocessing steps generate the coverage profile and composition profile used as input to SolidBin.
Several methods can generate these two types of information; we provide one of them below.
The composition profile is a vector representation of each contig; we use k-mer frequencies to generate it.
$ python scripts/gen_kmer.py /path/to/data/contig.fasta 4 /path/to/input/composition.csv
Here we choose k=4. By default, we only keep contigs longer than 1000 bp; you can specify a different threshold using -t.
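As a rough illustration of what a composition profile contains, the sketch below counts normalized 4-mer frequencies for a single sequence. This is a toy version only; the actual gen_kmer.py script may canonicalize reverse complements and handle ambiguous bases differently.

```python
from itertools import product

def kmer_profile(seq, k=4):
    """Return the normalized k-mer frequency vector of a sequence.

    Toy sketch: enumerates all 4**k possible k-mers over ACGT and
    counts occurrences with a sliding window. k-mers containing
    other characters (e.g. N) are simply skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:
            counts[km] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[km] / total for km in kmers]

profile = kmer_profile("ACGTACGTACGT")
print(len(profile))  # 256 possible 4-mers
```

One such vector per contig, written as comma-separated rows, is the shape of the composition profile described below.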
For the coverage profile we use minimap, since it can handle both short-read and long-read samples.
If you use SolidBin in Docker, all dependencies and environment variables are already configured; just mount your input directory to /input in the container, then slightly modify gen_cov.sh and run it.
Your input directory should look like this:
.
+-- assembly.fasta
+-- sr
| +-- short_read_sample_1
| +-- short_read_sample_2
+-- pb
| +-- pacbio_sample_1
| +-- pacbio_sample_2
| +-- pacbio_sample_3
For the conda environment, you should check whether perl is installed.
- Usage: [--contig_file CONTIG_FILE] [--coverage_profiles COVERAGE_PROFILES] [--composition_profiles COMPOSITION_PROFILES] [--priori_ml_list ML_LIST] [--priori_cl_list CL_LIST] [--output OUTPUT] [--log LOG_LOCATION] [--clusters CLUSTERS] [-a ALPHA] [-b BETA] [--use_sfs]
- arguments
--contig_file CONTIG_FILE:
The contigs file.
--coverage_profiles COVERAGE_PROFILES:
The coverage profiles, containing a table where each
row corresponds to a contig, and each column corresponds
to a sample. All values are separated by tabs.
--composition_profiles COMPOSITION_PROFILES:
The composition profiles, containing a table where
each row corresponds to a contig, and each column
corresponds to the frequency of a particular k-mer.
All values are separated by commas.
--output OUTPUT:
The output file, storing the binning result.
- optional
--priori_ml_list ML_LIST:
The list of contig pairs under must-link constraints, one row per constraint,
in the format: contig_1,contig_2,weight.
--priori_cl_list CL_LIST:
The list of contig pairs under cannot-link constraints, one row per constraint,
in the format: contig_1,contig_2,weight.
--log LOG_LOCATION:
The log file, storing the binning process and parameters.
--clusters CLUSTERS:
Specify the number of clusters. If not specified, the
number of clusters is estimated using single-copy genes.
-a ALPHA:
Specify the parameter for must-link constraints; if not specified,
alpha is estimated by a one-dimensional search.
-b BETA:
Specify the parameter for cannot-link constraints; if not specified,
beta is estimated by a one-dimensional search.
--use_sfs:
Use sequence feature similarity to generate must-link constraints.
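For reference, a must-link or cannot-link list is a plain CSV file with one constraint per row. The snippet below writes a small hypothetical must-link file; the contig names here are made up, and real names must match the sequence IDs in your contig FASTA.

```python
import csv
import io

# Hypothetical constraints: each row is contig_1,contig_2,weight.
must_links = [
    ("contig_1", "contig_2", 1.0),
    ("contig_3", "contig_4", 0.8),
]

# Write one constraint per row (use open("ml_list.csv", "w") for a real file).
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerows(must_links)
ml_text = buf.getvalue()
print(ml_text)
```

The resulting file can then be passed via --priori_ml_list (or, for cannot-link pairs in the same format, --priori_cl_list).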
- different SolidBin modes
SolidBin mode | Usage |
---|---|
SolidBin-naive | --contig_file --coverage_profiles --composition_profiles --output --log |
SolidBin-SFS | --contig_file --coverage_profiles --composition_profiles --output --log --use_sfs |
SolidBin-coalign | --contig_file --coverage_profiles --composition_profiles --output --log --priori_ml_list |
SolidBin-CL | --contig_file --coverage_profiles --composition_profiles --output --log --priori_cl_list |
SolidBin-SFS-CL | --contig_file --coverage_profiles --composition_profiles --output --log --priori_cl_list --use_sfs |
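After a run, the file given to --output holds the binning result as tab-separated values. The sketch below groups contigs by bin label, assuming each row holds a contig name and its bin label; this column layout is an assumption, so check your own output file to confirm it.

```python
import csv
import io

# Toy stand-in for a result.tsv produced by SolidBin (assumed layout:
# contig name, then bin label, separated by a tab).
result_tsv = "contig_1\t0\ncontig_2\t1\ncontig_3\t0\n"

# Group contig names by their bin label.
bins = {}
for contig, label in csv.reader(io.StringIO(result_tsv), delimiter="\t"):
    bins.setdefault(label, []).append(contig)

for label, contigs in sorted(bins.items()):
    print(label, len(contigs))
```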
We also provide a MATLAB version of the code for reproducing the results in our paper "SolidBin: Improving Metagenome Binning with Semi-supervised Normalized Cut".
We have wrapped our MATLAB code in Python, so you can use it in the same way as the Python version (please copy "auxiliary" into "matlab_ver" before using the MATLAB version of SolidBin).
NOTICE: We have only tested this wrapper on MATLAB R2017a and cannot guarantee stability on other versions of MATLAB.
Please send bug reports or questions (such as which mode is appropriate for your datasets) to Ziye Wang: zwang17@fudan.edu.cn and Dr. Shanfeng Zhu: zhusf@fudan.edu.cn
Wang Z., et al. "SolidBin: Improving Metagenome Binning with Semi-supervised Normalized Cut." Bioinformatics. 2019 Apr 12. pii: btz253. doi: 10.1093/bioinformatics/btz253.
Note: If you cite SolidBin in your paper, please specify the mode of SolidBin you used in your paper to avoid confusion. Thank you.