Introduction

Big Data Tools (BDT) is a computational framework for analyzing very large-scale genomic data. It currently offers the following tools.

BDVD: Big Data Variance Decomposition for High-throughput Genomic Data

Variance decomposition (e.g., ANOVA, PCA) is a fundamental statistical tool for understanding data structure. High-throughput genomic data have heterogeneous sources of variation: some are of biological interest, and others are unwanted (e.g., lab and batch effects). Knowing the relative contribution of each source to the total data variance is crucial for making data-driven discoveries. However, with massive amounts of high-dimensional data of heterogeneous origin, analyzing variances is non-trivial: the dimension, size, and heterogeneity of the data all pose significant challenges. Big Data Variance Decomposition (BDVD) is a new tool developed to solve this problem. Built upon the recently developed RUV approach, BDVD decomposes data into biological signals, unwanted systematic variation, and independent random noise. The biological signals can then be further decomposed to study variation among genomic loci or sample types, or correlation between different data types. The implementation incorporates techniques for handling big data and offers several unique features:

Implemented in efficient C++
Fully exploits multi-core/multi-CPU computation power
Ability to handle very large-scale data (e.g., a 30,000,000 × 500 data matrix)
Ability to take a large number of BAM files directly as input, with multi-core parallel processing
Provides command line tools
Provides an R package to run BDVD from R for seamless integration
Transparent, open-source code
Easy installation: a one-line command, no root access required

In addition, BDVD naturally outputs normalized biological variations for downstream statistical inference, such as clustering large-scale genomic loci with BigClust, which is also provided in BDT.
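
To make the idea of variance decomposition concrete, here is a minimal sketch of the simplest case mentioned above, a one-way ANOVA split of the total sum of squares into a between-group and a within-group part. This is not BDVD's algorithm (which builds on RUV and is implemented in C++); the function name `anova_decomposition` and the toy batch data are hypothetical and serve only as an illustration.

```python
from statistics import fmean


def anova_decomposition(groups):
    """Split the total sum of squares into between- and within-group parts.

    `groups` maps a group label (e.g., a batch or sample type) to a list of
    measurements. Returns (ss_total, ss_between, ss_within).
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = fmean(all_values)

    # Total variation: squared deviations of every value from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)

    # Between-group variation: distance of each group mean from the grand
    # mean, weighted by group size.
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )

    # Within-group variation: spread of values around their own group mean.
    ss_within = sum(
        (x - fmean(values)) ** 2
        for values in groups.values()
        for x in values
    )
    return ss_total, ss_between, ss_within


# Hypothetical toy data: one genomic locus measured in three batches.
toy = {
    "batch_a": [1.0, 2.0, 3.0],
    "batch_b": [4.0, 5.0, 6.0],
    "batch_c": [7.0, 8.0, 9.0],
}
ss_total, ss_between, ss_within = anova_decomposition(toy)
fraction_between = ss_between / ss_total
```

For this toy data the two parts add up to the total (54 + 6 = 60), and 90% of the variance is attributable to differences between batches. BDVD addresses the much harder setting in which the unwanted sources are not known in advance and the data matrix is very large.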

BigClust: Big Data Clustering Methods

Cluster analysis is the task of assigning a set of elements to groups (clusters) on the basis of their similarity. BigClust offers several tools for quickly clustering very large-scale datasets.

BigKmeans

BigKmeans enhances the widely used K-means algorithm with important improvements that make it well suited to big data.

Improved seeding (choice of initial centroids) with k-means++
Ability to evaluate the optimal K at no extra computational cost
Implemented in efficient C++
Fully exploits multi-core/multi-CPU computation power
Ability to handle very large-scale data (e.g., a 30,000,000 × 500 data matrix)
Built-in ability to exploit multi-machine resources with distributed computing for extremely large datasets
Provides command line tools
Provides an R package to run BigKmeans from R for seamless integration
Transparent, open-source code
Easy installation: a one-line command, no root access required
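
As an illustration of the k-means++ seeding named in the list above, the following is a minimal single-machine sketch of the generic textbook procedure: the first centroid is drawn uniformly, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. It is not BigKmeans's C++ implementation; the function names and the toy points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids from `points` using k-means++ seeding."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    if rng is None:
        rng = random.Random()

    # First centroid: chosen uniformly at random.
    centroids = [points[rng.randrange(len(points))]]
    # Squared distance from each point to its nearest chosen centroid.
    nearest = [squared_distance(p, centroids[0]) for p in points]

    while len(centroids) < k:
        if sum(nearest) == 0:
            # Every point coincides with a centroid; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample proportionally to squared distance, so points far from
            # all current centroids are more likely to become the next seed.
            idx = rng.choices(range(len(points)), weights=nearest, k=1)[0]
        centroids.append(points[idx])
        # Update each point's distance to its nearest centroid.
        nearest = [
            min(d, squared_distance(p, points[idx]))
            for p, d in zip(points, nearest)
        ]
    return centroids


# Hypothetical toy data: two tight groups of points that are far apart.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
seeds = kmeans_pp_seeds(points, 2, rng=random.Random(0))
```

Because an already chosen point has zero weight, it cannot be drawn again while any other point remains at a positive distance. Spreading the seeds out in this way typically reduces the number of iterations K-means needs compared with purely uniform seeding.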

Installation

Platforms

BDT runs on the following platforms:

Linux
Mac OS X
Windows

Installation on Linux

Download the latest source code: v0.1.4.tar.gz

Extract and go to the extracted directory:

    $ tar xfz v0.1.4.tar.gz
    $ cd BDT-v0.1.4

Build and install BDT:
```
 $ make bdt_home={install_path}
```

where {install_path} is the installation directory (it must be an absolute path). The directory will be created if it does not exist.

Installation on Mac OS X

Ensure that the Xcode Command Line Tools are installed. If they are not, open Terminal and run:
```
 $ xcode-select --install
```

A pop-up window will appear asking whether to install the tools.

Download the latest source code: v0.1.4.tar.gz

Extract and go to the extracted directory:

    $ tar xfz v0.1.4.tar.gz
    $ cd BDT-v0.1.4

Build and install BDT:
```
 $ make bdt_home={install_path}
```

where {install_path} is the installation directory (it must be an absolute path). The directory will be created if it does not exist.

Installation on Windows

Ensure that Python 3.3.3 (64-bit) is installed. If it is not, download the Windows x86-64 MSI installer (3.3.3) and install it.
Ensure that the Visual C++ Redistributable Packages for Visual Studio 2013 are installed. If they are not, download vcredist_x64.exe and install it.
Download the BDT executable zip, BDT-v0.1.4-win64.zip.
Extract it; all the required executables and scripts will be in the extracted directory.

R package

The bdt R package runs BDT from within R for seamless integration. Under the hood, it simply calls the BDT command line tools and provides a convenient way to load the BDT output into R for follow-up analysis.

Install BDT (see the sections above)
Install R
Ensure that the devtools package is installed. If it is not, run:
```
 install.packages('devtools')
```

Install the bdt library:

    library(devtools)
    install_git('https://github.com/fangdu64/rpackages', subdir = 'bdt')

Use bdt:
```
 library(bdt)
```

Example usage can be found in the R examples and analysis.

Usage

Authors

BDVD: Fang Du, Ben Sherwood, Bing He, Hongkai Ji

BigClust: Fang Du, Ben Sherwood, Hongkai Ji
