spidermanJie/express-hadoop

#Introduction#

EXPRESS is designed to enable efficient processing of high-dimensional scientific data. It takes advantage of prior knowledge of the data's structure and usage patterns. By performing incongruent data partitioning and locality-aware task scheduling, EXPRESS reduces network traffic and task execution time. Highlighted features include:

* User interfaces to describe and use data in a structure-aware language.
* A novel incongruent data partitioning scheme for replicas. EXPRESS supports the coexistence of multiple partitionings of the same data. It also introduces a set of optimizations to fully realize the potential of incongruent partitioning.
* Data-layout-aware task selection and scheduling. By exposing data layout information, the EXPRESS scheduler collocates map/reduce tasks with the data they use. When the data layout matches the usage pattern, EXPRESS can select the proper map task to accelerate data loading.

#Installation#

##Prerequisites

Packages: hadoop-1.0.1, ant, patch

Environment variables: HADOOP_HOME

##Steps

Apply the EXPRESS patch to hadoop-1.0.1:

    cd $HADOOP_HOME && patch -p0 < $EXPRESS_HOME/express-hadoop-1.0.1.patch

Create express-hadoop.jar:

    ant -f build.xml jar

Edit ${EXPRESS_ROOT}/test/env.sh to set the paths of Hadoop and EXPRESS.

Recompile hadoop-1.0.1:

    ant -f build.xml compile

#How To#

Use express.hdd.HDFGen to generate test data with a specific partitioning scheme:

    bin/hadoop jar express-hadoop.jar hdf.test.HDFGen [dataSize] [partitionOffset] [recordSize] [partitionSize] [outDir]

Use express.hdd.HDFMicroBenchmark to load data with a specific access pattern:

    bin/hadoop jar express-hadoop.jar hdf.test.HDFMicroBenchmark [dataSize] [chunkOffset] [chunkSize] [inDir] [outDir]

Run tests/validate.sh for validation.

#A Motivating Case#

Hyperspectral data is usually collected by sensors on an airborne or spaceborne platform. It is a valuable data source for many critical applications, such as mineral exploration, agricultural assessment, and special target recognition. Figure 1 shows a representative image of a hyperspectral cube.

Figure 1 Graphic representation of hyperspectral data

The image consists of two spatial dimensions and one spectral dimension. Terabytes of such data have been produced daily by EOS satellites since 1997. The accumulated global hyperspectral datasets have now reached the petabyte scale.

To analyze the data for a specific purpose such as geometric correction or mineral searching, the data needs to be partitioned regularly, like the top cube shown in Figure 2(a). The partitions can then be processed independently. MapReduce seems like the proper solution at first, but two issues are readily apparent:

Figure 2 Data Usage and Storage Partitioning

* In traditional MapReduce, the user does not directly control how data is partitioned and distributed. So even when the data usage pattern is the one illustrated by the top cube in Figure 2(b), the data may actually be partitioned like the cubes in Figure 2(a).
* Various usage patterns (partitionings) can be applied to the same chunk of data, depending on the analysis being performed. For instance, change detection tasks require broad spatial regions and several adjacent spectral layers; signal processing tasks have no spatial region requirement, but a partition needs to contain all the spectral layers for one pixel. Figure 2(b) gives three possible usage patterns.

The storage-usage mismatch between Figure 2(a) and Figure 2(b) causes extra network traffic and synchronization. Figure 2(d) shows that nine blocks must be accessed to collect the red chunk of data for processing. Since data blocks are distributed over all nodes in the system, variance in network latency and limits on maximum bandwidth can greatly slow down this data access. Without data locality, the scalability of the map task stage degrades severely in the scenario represented by Figure 2(d). When storage matches the data usage, as described in Figure 2(e), data locality is preserved and the system becomes scalable again.
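The cost of a storage-usage mismatch can be made concrete by counting how many storage blocks a requested chunk overlaps. The following is a minimal sketch, not EXPRESS code: the function name `blocks_touched`, the `(x, y, band)` axis order, and all sizes are assumptions chosen only so that the mismatched case touches nine blocks, as in the scenario described above.

```python
from itertools import product


def blocks_touched(chunk_start, chunk_size, block_size):
    """Return grid coordinates of every storage block that overlaps the chunk.

    All arguments are per-dimension tuples, here assumed to be (x, y, band).
    """
    ranges = []
    for start, size, block in zip(chunk_start, chunk_size, block_size):
        first = start // block               # index of first overlapped block
        last = (start + size - 1) // block   # index of last overlapped block
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Hypothetical request: a 12 x 12 spatial region with 8 spectral bands.
chunk_start, chunk_size = (0, 0, 0), (12, 12, 8)

# Storage cut into 4 x 4 x 8 blocks does not match the request.
mismatched = blocks_touched(chunk_start, chunk_size, (4, 4, 8))

# Storage cut into 12 x 12 x 8 blocks matches the request exactly.
matched = blocks_touched(chunk_start, chunk_size, (12, 12, 8))

print(len(mismatched), len(matched))  # 9 1
```

Each overlapped block may live on a different node, so the block count is a rough proxy for the number of remote reads a single map task would need.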

#Features#

Incongruent Partition enables a different partitioning for each replica of the same data.
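One way to picture the benefit: if each replica is cut into blocks of a different shape, a reader can pick the replica whose blocks best fit the current request. The sketch below is illustrative only and does not reflect the actual EXPRESS selection logic; the replica names, block shapes, and the helper names `blocks_needed` and `pick_replica` are all hypothetical.

```python
from math import prod


def blocks_needed(chunk_start, chunk_size, block_size):
    """Number of storage blocks that a chunk overlaps, per-dimension product."""
    return prod(
        (start + size - 1) // block - start // block + 1
        for start, size, block in zip(chunk_start, chunk_size, block_size)
    )


def pick_replica(replica_layouts, chunk_start, chunk_size):
    """Choose the replica whose block shape needs the fewest blocks."""
    return min(
        replica_layouts,
        key=lambda name: blocks_needed(chunk_start, chunk_size, replica_layouts[name]),
    )


# Hypothetical (x, y, band) block shapes, one per replica of the same cube.
layouts = {
    "replica-0": (64, 64, 1),    # broad spatial tiles, one band each
    "replica-1": (1, 1, 224),    # a single pixel with all of its bands
    "replica-2": (16, 16, 16),   # small general-purpose cubes
}

# Signal processing: every spectral layer of one pixel.
signal = pick_replica(layouts, (10, 20, 0), (1, 1, 224))

# Change detection: a broad spatial region over a few adjacent bands.
change = pick_replica(layouts, (0, 0, 8), (64, 64, 4))

print(signal, change)  # replica-1 replica-0
```

With a single partitioning, at least one of these two requests would have to read many more blocks than necessary.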

Locality Aware Reducer Scheduling takes into account the data produced by each mapper and where that data resides. The task scheduler can therefore place reducers so as to minimize data movement between mappers and reducers over the network.
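The idea can be sketched as a greedy placement: put each reducer on the node that already holds the largest share of its input. This is a simplified illustration under stated assumptions, not the EXPRESS scheduler; the input structure `output_bytes`, the node names, and the function names are hypothetical.

```python
def place_reducers(output_bytes, slots_per_node):
    """Greedily assign each reduce partition to the node holding most of its input.

    output_bytes[partition][node] is the map output size, in bytes, that the
    partition has on that node. slots_per_node[node] is the number of free
    reduce slots on that node.
    """
    free = dict(slots_per_node)
    placement = {}
    # Place the largest partitions first: misplacing them costs the most.
    by_size = sorted(
        output_bytes, key=lambda p: sum(output_bytes[p].values()), reverse=True
    )
    for part in by_size:
        candidates = [node for node, slots in free.items() if slots > 0]
        if not candidates:
            raise ValueError("not enough reduce slots for all partitions")
        best = max(candidates, key=lambda node: output_bytes[part].get(node, 0))
        placement[part] = best
        free[best] -= 1
    return placement


def network_bytes(output_bytes, placement):
    """Bytes that must cross the network given a reducer placement."""
    return sum(
        size
        for part, per_node in output_bytes.items()
        for node, size in per_node.items()
        if node != placement[part]
    )


# Hypothetical map output distribution for three reduce partitions.
output_bytes = {
    "r0": {"nodeA": 900, "nodeB": 100},
    "r1": {"nodeA": 200, "nodeB": 700, "nodeC": 100},
    "r2": {"nodeC": 500, "nodeA": 50},
}
placement = place_reducers(output_bytes, {"nodeA": 1, "nodeB": 1, "nodeC": 1})
print(placement, network_bytes(output_bytes, placement))
```

In this example only 450 of the 2550 bytes of map output leave the node where they were produced. A greedy pass is not guaranteed to be optimal; it is used here only because it is short.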

PipeFile connects two or more MapReduce jobs, unlike pipelines in Hadoop, which stream data to an external local program. It borrows the idea of the Unix named pipe and applies it to a distributed system.
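For readers unfamiliar with the Unix named pipe that inspired PipeFile, the sketch below shows the single-machine version of the idea: one stage writes records into a FIFO while the next stage reads them, with no intermediate file stored on disk. It uses only the standard `os.mkfifo` call on a POSIX system and says nothing about how PipeFile itself is implemented; the stage functions and the FIFO name are hypothetical.

```python
import os
import tempfile
import threading


def first_stage(path, records):
    """Producer: write one record per line into the named pipe."""
    with open(path, "w") as fifo:  # blocks until a reader opens the pipe
        for record in records:
            fifo.write(record + "\n")


def second_stage(path):
    """Consumer: read records as they arrive and transform them."""
    with open(path) as fifo:  # reads until the writer closes the pipe
        return [line.rstrip("\n").upper() for line in fifo]


def run_pipeline(records):
    """Connect the two stages through a FIFO in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1_to_stage2")
        os.mkfifo(path)  # POSIX only
        writer = threading.Thread(target=first_stage, args=(path, records))
        writer.start()
        result = second_stage(path)
        writer.join()
        return result


print(run_pipeline(["alpha", "beta"]))  # ['ALPHA', 'BETA']
```

The consumer starts working as soon as the producer emits data, which is the property that makes a pipe attractive for chaining jobs.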

#People#

Developer: Siyuan Ma
Faculty: Xian-He Sun, Robert Ross
