Seismic Hadoop combines Seismic Unix with Cloudera's Distribution including Apache Hadoop to make it easy to execute common seismic data processing tasks on a Hadoop cluster.
You will need to install Seismic Unix on both your client machine and the servers in your Hadoop cluster.
In order to create the jar file that coordinates job execution, simply run mvn package. This will create a seismic-0.1.0-job.jar file in the target/ directory, which includes all of the necessary dependencies for running a Seismic Unix job on a Hadoop cluster.
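For example, run from the top-level directory of the project (the listing afterwards simply verifies that the jar was built):

mvn package
ls target/seismic-0.1.0-job.jar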
The suhdp script in the bin/ directory may be used as a shortcut for running the following commands. It requires that the HADOOP_HOME environment variable be set on the client machine.
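For example (the installation path shown here is only illustrative; point it at your own Hadoop installation):

export HADOOP_HOME=/usr/lib/hadoop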
The load command to suhdp will take SEG-Y or SU formatted files on the local machine, format them for use with Hadoop, and copy them to the Hadoop cluster.
suhdp load -input <local SEG-Y/SU files> -output <HDFS target> [-cwproot <path>]
The cwproot argument only needs to be specified if the CWPROOT environment variable is not set on the client machine. Seismic Hadoop will use the segyread command to parse a local file unless it ends with ".su".
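For example, to load a local SEG-Y file into HDFS (the file names and paths here are only illustrative):

suhdp load -input /data/survey/shots.segy -output shots.su -cwproot /usr/local/su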
The unload command will read Hadoop-formatted data files from the Hadoop cluster and write them to the local machine.
suhdp unload -input <SU file/directory of files on HDFS> -output <local file to write>
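For example, to copy a processed data set back to the local machine (again, the paths are only illustrative):

suhdp unload -input sorted.su -output /tmp/sorted.su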
The run command will execute a series of Seismic Unix commands on data stored in HDFS by converting the commands to a series of MapReduce jobs.
suhdp run -command "seismic | unix | commands" -input <HDFS input path> -output <HDFS output path> \
-cwproot <path to SU on the cluster machines>
For example, we might run:
suhdp run -command "sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx" \
-input aniso.su -output sorted.su -cwproot /usr/local/su
In this case, Seismic Hadoop will run a MapReduce job that applies the sufilter and suchw commands to each trace during the Map phase, sorts the data by the CDP field in the trace header during the Shuffle phase, and then performs a secondary sort on the receiver locations for each CDP gather in the Reduce phase.
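For comparison, a rough single-machine equivalent of this pipeline in plain Seismic Unix (using the same file names as the example above) would be:

sufilter f=10,20,30,40 < aniso.su | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx > sorted.su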
There are a few things to note about running SU commands on the cluster:
- Most SU commands that are specified are run as-is by the system. The most notable exception is susort, which is performed by the framework, but is designed to be compatible with the standard susort command.
- If the last SU command specified in the command argument is an X Windows command (e.g., suximage, suxwigb), then the system will stream the results of running the pipeline to the client machine, where the X Windows command will be executed locally. Make sure that the CWPROOT environment variable is specified on the client machine in order to support this option.
- Certain commands that are not trace parallel (e.g., suop2) will not work correctly on Seismic Hadoop. Also, commands that take additional input files will not work properly, because the system will not copy those input files to the jobs running on the cluster. We plan to fix this limitation soon.