Introduction ============ Estado is a small java application that collects Hadoop job status including counters. Ultimate goal is to collect all job statistics from all Hadoop clusters and collect them in a central repository for further analysis, dash board and alerting. Example of an alert might be certain counter exceeding some threshold. Architecture ============ Estado should be setup as a cron job one for each cluster. it could run in any Hadoop node in a given cluster. There is a pluggable status consumer, that consumes status data and does whatever it wants to do. Typically, the consumer will write the status data in one central database, so that all Hadoop job status data is consolidated in one place. There is a default JobStatusConsoleConsumer class already included, that simply writes the data to console. In future, I will provide pluggable consumers for file and RDBMS. Configuration ============= The configuration is very simple. Here is an example. cluster.name=default status.filter=srfpk consumer.console.class=org.estado.spi.JobStatusConsoleConsumer Each hadoop cluster should be given an unique name. The filter parameter provides a list of job statuses to be used as filter, where s=succedded r=running f=failed p=preparing and k=killed The last line provides the consumer class name. The second component of the property name is a name (e.g., console) for the consumer. Some consumers may need additional parameters. Here is an example for a file consumer. consumer.file.class=org.estado.spi.JobStatusFileConsumer consumer.file.name=/var/log/estado.log Here is the configuration for an RDBMS cosumer using JDBC consumer.rdbms.class=org.estado.spi.JobStatusRdbmsConsumer consumer.rdbms.url=jdbc:mysql://localhost/starea?user=admin&password=welcome Execution ========= Here is a sample script to run estado JAR=/home/pranab/.m2/repository/mawazo/estado/1.0/estado-1.0.jar CL=org.estado.core.JobStatusChecker CONFIG=/home/pranab/Projects/run/estado/estado.properties hadoop jar $JAR $CL $CONFIG Sample Output ============= Here is some sample output for console consumer. Cluster:default Job Id:job_201107171959_0001 User:pranab Start time:2011-07-17 08:04:03 End time:2011-07-17 08:06:20 Duration:02min:17sec:230ms Map progress:100 Reduce progress:100 Status:completed Counter group:Job Counters Launched reduce tasks=1 SLOTS_MILLIS_MAPS=81621 Total time spent by all reduces waiting after reserving slots (ms)=0 Total time spent by all maps waiting after reserving slots (ms)=0 Launched map tasks=1 Data-local map tasks=1 SLOTS_MILLIS_REDUCES=52691 Counter group:FileSystemCounters FILE_BYTES_READ=349 HDFS_BYTES_READ=25157 FILE_BYTES_WRITTEN=115394 HDFS_BYTES_WRITTEN=331 Counter group:Map-Reduce Framework Reduce input groups=2 Combine output records=2 Map input records=1000 Reduce shuffle bytes=349 Reduce output records=2 Spilled Records=4 Map output bytes=19776 Combine input records=2000 Map output records=2000 SPLIT_RAW_BYTES=111 Reduce input records=2