RosbagInputFormat is an open-source, splittable Hadoop InputFormat for the rosbag file format.
# Check the version of the bag file
java -jar lib/rosbaginputformat_2.11-0.9.0-SNAPSHOT.jar --version -f HMB_1.bag
The index is a very small configuration file containing a protobuf array that is passed in the job configuration. Note that this step does not process or parse the whole bag file; it simply seeks to the required offsets.
# Extract the index from the bag file
java -jar lib/rosbaginputformat_2.11-0.9.0-SNAPSHOT.jar -f HMB_1.bag
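A quick local sanity check (a minimal sketch; it assumes the index was written as HMB_1.bag.idx.bin, the file name the job configuration below refers to):

import os

# The index should be tiny compared to the bag itself.
idx_size = os.path.getsize('HMB_1.bag.idx.bin')
bag_size = os.path.getsize('HMB_1.bag')
print('index: %d bytes, bag: %d bytes' % (idx_size, bag_size))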
Using your favorite tool, put the bag file in your working HDFS folder. Note: keep the small index file local and pass it to your jobs as configuration; do not put small files in HDFS. Assuming your HDFS working folder is /user/spark/ (e.g. hdfs dfs -mkdir /user/spark/):
hdfs dfs -put data/HMB_1.bag data/
hdfs dfs -ls data/
Note: your HDFS address might differ.
fin = sc.newAPIHadoopFile(
    path="hdfs://127.0.0.1:9000/user/spark/data/HMB_1.bag",
    inputFormatClass="de.valtech.foss.RosbagMapInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"RosbagInputFormat.chunkIdx": "./HMB_1.bag.idx.bin"})
To interpret the messages we need the connections. We could pass the connections in the job configuration as well; for now we collect them into a dictionary on the Spark driver and use it in the subsequent RDD actions. Note that the next version of the RosbagInputFormat will provide alternative implementations. Collect the connections from all Spark partitions of the bag file into the Spark driver:
connections = fin.filter(lambda r: r[1]['header']['op'] == 7) \
                 .map(lambda r: r[1]['header']).collect()
conn_d = {str(k['topic']): k for k in connections}
conn_d.keys()
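With the connection dictionary on the driver, the data records can be related back to their topics. In the rosbag 2.0 format, op == 7 marks a connection record (as used above) and op == 2 a message data record; the sketch below also assumes that each message header carries the rosbag 'conn' id field:

# Count message data records (op == 2) per connection id.
msg_counts = fin.filter(lambda r: r[1]['header']['op'] == 2) \
                .map(lambda r: r[1]['header']['conn']) \
                .countByValue()
print(msg_counts)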