sungoak / avro_pyspark

Read Avro Files into Pyspark

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

To build this, you need to first check out the latest spark source https://github.com/apache/spark and run sbt/sbt publish-local. This will put spark in maven's local repository.

You will also want to build this version of spark with ./make-distribution.sh --tgz

Then from this repository, run mvn package to create avro-0.1.jar.

To start up pyspark with with this library, run SPARK_CLASSPATH=/tmp/avro-0.1.jar bin/pyspark from the distribution you created above.

To load in an Avro file in pyspark use:

avroRdd = sc.newAPIHadoopFile("/tmp/data.avro", 
  "org.apache.avro.mapreduce.AvroKeyInputFormat", 
  "org.apache.avro.mapred.AvroKey", 
  "org.apache.hadoop.io.NullWritable",
  keyConverter="example.AvroGenericConverter")

About

Read Avro Files into Pyspark


Languages

Language:Scala 100.0%