linkedin / photon-ml

A scalable machine learning library on Apache Spark

How can I get the MovieLens dataset in Avro format?

hugstone opened this issue

Hello, I have worked through the photon-ml tutorial on Docker. If I want to run the demo on my local machine, how can I get the MovieLens dataset in Avro format?

@hugstone I am also working with non-Avro data, so I wrote a Python script that transforms CSV into a libsvm-like format (grouped by a categorical variable that serves as your grouping variable). I then modified the existing dev-scripts provided by the LinkedIn team to convert the libsvm output to Avro.

Take a look at this gist.
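For readers who only want the general shape of the libsvm step, here is a minimal Python 3 sketch. It is not the code from the gist or from the dev-scripts. The function name `libsvm_line_to_record` is hypothetical, and the `name`/`term`/`value` feature layout is an assumption about what the target Avro schema expects, so adjust it to match your own schema.

```python
def libsvm_line_to_record(line):
    """Parse one standard libsvm line ("<label> <index>:<value> ...")
    into a plain dict that can later be written as an Avro record.

    The name/term/value feature layout is an assumption; change it to
    match the schema you actually use.
    """
    # Drop a trailing "# comment" and surrounding whitespace.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None  # blank or comment-only line

    label, *pairs = body.split()
    features = []
    for pair in pairs:
        index, value = pair.split(":", 1)
        # Use the libsvm index as the feature name and leave term empty.
        features.append({"name": index, "term": "", "value": float(value)})

    return {"label": float(label), "features": features}


def libsvm_lines_to_records(lines):
    """Convert an iterable of libsvm lines, skipping blank lines."""
    records = []
    for line in lines:
        record = libsvm_line_to_record(line)
        if record is not None:
            records.append(record)
    return records
```

Calling `libsvm_lines_to_records(open("ratings.libsvm"))` (the file name is only an example) would give a list of dicts ready to hand to an Avro writer.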

Please note that the scripts are written for legacy Python 2.7.

The tricky part is GAME.avsc, that is, getting the Avro schema correct.
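To give an idea of what such a schema could look like, here is a rough sketch, not the actual GAME.avsc. The record name, namespace, and the fields `response`, `userId`, `movieId`, and `features` are all assumptions chosen for MovieLens ratings; the real field names have to match whatever your Photon ML job is configured to read. Writing the file uses the third-party `fastavro` package, imported only inside `write_avro`.

```python
import json

# Hypothetical schema for MovieLens ratings; NOT the real GAME.avsc.
GAME_SCHEMA = {
    "type": "record",
    "name": "MovieLensExample",        # hypothetical record name
    "namespace": "example.movielens",  # hypothetical namespace
    "fields": [
        {"name": "response", "type": "double"},
        {"name": "userId", "type": "string"},
        {"name": "movieId", "type": "string"},
        {
            "name": "features",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Feature",
                    "fields": [
                        {"name": "name", "type": "string"},
                        {"name": "term", "type": "string"},
                        {"name": "value", "type": "double"},
                    ],
                },
            },
        },
    ],
}


def rating_to_record(user_id, movie_id, rating):
    """Build one record whose keys match GAME_SCHEMA's fields."""
    return {
        "response": float(rating),
        "userId": str(user_id),
        "movieId": str(movie_id),
        "features": [],  # no extra features in this sketch
    }


def write_avro(path, records, schema=GAME_SCHEMA):
    """Write records to an Avro container file (requires fastavro)."""
    from fastavro import parse_schema, writer  # third-party dependency

    with open(path, "wb") as out:
        writer(out, parse_schema(schema), records)


def schema_as_avsc(schema=GAME_SCHEMA):
    """Return the schema as JSON text, suitable for saving as an .avsc file."""
    return json.dumps(schema, indent=2)
```

For example, `write_avro("ratings.avro", [rating_to_record(1, 31, 2.5)])` would write a single rating, assuming `fastavro` is installed. Keeping the schema as a plain dict makes it easy to dump the same definition to an `.avsc` file, so the data and the schema file cannot drift apart.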