teanalab / dbpedia2fields

Spark code for converting DBpedia Turtle files into the TrecText format used in our runs for https://github.com/iai-group/DBpedia-Entity

Prerequisites:

Run TriplesToTrec with the following 13 files from DBpedia 2015-10 as input:

  • anchor_text_en.ttl
  • article_categories_en.ttl
  • category_labels_en.ttl
  • infobox_properties_en.ttl
  • infobox_property_definitions_en.ttl
  • instance_types_transitive_en.ttl
  • labels_en.ttl
  • long_abstracts_en.ttl
  • mappingbased_literals_en.ttl
  • mappingbased_objects_en.ttl
  • page_links_en.ttl
  • persondata_en.ttl
  • short_abstracts_en.ttl
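
Because the job expects all 13 files, it can help to check the input directory before submitting. The following is a minimal sketch, not part of this repository: `missing_inputs` is a hypothetical helper, and the file names are copied from the list above.

```shell
#!/bin/sh
# The 13 DBpedia 2015-10 input files listed above, one per line.
required_files="anchor_text_en.ttl
article_categories_en.ttl
category_labels_en.ttl
infobox_properties_en.ttl
infobox_property_definitions_en.ttl
instance_types_transitive_en.ttl
labels_en.ttl
long_abstracts_en.ttl
mappingbased_literals_en.ttl
mappingbased_objects_en.ttl
page_links_en.ttl
persondata_en.ttl
short_abstracts_en.ttl"

# missing_inputs DIR (hypothetical helper, not provided by dbpedia2fields):
# prints the name of every required file that is absent from DIR and
# returns non-zero if at least one is missing.
missing_inputs() {
    dir=$1
    status=0
    for f in $required_files; do
        if [ ! -f "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return $status
}
```

For example, `missing_inputs dbpedia-2015-10-subset || echo "some input files are missing"` would list any absent files in the directory used by the example command below.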

Some tuning of Spark parameters is required to run it successfully, for example --executor-memory 22g --driver-memory 6g --conf spark.yarn.executor.memoryOverhead=1g.

Example command to run:

$ sbt assembly
$ $SPARK_HOME/bin/spark-submit --class 'edu.wayne.dbpedia2fields.TriplesToTrec' --master 'local[*]' --executor-memory 22g --driver-memory 6g --conf spark.yarn.executor.memoryOverhead=1g target/scala-2.10/dbpedia2fields-assembly-1.0.jar 'dbpedia-2015-10-subset/*.ttl' triples-to-trec

License: MIT License