OguzUzman / stackexchange-xml-akka-importer

stackexchange-xml-akka-importer

A Scala project that reads a Stack Exchange XML data file, processes its data and saves the result as CSV. The project started as part of a data science project for tag prediction on Stack Exchange data. After several earlier versions (a streaming StAX reader, an event reader with multi-threading, and a Java Camel ETL reader), we ended up with this solution, which gave the best performance of the approaches tried (it handled 51GB in about 70 minutes).

Steps

  • Read the XML file using the event reader model within an Akka actor
  • Unmarshal the data into the Post model
  • Filter answer posts
  • Aggregate multiple posts into a batch and send it for processing
  • Use a RoundRobinPool of Akka actors to handle the batches
  • Normalize each post's information (title, body and tags)
    • Remove stop words using a BloomFilter index
    • Remove HTML tags
    • Remove numbers
  • Select a random file within a range to save the output as CSV, for load balancing
  • Send an acknowledgement from the process actor to the read actor
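
The filtering, batching and normalization steps above can be sketched without Akka as follows. This is only an illustration under stated assumptions, not the project's code: the `Post` fields, the stop-word list, the `output-N.csv` file naming and all function names are hypothetical. A plain `Set` stands in for the BloomFilter index, answers are assumed to be the posts that get dropped (`PostTypeId` 2 in the Stack Exchange dump), and tags are passed through unchanged for simplicity.

```scala
import scala.util.Random

// Hypothetical model; the field names are assumptions, not the project's Post class.
final case class Post(id: Long, postTypeId: Int, title: String, body: String, tags: String)

object NormalizeSketch {
  // Tiny stand-in stop-word list. The project uses a BloomFilter index here;
  // a Set is swapped in to keep the sketch dependency-free.
  val stopWords: Set[String] = Set("a", "an", "the", "is", "to", "of", "in", "and")

  private val HtmlTag = "<[^>]+>".r
  private val Digits  = "\\d+".r

  // Remove HTML tags, then numbers, then stop words (case-insensitive).
  def normalize(text: String): String = {
    val noHtml    = HtmlTag.replaceAllIn(text, " ")
    val noNumbers = Digits.replaceAllIn(noHtml, " ")
    noNumbers.toLowerCase
      .split("\\s+")
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .mkString(" ")
  }

  // Normalize title and body; tags are left as-is in this sketch.
  def normalizePost(p: Post): Post =
    p.copy(title = normalize(p.title), body = normalize(p.body))

  // In the Stack Exchange dump, PostTypeId 1 is a question and 2 is an answer.
  def isQuestion(p: Post): Boolean = p.postTypeId == 1

  // Drop answers (assumption), group the rest into fixed-size batches,
  // and normalize every post in each batch.
  def batches(posts: Iterator[Post], size: Int): Iterator[Seq[Post]] =
    posts.filter(isQuestion).grouped(size).map(_.map(normalizePost))

  // Pick one of `fileCount` output files at random (hypothetical naming scheme).
  def pickOutputFile(fileCount: Int, rnd: Random): String =
    s"output-${rnd.nextInt(fileCount)}.csv"
}
```

In the actual pipeline each batch would be sent to a pool of worker actors rather than processed inline; the sketch only shows what happens to the data at each step.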

Dependencies

Notes

XMLEventReader cannot read XML files that start with a byte order mark (BOM). Remove the BOM first with the command below:

awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' input_withbom.xml > input.xml
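
As an alternative to preprocessing the file, the BOM could be skipped while opening the stream. The following is a minimal sketch using only `java.io`; it is not part of the project, and the `BomSketch` object and `skipUtf8Bom` function are hypothetical names. It handles the UTF-8 BOM only.

```scala
import java.io.{InputStream, PushbackInputStream}

object BomSketch {
  // The UTF-8 byte order mark: EF BB BF.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // Returns a stream positioned after the UTF-8 BOM if one is present,
  // or at the very first byte otherwise.
  def skipUtf8Bom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)

    // Read up to three bytes; a single read call may return fewer.
    var n = 0
    var done = false
    while (!done && n < head.length) {
      val r = pushback.read(head, n, head.length - n)
      if (r <= 0) done = true else n += r
    }

    // Push the bytes back unless they are exactly the BOM.
    val isBom = n == Utf8Bom.length && head.sameElements(Utf8Bom)
    if (!isBom && n > 0) pushback.unread(head, 0, n)
    pushback
  }
}
```

The returned stream can then be handed to whichever XML reader is in use, for example `BomSketch.skipUtf8Bom(new java.io.FileInputStream("input_withbom.xml"))`.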

Languages

Language:Scala 100.0%