Scala project that reads a StackExchange XML data file, processes its data, and saves the result as a CSV file.
This project started as part of a data science project for tag prediction on StackExchange data. After several versions (a streaming StAX reader, an event reader with multi-threading, and an Apache Camel ETL reader), we ended up with this solution, which gave the best performance of the approaches we tried (it handled 51GB in about 70 minutes).
- Read the XML file using the event reader model within an AKKA actor
  - Un-marshal data into the Post model
  - Filter answer posts
  - Aggregate multiple posts into a batch and send it for processing
- RoundRobinPool of AKKA actors to handle batches
  - Normalize each post's information (title, body and tags)
    - Remove stop words using a BloomFilter index
    - Remove HTML tags
    - Remove numbers
  - Select a random file within range to save the output as CSV for load-balancing
  - Send an acknowledgement from the process-actor to the read-actor
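
The read, filter, batch and normalize steps above can be sketched without actors, using only the JDK. This is a minimal illustration and not the project's actual code. The `Post` fields, the `PipelineSketch` object, the sample XML and the tiny stop-word list are all assumptions made for the example. A plain `Set` stands in for the Bloom filter, `grouped` stands in for the batching that feeds the actor pool, and "filter" is assumed to mean dropping answers (`PostTypeId="2"` in the StackExchange dump) so that only questions remain.

```scala
import java.io.StringReader
import javax.xml.namespace.QName
import javax.xml.stream.XMLInputFactory
import scala.collection.mutable.ListBuffer

// Hypothetical Post model; the field names are assumptions for illustration.
final case class Post(id: Long, postTypeId: Int, title: String, body: String, tags: String)

object PipelineSketch {
  // Tiny stand-in stop-word list; a plain Set replaces the Bloom filter here.
  val stopWords: Set[String] = Set("a", "an", "the", "is", "to", "in", "of", "and")

  // Un-marshal every <row .../> element into a Post using StAX events.
  // Assumes each row carries numeric Id and PostTypeId attributes.
  def readPosts(xml: String): List[Post] = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    val posts = ListBuffer.empty[Post]
    while (reader.hasNext()) {
      val event = reader.nextEvent()
      if (event.isStartElement() && event.asStartElement().getName().getLocalPart() == "row") {
        val el = event.asStartElement()
        // Missing attributes (e.g. Title on answers) become empty strings.
        def attr(name: String): String =
          Option(el.getAttributeByName(new QName(name))).map(_.getValue()).getOrElse("")
        posts += Post(attr("Id").toLong, attr("PostTypeId").toInt, attr("Title"), attr("Body"), attr("Tags"))
      }
    }
    reader.close()
    posts.toList
  }

  // Remove HTML tags, numbers and stop words from free text.
  def normalize(text: String): String =
    text
      .replaceAll("<[^>]+>", " ") // remove HTML tags
      .replaceAll("\\d+", " ")    // remove numbers
      .toLowerCase
      .split("\\s+")
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .mkString(" ")

  // Assumes tags are stored as "<scala><xml>"; turns them into "scala xml".
  def normalizeTags(tags: String): String =
    tags.replaceAll("[<>]+", " ").trim

  // Keep questions only, normalize them, and group them into batches.
  def run(xml: String, batchSize: Int): List[List[Post]] =
    readPosts(xml)
      .filter(_.postTypeId == 1)
      .map(p => p.copy(title = normalize(p.title), body = normalize(p.body), tags = normalizeTags(p.tags)))
      .grouped(batchSize)
      .toList

  // Made-up sample data in the shape of a StackExchange Posts.xml file.
  val sample: String =
    """<posts>
      |  <row Id="1" PostTypeId="1" Title="How to parse 42 XML files in the Scala" Body="&lt;p&gt;I have 3 files&lt;/p&gt;" Tags="&lt;scala&gt;&lt;xml&gt;" />
      |  <row Id="2" PostTypeId="2" Body="&lt;p&gt;Use an event reader&lt;/p&gt;" />
      |  <row Id="3" PostTypeId="1" Title="What is Akka" Body="&lt;p&gt;Actors and messages&lt;/p&gt;" Tags="&lt;akka&gt;" />
      |</posts>""".stripMargin

  def main(args: Array[String]): Unit =
    run(sample, batchSize = 2).foreach(batch => println(batch.mkString("\n")))
}
```

In the actor-based design described above, each batch would be sent as a message to the pool and acknowledged back to the reader, which keeps the reader from running far ahead of the workers.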
- AKKA: concurrent, message-driven library
- GUAVA: Google Guava, used for the Bloom filter
- Scala CSV: CSV reader/writer for Scala
- TestKit: Akka test library
- ScalaTest: Scala test library
XMLEventReader cannot read XML files that start with a BOM (byte order mark); remove the BOM first with the command below:
awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' input_withbom.xml > input.xml
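
As an alternative to preprocessing the file, the BOM could also be skipped in code before the bytes reach the XML reader. The sketch below is one possible approach and is not part of the project. It assumes the input is opened as an `InputStream`, handles only the UTF-8 BOM, and needs Java 9 or later for `readNBytes`. The `BomSketch` object and `skipBom` function are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, InputStream, PushbackInputStream}

object BomSketch {
  // The three bytes of a UTF-8 byte order mark.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // Wrap a stream so that a leading UTF-8 BOM, if present, is skipped.
  def skipBom(in: InputStream): InputStream = {
    val pushback = new PushbackInputStream(in, Utf8Bom.length)
    val head = new Array[Byte](Utf8Bom.length)
    val read = pushback.readNBytes(head, 0, head.length)
    val hasBom = read == Utf8Bom.length && head.sameElements(Utf8Bom)
    // No BOM: push the bytes back so the caller sees the stream unchanged.
    if (!hasBom && read > 0) pushback.unread(head, 0, read)
    pushback
  }

  def main(args: Array[String]): Unit = {
    val bytes = Utf8Bom ++ "<posts/>".getBytes("UTF-8")
    val cleaned = skipBom(new ByteArrayInputStream(bytes))
    println(new String(cleaned.readAllBytes(), "UTF-8"))
  }
}
```

Unlike the `awk` command, this does not require writing a second copy of a large input file.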