Running Awk in Parallel to process 256M Records

I wrote a blog post about this work, and it was also discussed on Hacker News.

This repo contains the artefacts from my solution to the Smoky Mountains Data Challenge 2018, which won first prize. Below, I describe the approach, the method, and some interesting tidbits.

A PDF report is in the /report folder.

SMC Data Challenge 4: Scientific Publications Mining

To run the awk code:

awk -f prob2.awk stop_words.txt data_dir/*.txt
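
The command above passes the stop-word list as the first file argument and the data files after it. The real logic lives in prob2.awk. The following is only a minimal sketch of how an awk program can treat the first file differently from the rest, assuming one stop word per line and a hypothetical word-counting task. The sample inputs and the temporary directory are made up for illustration and are not the challenge data.

```shell
# Hypothetical sample inputs in a scratch directory (not the challenge data).
tmp=$(mktemp -d)
mkdir "$tmp/data_dir"
printf 'the\nof\n' > "$tmp/stop_words.txt"
printf 'the mining of the data\n' > "$tmp/data_dir/a.txt"
printf 'data of awk\n' > "$tmp/data_dir/b.txt"

# NR == FNR holds only while awk is reading the first file argument,
# so that rule loads the stop words and skips the counting rule.
result=$(awk '
NR == FNR { stop[$1] = 1; next }      # first file: remember each stop word
{
    for (i = 1; i <= NF; i++)         # later files: count words not in the list
        if (!($i in stop)) count[$i]++
}
END { for (w in count) print w, count[w] }
' "$tmp/stop_words.txt" "$tmp"/data_dir/*.txt | sort)

echo "$result"
rm -rf "$tmp"
```

With these sample files the script prints `awk 1`, `data 2`, and `mining 1`, one per line. The NR == FNR test relies on the first file being non-empty; an empty stop-word file would cause the first data file to be read as the stop-word list.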

To compile the Swift (Swift/T) code:

stc runprob2.swift  # generates runprob2.tic

To run the Swift code:

turbine -n 340 runprob2.tic
