InsightDataScience_DataEngineering_CodingChallenge

Problem

Take data on immigration trends in .csv format from the input directory and produce two separate files: the top 10 states and the top 10 occupations by number of certified visa applications. The primary goals are scalability and reusability.

Running

Place data files in .csv format in the input directory.
Execute run.sh.
Inspect the output files in the output directory.

Previous output files will be overwritten to preserve idempotence.

There are no user-tunable parameters for now; any changes must be made in the code.

Approach

From the given 10-line example, we see that there's a header row.

Assume the order of columns is the same in all inputs.
Assume the separator is always a semicolon (;).
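
As an illustration of these assumptions, here is a minimal Python sketch of reading the header with a semicolon separator and recording where each column sits. The column names (CASE_STATUS, STATE, OCCUPATION) and the helper `column_indices` are hypothetical, chosen only for the example; the real input files may name their columns differently.

```python
import csv
import io


def column_indices(header_row, wanted):
    """Map each wanted column name to its position in the header row."""
    positions = {name: i for i, name in enumerate(header_row)}
    return {name: positions[name] for name in wanted}


# Hypothetical two-line sample; the column names are illustrative assumptions.
sample = io.StringIO("CASE_STATUS;STATE;OCCUPATION\nCERTIFIED;CA;ENGINEER\n")

# csv.reader handles quoted fields that themselves contain a semicolon,
# which a plain str.split(";") would break apart.
reader = csv.reader(sample, delimiter=";")
header = next(reader)
indices = column_indices(header, ["CASE_STATUS", "STATE"])
first_row = next(reader)
state = first_row[indices["STATE"]]
```

Looking columns up once from the header keeps the per-line work to a simple index access, which fits the assumption that the column order does not change within a file.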

Large files should be read efficiently, line by line. Since we only need to track the occupations and states, we can store their counts in memory: they're categorical, finite (they don't scale with file size) and small in number. Since we need their percentages of the total, we can update the running counts line by line. The procedural steps needed to get the required result:

Open and read file efficiently.
Check for certified applications.
Store counts for each state and occupation.
Once whole file is read, sort categorical stats and get top 10.
Print to output files.

Each of these should be separately testable.
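
The steps above could be sketched in Python roughly as follows, with one small function per step so that each can be tested on its own. This is a sketch under stated assumptions, not the repository's actual code: the column names (CASE_STATUS, STATE, OCCUPATION), the "CERTIFIED" status value, the alphabetical tie-break, and the output layout are all hypothetical.

```python
import csv
import io
from collections import Counter


def read_rows(lines, separator=";"):
    """Step 1: yield one dict per data row, lazily, so memory use stays flat."""
    yield from csv.DictReader(lines, delimiter=separator)


def is_certified(row, status_column="CASE_STATUS"):
    """Step 2: keep only certified applications (status value is an assumption)."""
    return row[status_column].strip().upper() == "CERTIFIED"


def count_certified(rows, key_columns):
    """Step 3: accumulate per-category counts and the certified total."""
    counts = {column: Counter() for column in key_columns}
    total = 0
    for row in rows:
        if not is_certified(row):
            continue
        total += 1
        for column in key_columns:
            counts[column][row[column]] += 1
    return counts, total


def top_n(counter, total, n=10):
    """Step 4: sort by count descending, then by name (assumed tie-break)."""
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:n]
    return [(name, count, 100.0 * count / total) for name, count in ranked]


def write_report(out, title, ranked, separator=";"):
    """Step 5: write a header line and one line per category."""
    out.write(separator.join([title, "COUNT", "PERCENTAGE"]) + "\n")
    for name, count, percentage in ranked:
        out.write(f"{name}{separator}{count}{separator}{percentage:.1f}%\n")


# Hypothetical sample data standing in for a file opened from the input directory.
sample = io.StringIO(
    "CASE_STATUS;STATE;OCCUPATION\n"
    "CERTIFIED;CA;ENGINEER\n"
    "DENIED;NY;ANALYST\n"
    "CERTIFIED;CA;ANALYST\n"
    "CERTIFIED;TX;ENGINEER\n"
)
counts, total = count_certified(read_rows(sample), ["STATE", "OCCUPATION"])
top_states = top_n(counts["STATE"], total)
top_occupations = top_n(counts["OCCUPATION"], total)

report = io.StringIO()
write_report(report, "TOP_STATES", top_states)
```

Because the functions take any iterable of lines and any writable object, they can be exercised with in-memory strings as above, or with real files opened in "w" mode so that earlier output is overwritten.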

Future work

Actual unit testing with custom, small and big data. Fix integration tests.
