InsightDataScience_DataEngineering_CodingChallenge
Problem
Take data on immigration trends in .csv format from the input directory and produce the top 10 states and the top 10 occupations for certified visa applications, each in a separate file. The primary goals are scalability and reusability.
Running
- Place data files in .csv format in input directory.
- Execute run.sh.
- Peruse output files in output directory.
Previous output files will be overwritten to preserve idempotence.
There are no user-tunable parameters for now; any changes must be made in the code.
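As a small illustration of the overwrite behaviour described above, the sketch below opens the output file in write mode, which truncates any previous content, so running it twice leaves the same result. The function name write_output and the file name top_10_states.txt are hypothetical and used only for this example; the actual script may handle this differently.

```python
import os
import tempfile


def write_output(path, lines):
    """Write lines to path, replacing any previous content."""
    # Mode "w" truncates an existing file, so reruns never append to old results.
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


# Hypothetical file name inside a temporary directory, used only for this demo.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "top_10_states.txt")

write_output(demo_path, ["CA;2"])
write_output(demo_path, ["CA;2"])  # second run overwrites instead of appending

with open(demo_path, encoding="utf-8") as f:
    result = f.read()
```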
Approach
From the given 10-line example, we see that each file starts with a header row.
- Assume order of elements is same in all inputs
- Assume separator is always ;
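A minimal sketch of how these assumptions might translate into code: read the header row once, split it on ;, and keep a name-to-position map for the rest of the file. The column names in the sample (CASE_STATUS, SOC_NAME, WORKSITE_STATE) are hypothetical placeholders, not taken from the actual data. Since the column order is assumed fixed, hard-coded indices would work equally well.

```python
import csv
import io

SEPARATOR = ";"  # assumption from the list above: the separator is always ';'


def read_header(lines):
    """Return a {column_name: index} map built from the first row of `lines`."""
    reader = csv.reader(lines, delimiter=SEPARATOR)
    header = next(reader)
    return {name.strip(): index for index, name in enumerate(header)}


# Hypothetical column names, used only to make the example runnable.
sample = io.StringIO("CASE_STATUS;SOC_NAME;WORKSITE_STATE\nCERTIFIED;ANALYST;CA\n")
columns = read_header(sample)
```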
The approach relies on efficient line-by-line reading of large files.
Since we need to track the occupations and states, we can store their details in memory: they're categorical, finite (don't scale with file size) and small in number. Since we need their percentages of the total, we can accumulate the counts and the running total line by line.
The procedural functions to be performed to get the required result:
- Open and read file efficiently.
- Check for certified applications
- Store counts for each state, occupation.
- Once whole file is read, sort categorical stats and get top 10.
- Print to output files.
Each of these should be separately testable.
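The steps above could be sketched as one small function per step, so that each can be tested in isolation. This is an illustrative outline under stated assumptions, not the repository's actual code: the column names, the function names, the tie-breaking rule (alphabetical by name), and the output line layout are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; replace them with the names in the real header.
STATUS_FIELD = "CASE_STATUS"
GROUP_FIELDS = ("WORKSITE_STATE", "SOC_NAME")


def iter_rows(lines, separator=";"):
    """Yield one dict per data row, lazily; the header row supplies the keys."""
    yield from csv.DictReader(lines, delimiter=separator)


def is_certified(row):
    """True when the row's status field equals CERTIFIED (case-insensitive)."""
    return (row.get(STATUS_FIELD) or "").strip().upper() == "CERTIFIED"


def count_certified(rows, fields=GROUP_FIELDS):
    """Return ({field: Counter}, total), counting certified rows only."""
    counts = {field: Counter() for field in fields}
    total = 0
    for row in rows:
        if not is_certified(row):
            continue
        total += 1
        for field in fields:
            counts[field][(row.get(field) or "").strip()] += 1
    return counts, total


def top_n(counter, total, n=10):
    """Sort by count descending, then by name (an assumed tie-break), keep n."""
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:n]
    return [(name, count, round(100.0 * count / total, 1)) for name, count in ranked]


def write_report(out, rows, separator=";"):
    """Write one 'name;count;percentage%' line per entry (hypothetical layout)."""
    for name, count, percent in rows:
        out.write(f"{name}{separator}{count}{separator}{percent}%\n")


# Tiny in-memory input standing in for a file opened from the input directory.
sample = io.StringIO(
    "CASE_STATUS;SOC_NAME;WORKSITE_STATE\n"
    "CERTIFIED;ANALYST;CA\n"
    "DENIED;ANALYST;NY\n"
    "CERTIFIED;ENGINEER;CA\n"
    "CERTIFIED;ANALYST;TX\n"
)
counts, total = count_certified(iter_rows(sample))
top_states = top_n(counts["WORKSITE_STATE"], total)
top_occupations = top_n(counts["SOC_NAME"], total)

report = io.StringIO()
write_report(report, top_states)
```

Because iter_rows is a generator, only one row is held in memory at a time; the only state that grows is the set of distinct states and occupations, which matches the reasoning above about categorical data.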
Future work
Actual unit testing with custom, small and big data. Fix integration tests.