varadarajan77 / DSChallengeTA

Requirements:

Files included:

  • report.ipynb: IPython notebook with the detailed analysis
  • solution.py: Python script that uses PySpark to process the dataset faster
  • solution_without_pyspark.py: Python script that uses only pandas to process the dataset
  • histogram_plot_unique_ips.png: PNG file showing the histogram of unique hashed IPs (provided for viewing when the cufflinks and plotly libraries are not installed)
  • histogram_plot_number_of_evenets.png: PNG file showing the histogram of the number of events generated by users (provided for viewing when the cufflinks and plotly libraries are not installed)
  • output.csv: CSV file containing the output of the desired features

Files not included:

  • logs.csv: not pushed to GitHub because of its size. Clone this repository and copy logs.csv into the same folder before running either command to generate the output file.

Commands to generate the output file (run either one):

  • ./solution.py logs.csv > output.csv (faster, because it uses PySpark)
  • ./solution_without_pyspark.py logs.csv > output.csv (slower, because it uses only pandas)
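
The choice between the two commands above can also be made automatically. The following is only a minimal sketch, not part of this repository: the helper names `choose_script` and `generate_output` are hypothetical, and it assumes both scripts take the log file as their only argument and write the result to standard output, as the commands above suggest.

```python
import importlib.util
import subprocess
import sys
from pathlib import Path


def choose_script(has_pyspark=None):
    """Return the script to run, preferring the PySpark version when available."""
    if has_pyspark is None:
        # find_spec returns None when the package cannot be imported
        has_pyspark = importlib.util.find_spec("pyspark") is not None
    return "solution.py" if has_pyspark else "solution_without_pyspark.py"


def generate_output(log_file="logs.csv", out_file="output.csv"):
    """Run the chosen script and redirect its standard output to out_file."""
    if not Path(log_file).is_file():
        raise FileNotFoundError(
            f"{log_file} not found; copy it into this folder first"
        )
    script = choose_script()
    with open(out_file, "w") as out:
        # Running through sys.executable does not rely on the script's shebang
        subprocess.run([sys.executable, script, log_file], stdout=out, check=True)
    return script

# Example (hypothetical): generate_output() writes output.csv from logs.csv
```

Because the script is started with the current interpreter, this route works even if the shebang in the *.py files has not been adjusted.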

Note: Change the shebang line of the *.py files to point to the python3 installation on your machine.
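
The shebang change can be done by hand in any editor. As an illustration only, here is a small sketch that rewrites the first line of a script; the helper name `set_shebang` is hypothetical and not part of this repository, and it assumes the scripts are UTF-8 text files.

```python
import sys
from pathlib import Path


def set_shebang(script_path, interpreter=sys.executable):
    """Replace (or add) the first-line shebang of script_path."""
    path = Path(script_path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    shebang = f"#!{interpreter}\n"
    if lines and lines[0].startswith("#!"):
        lines[0] = shebang        # overwrite the existing shebang
    else:
        lines.insert(0, shebang)  # the script had none, so add one
    path.write_text("".join(lines), encoding="utf-8")
    return shebang

# Example (hypothetical): point both scripts at the running interpreter
# for name in ("solution.py", "solution_without_pyspark.py"):
#     set_shebang(name)
```

Alternatively, running `python3 solution.py logs.csv > output.csv` bypasses the shebang entirely, since the interpreter is named on the command line.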
