cglamb/Go_CLA

Author: Charles Lamb
Contact Information: charlamb@gmail.com
Github address: https://github.com/cglamb/Go_CLA

Introduction

This project develops a command-line application, written in Go, that calculates descriptive statistics. The application imports a CSV file, calculates a set of statistics, and exports the results to a txt file. The statistics calculated are listed in the Statistics Calculated section below. The application was then benchmarked against Python and R scripts performing the same operation on the same dataset. For benchmarking purposes, the application calculates the descriptive statistics 100 times. (Users not interested in benchmarking can avoid the repeated runs by modifying the iterations parameter within main.go.)
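The import step described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in csv.go: the function name parseColumns, the assumption that every column is numeric and that the first row is a header, and the sample data in main are all hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseColumns reads CSV text whose first row is a header and returns the
// header names plus one []float64 per column.
// Hypothetical helper for illustration; it assumes every field is numeric.
func parseColumns(text string) ([]string, [][]float64, error) {
	records, err := csv.NewReader(strings.NewReader(text)).ReadAll()
	if err != nil {
		return nil, nil, err
	}
	if len(records) == 0 {
		return nil, nil, fmt.Errorf("empty CSV input")
	}
	headers := records[0]
	cols := make([][]float64, len(headers))
	// csv.Reader rejects rows whose field count differs from the first row,
	// so indexing cols by field position is safe here.
	for _, row := range records[1:] {
		for j, field := range row {
			v, err := strconv.ParseFloat(strings.TrimSpace(field), 64)
			if err != nil {
				return nil, nil, fmt.Errorf("column %q: %w", headers[j], err)
			}
			cols[j] = append(cols[j], v)
		}
	}
	return headers, cols, nil
}

func main() {
	// Made-up sample data, not taken from housesInput.csv.
	headers, cols, err := parseColumns("a,b\n1,2\n3,4\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, h := range headers {
		fmt.Println(h, cols[i])
	}
}
```

Storing the data column by column keeps each statistic a simple pass over one slice.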

Results - Consistency

Computed descriptive statistics were compared to results from R and Python and found to be consistent.

Results - Benchmarking

The application developed in Go executed slower than the Python script but faster than the R script. All three were executed on the same computer with only essential applications running. Python executed in 1.798 seconds, Go in 2.349 seconds, and R in 2.643 seconds. Logs from the benchmarking are provided in the /logs directory of the GitHub repository.

For benchmarking purposes, all three applications computed the descriptive statistics 100 times and wrote the results to a txt file. The input data for all three applications was housesInput.csv in the /testdata folder of the GitHub repository.
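A repeated-run timing of this kind can be sketched as below. This is a minimal illustration, not the benchmarking code in main.go or main_test.go: timeIterations and the placeholder workload are hypothetical names, and the real application would run its read-compute-write cycle where the placeholder is.

```go
package main

import (
	"fmt"
	"time"
)

// timeIterations calls work n times and returns the total wall-clock time.
// Hypothetical helper for illustration; not taken from the repository.
func timeIterations(n int, work func()) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		work()
	}
	return time.Since(start)
}

func main() {
	calls := 0
	// Placeholder workload; the real program would compute the statistics here.
	elapsed := timeIterations(100, func() { calls++ })
	fmt.Printf("ran %d iterations in %s\n", calls, elapsed)
}
```

Timing the whole loop, rather than a single run, reduces the influence of one-off startup costs on the comparison.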

Recommendation

The author perceives two relevant strengths of this Go application versus similar scripts written in R or Python.
(1) Once written, the application is easily deployed as a small executable. For non-technical end users who may not have R or Python IDEs, this is a significant advantage. While Python and R scripts may also be deployed as executables, the process is more cumbersome and generally requires specific tools (for example, auto-py-to-exe for Python).
(2) Because this statistics calculation is typically performed in Python using the pandas library, a Python-based executable would need a number of dependencies packaged with it and would therefore be larger. Hence the small size of the Go executable (2,103 KB) is also an advantage.

Other Information

Data
California housing data from Miller, Thomas W. 2015. Modeling Techniques in Predictive Analytics with Python and R: A Guide to Data Science. Upper Saddle River, NJ: Pearson Education. [ISBN-13: 978-0-13-389206-2]

Running the Command Line Application
If the terminal's current directory is the directory containing the executable, the program can be run from the command line with the following command: ./colStats_v4 -out_location output.txt -input_file testdata/housesInput.csv
The -out_location flag sets the location and filename of the output txt file; if it is omitted, the output is written to output.txt in the current directory. The -input_file flag sets the location and name of the CSV file to read; if it is omitted, the program reads housesInput.csv from a testdata folder in the current directory.
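Flags with these names and defaults could be declared with Go's standard flag package roughly as follows. This is a sketch under assumptions, not the code in main.go: the config type, the parseArgs function, and the flag-set name are hypothetical; only the flag names and default values come from the description above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// config holds the two command-line options described above.
// Hypothetical type for illustration.
type config struct {
	outLocation string
	inputFile   string
}

// parseArgs parses args (without the program name) into a config,
// falling back to the documented defaults when a flag is omitted.
func parseArgs(args []string) (config, error) {
	var cfg config
	fs := flag.NewFlagSet("colStats", flag.ContinueOnError)
	fs.StringVar(&cfg.outLocation, "out_location", "output.txt", "location and filename of the output txt file")
	fs.StringVar(&cfg.inputFile, "input_file", "testdata/housesInput.csv", "location and name of the input CSV file")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("input:", cfg.inputFile, "output:", cfg.outLocation)
}
```

Parsing into a dedicated flag set, instead of the package-level flags, makes the defaults easy to check from a test.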

Explanation of files

colStats_v4.exe: Program executable scripted in Golang
csv.go: Go script containing functions relevant to manipulating the csv
csv_test.go: Validates operation of the csv.go functions
errors.go: Builds errors used in the rest of the golang libraries
main.go: Go script. Contains func main()
main_test.go: Contains testing and benchmarking functions for main.go
stats.go: Contains the functions used to calculate statistics
stats_test.go: Validation tests for stats.go
output.txt: Descriptive statistics generated by colStats_v4.exe run against /testdata/housesInput.csv
/Comparable_Scripts/
runHouses.R: Identical operation performed in R
runHouses.py: Identical operation performed in Python
/logs/
benchmarkGo_log.txt: Go benchmarking log
bechmarkPy_log.txt: Python benchmarking log
benchmarkR_log.txt: R benchmarking log
main_log.txt: Log of the executable being run from the command line
test_log.txt: Log of the Go tests

Compiling Instructions

The executable can be compiled with Go by running the go build command in a terminal.

Statistics Calculated

Count, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile, and 75th percentile are calculated. Standard deviation is calculated as the sample standard deviation and thus divides by (n-1). While the 25th, 50th, and 75th percentiles are specifically calculated, the code contains a Percentile(data,nth) function that will calculate any nth percentile.
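The two less obvious calculations can be sketched as follows. This is an illustration, not the code in stats.go: the function names are hypothetical, and the percentile method shown (linear interpolation between the two nearest sorted values, at rank p/100 * (n-1)) is an assumption, since the interpolation rule used by the repository's Percentile function is not stated here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// sampleStdDev returns the sample standard deviation (divides by n-1).
// It returns NaN when fewer than two values are supplied.
func sampleStdDev(data []float64) float64 {
	n := len(data)
	if n < 2 {
		return math.NaN()
	}
	mean := 0.0
	for _, v := range data {
		mean += v
	}
	mean /= float64(n)
	ss := 0.0 // sum of squared deviations from the mean
	for _, v := range data {
		d := v - mean
		ss += d * d
	}
	return math.Sqrt(ss / float64(n-1))
}

// percentile returns the p-th percentile (0 to 100) of data.
// ASSUMPTION: linear interpolation between the nearest sorted values;
// the repository's Percentile function may use a different rule.
func percentile(data []float64, p float64) float64 {
	if len(data) == 0 || p < 0 || p > 100 {
		return math.NaN()
	}
	sorted := append([]float64(nil), data...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := p / 100 * float64(len(sorted)-1)
	lo := int(math.Floor(rank))
	hi := int(math.Ceil(rank))
	frac := rank - float64(lo)
	return sorted[lo] + frac*(sorted[hi]-sorted[lo])
}

func main() {
	data := []float64{1, 2, 3, 4, 5} // made-up sample values
	fmt.Println("std dev:", sampleStdDev(data))
	fmt.Println("quartiles:", percentile(data, 25), percentile(data, 50), percentile(data, 75))
}
```

For the values 1 through 5, this interpolation rule gives quartiles of 2, 3, and 4. Other percentile definitions can give slightly different results on the same data, so cross-language comparisons should use matching rules.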

Hardware
All files were executed locally on the same hardware: a 13th Gen Intel(R) Core(TM) i5-13400F, 2500 MHz, 10 Core(s), 16 Logical Processor(s), with 16.0 GB of physical memory.

Environments
The Python code was executed in Spyder (Python 3.10.9 64-bit | Qt 5.15.2 | PyQt5 5.15.10 | Windows 10).
The R script was executed in RStudio (Version 1.1.456).
The Go code was executed in Visual Studio Code (Version 1.85.1).
