justinshapiro / csci3415-assignment2

A Java program that gathers the minimum, maximum and mean distance between the nearest neighbor of all stars in the HYG Database, imported from a CSV file.

Programming Assignment 2

Code by Justin Shapiro

The task for the second programming assignment was to write a Java program that works with the HYG Database. Among the files provided from the HYG Database was hygxyz.csv, which contains various data about well-known stars, including their x, y, and z distances from our sun (“Sol”). The task was to read in this file, ignoring stars whose distances are not accurately known (marked by a value of 10000000 in the row’s Distance field). Using the data read from the CSV file, we were to compute the distance from each star to its nearest neighbor and gather statistics on those distances: the minimum, maximum, and mean.

The design of a program that accomplishes this task is surprisingly straightforward. The program achieves the desired results by calling only two methods from its main method: getStarData and getStarStats.

The getStarData method uses Java’s built-in BufferedReader class to read text efficiently (in our case, line by line) by buffering characters from the input stream created by FileReader. Using BufferedReader’s readLine() method, the program iterates through the CSV file one line at a time, processing the data of each line along the way. Given the simplicity of a CSV file (a file of “comma-separated values”), an array can easily be created from each line using a regular expression (in this case, the regex is a single comma), and no third-party CSV-specific libraries are necessary. The split method, which takes a regex argument as a string, produces an array in which element 9 contains the Distance data and elements 17, 18, and 19 contain the X, Y, and Z data, respectively. Once a line of hygxyz.csv has been split this way, elements 17, 18, and 19 are each converted to a Double and stored in a global, two-dimensional ArrayList called Stars. This process is repeated for each star whose Distance is not equal to 10000000. Once the numerical data from hygxyz.csv has been processed and stored in the ArrayList, getStarData returns the number of stars it has processed, which is equal to the size of the global ArrayList.
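The reading step described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the class name `StarDataSketch`, the methods `readStars` and `sampleRow`, and the use of a `List<double[]>` instead of the global two-dimensional ArrayList are all assumptions made for the example, as is the idea that the first line of the file is a header row. The column positions (9, 17, 18, 19) and the 10000000 sentinel come from the description above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StarDataSketch {
    // Column positions taken from the description above.
    static final int DISTANCE_COL = 9;
    static final int X_COL = 17;
    static final int Y_COL = 18;
    static final int Z_COL = 19;
    // Sentinel marking a star whose distance is not accurately known.
    static final double UNKNOWN_DISTANCE = 10000000.0;

    /** Reads CSV rows and returns {x, y, z} for every star with a known distance. */
    static List<double[]> readStars(Reader source) throws IOException {
        List<double[]> stars = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // assumption: the first line is a header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
                if (fields.length <= Z_COL) {
                    continue; // skip rows that are too short to hold X, Y, Z
                }
                if (Double.parseDouble(fields[DISTANCE_COL]) == UNKNOWN_DISTANCE) {
                    continue; // distance not accurately known
                }
                stars.add(new double[] {
                    Double.parseDouble(fields[X_COL]),
                    Double.parseDouble(fields[Y_COL]),
                    Double.parseDouble(fields[Z_COL])
                });
            }
        }
        return stars;
    }

    /** Builds a 20-column row of made-up data, used only by the demo below. */
    static String sampleRow(double distance, double x, double y, double z) {
        String[] fields = new String[Z_COL + 1];
        Arrays.fill(fields, "0");
        fields[DISTANCE_COL] = String.valueOf(distance);
        fields[X_COL] = String.valueOf(x);
        fields[Y_COL] = String.valueOf(y);
        fields[Z_COL] = String.valueOf(z);
        return String.join(",", fields);
    }

    public static void main(String[] args) throws IOException {
        // Two made-up rows: one with a known distance, one with the sentinel.
        String csv = "header line\n"
            + sampleRow(4.2, 1.0, 2.0, 3.0) + "\n"
            + sampleRow(10000000, 9.0, 9.0, 9.0) + "\n";
        List<double[]> stars = readStars(new StringReader(csv));
        System.out.println("Stars kept: " + stars.size());
    }
}
```

To read the real file instead of the in-memory sample, one would pass something like `new FileReader("hygxyz.csv")` to `readStars`, with the path adjusted to wherever the file lives. Splitting on a bare comma is only safe when no field contains a quoted comma, which the write-up implies is the case for this file.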

The getStarStats method is the one that actually accomplishes the task given by the problem statement, but it relies on getStarData to set things up. getStarData provides a convenient ArrayList that holds only the necessary star data (X, Y, and Z) in a format that can be accessed programmatically. Because the ArrayList holds each star’s position in three dimensions, the distance between two stars must be computed using the three-dimensional variant of the distance formula:

Distance formula for three dimensions:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

A separate auxiliary method, computeDistance, is used by getStarStats specifically to compute the distance between two stars. Each nearest-neighbor distance found this way is folded into the mean and checked to see whether it qualifies as the new minimum or maximum. The final minimum, maximum, and mean distances are returned to the main method as an ArrayList of three elements for printing. When the code is run, the following result is produced:
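As a rough illustration of this step, the sketch below computes nearest-neighbor statistics with a brute-force comparison of every pair of stars. The write-up does not say which search strategy the program uses, so the double loop is an assumption, as are the names `StarStatsSketch` and `nearestNeighborStats` and the use of a `double[]` result in place of the three-element ArrayList. Only the name `computeDistance` comes from the text. The sketch squares with plain multiplication, whereas the original uses `Math.pow`.

```java
import java.util.Arrays;
import java.util.List;

class StarStatsSketch {
    /** Euclidean distance between two points given as {x, y, z}. */
    static double computeDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Returns {min, max, mean} of the nearest-neighbor distances. */
    static double[] nearestNeighborStats(List<double[]> stars) {
        if (stars.size() < 2) {
            throw new IllegalArgumentException("need at least two stars");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (int i = 0; i < stars.size(); i++) {
            // Find the closest other star to star i (brute force, O(n^2) overall).
            double nearest = Double.POSITIVE_INFINITY;
            for (int j = 0; j < stars.size(); j++) {
                if (i != j) {
                    nearest = Math.min(nearest, computeDistance(stars.get(i), stars.get(j)));
                }
            }
            min = Math.min(min, nearest);
            max = Math.max(max, nearest);
            sum += nearest;
        }
        return new double[] {min, max, sum / stars.size()};
    }

    public static void main(String[] args) {
        // Three made-up points, not real HYG data.
        List<double[]> stars = Arrays.asList(
            new double[] {0, 0, 0},
            new double[] {3, 4, 0},
            new double[] {3, 4, 12});
        double[] stats = nearestNeighborStats(stars);
        System.out.println("min=" + stats[0] + " max=" + stats[1] + " mean=" + stats[2]);
    }
}
```

With the three sample points, the nearest-neighbor distances are 5, 5, and 12, so the minimum is 5, the maximum is 12, and the mean is 22/3. For a data set the size of the HYG Database, the quadratic loop is slow but simple; a spatial index would be faster at the cost of more code.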

[Screenshot: result of computing all nearest neighbor distances from hygxyz.csv]

The result produced (as pictured above) is almost identical to the example output provided by Dr. Williams. The only difference is in the last three digits of the mean. The mean provided by Dr. Williams is 43.352289954998135, while the mean my program produced is 43.352289954998085. Notice that the last three digits of Dr. Williams’s output (“135”) differ from those of my output (“085”). Assuming that Dr. Williams used exactly the same data set as I did, this slight difference in the mean may be due to the libraries used to perform the mathematical operations required for the distance computation (exponentiation and a square root). I used the sqrt and pow functions of the built-in java.lang.Math class. If Dr. Williams used a different math library, that could explain the difference in the mean. Other possible causes include the use of different JDK and JRE versions.
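One further possibility, not raised in the original analysis and offered here only as a hedged suggestion, is the order in which the distances are added up. Floating-point addition is not associative, so two programs that sum the same values in a different order can disagree in the last few digits of the total, and therefore of the mean. The small sketch below demonstrates the effect; the class and method names are hypothetical and the three values are arbitrary, not taken from the HYG data.

```java
class SummationOrderSketch {
    /** Adds the values from first to last. */
    static double sumForward(double[] values) {
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    /** Adds the same values from last to first. */
    static double sumBackward(double[] values) {
        double sum = 0.0;
        for (int i = values.length - 1; i >= 0; i--) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] values = {0.1, 0.2, 0.3}; // arbitrary illustrative values
        double forward = sumForward(values);
        double backward = sumBackward(values);
        System.out.println("forward  = " + forward);
        System.out.println("backward = " + backward);
        System.out.println("equal?     " + (forward == backward));
    }
}
```

The two sums are not equal even though the inputs are identical. Whether this is what actually happened between the two programs cannot be determined from the write-up alone.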

Despite the small difference in the output, I am very satisfied with how this program turned out, and I am also satisfied with the process of writing it. I have written in Java many times in the past, and this assignment has only further proven that Java is the most convenient programming language I have ever used. One example of this convenience is its input stream handling. Reading and processing a file in languages like C and C++ was always confusing to me, harder to do, and more time-consuming. With Java, all I needed to do was pass the location of the CSV file to a FileReader constructor and wrap the result in a BufferedReader, which let me use methods such as readLine to work through the file input stream. I am also very satisfied with creating arrays using split with a regex argument. I have programmed extensively in Perl (which is often used for array manipulation of this kind), and I can confidently say that producing an array from a line of a CSV file is easiest in Java.

Overall, this assignment has demonstrated that Java is an extremely powerful general-purpose language that is very convenient for the programmer to use because of its high level of writability. Java’s fully object-oriented approach provides high-level abstraction that, in most cases, does not lead to unexpected results, as shown by the minimal amount of debugging I had to do. Although I have programmed in Java extensively in the past, I had never used it for data processing until this assignment. Whenever I need to do data processing that requires building arrays out of data files, I use Perl. However, Perl is less writable when it comes to mathematical computations and comparisons. While most programming languages are stronger in some areas than others, Java seems to be a good choice for almost any programming task.
