EdwardRaff / JSAT

Java Statistical Analysis Tool, a Java library for Machine Learning


Guava Table<Integer, String, String> to JSAT DataSet

salamanders opened this issue · comments

I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.

  1. Parsing the Strings into Longs, Doubles, or Strings
  2. Finding the "worst" type for each column and normalizing across the column
  3. Making lookup tables for each column that needs one (a small number of ints, or Strings)
  4. Generating a DataSet based on the output column name
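Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration (the class and method names are made up, not from the gist): parse each cell to the narrowest type that accepts it, then widen to the "worst" type any cell in the column needs.

```java
import java.util.*;

// Hypothetical sketch of steps 1-2: parse each cell to the narrowest type
// (Long, then Double, then String), and take a column's type as the widest
// type any of its cells requires.
public class ColumnTypeSketch {
    enum CellType { LONG, DOUBLE, STRING } // ordered narrowest to widest

    // Narrowest type that can hold this one cell.
    static CellType narrowestType(String cell) {
        try { Long.parseLong(cell); return CellType.LONG; }
        catch (NumberFormatException ignored) { }
        try { Double.parseDouble(cell); return CellType.DOUBLE; }
        catch (NumberFormatException ignored) { }
        return CellType.STRING;
    }

    // The "worst" type for a column is the widest needed by any cell.
    static CellType columnType(Collection<String> column) {
        CellType worst = CellType.LONG;
        for (String cell : column) {
            CellType t = narrowestType(cell);
            if (t.ordinal() > worst.ordinal())
                worst = t;
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println(columnType(List.of("1", "2", "3")));     // LONG
        System.out.println(columnType(List.of("1", "2.5", "3")));   // DOUBLE
        System.out.println(columnType(List.of("1", "2.5", "red"))); // STRING
    }
}
```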

Is there an easier way to do this?
Can it be part of the library?

class TableDataLoader

  • TableDataLoader(Table<Long, String, String>)
  • getDataSet(String)
  • tableToDataSet_Classification(ColumnInfo, List, SortedSet, int, int)
  • tableToDataSet_Regression(ColumnInfo, List, SortedSet, int, int)

class ColumnInfo

  • ColumnInfo(String, Map<Long, String>)
  • collectionToSortedUniqueStringList(Collection)
  • parseColumn(Map<Long, String>)
  • parseToLowestObject(String, Class<?>)
  • constructJSATCategoricalData()
  • constructLabelLookups()
  • getCategoricalData()
  • getName()
  • getType()
  • isLookup()
  • getRowValue(Number)
  • getKeyFromLookupId(int)
  • getAllRowKeys()

I'm a little confused here. What is the Guava Table object representing with strings and 3 generic types? The super lazy thing would be to convert your table to a CSV and then use the CSV reader.... though I feel a little dirty just typing that out.

The CSV parser code isn't necessarily the best to read for understanding how to do something. That code (and the LIBSVM parser) is written to have a low GC impact by using a small state machine. This was done because for work I have some 100GB-500GB CSV and LIBSVM files that will fail with a JVM GC overhead exception if implemented any other way.
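The state-machine idea is roughly this: walk the characters once, tracking a small state, and reuse one buffer instead of allocating via String.split. A toy sketch of the technique (this is not JSAT's actual parser, just an illustration, and it only handles a single line with simple quoting):

```java
import java.util.*;

// Toy state-machine CSV line splitter that reuses one StringBuilder across
// fields rather than allocating intermediate arrays -- an illustration of
// the low-allocation idea, not JSAT's actual parser.
public class CsvStateMachine {
    private enum State { FIELD, QUOTED, QUOTE_SEEN }

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder buf = new StringBuilder(); // reused for every field
        State state = State.FIELD;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD:
                    if (c == ',') { fields.add(buf.toString()); buf.setLength(0); }
                    else if (c == '"' && buf.length() == 0) state = State.QUOTED;
                    else buf.append(c);
                    break;
                case QUOTED:
                    if (c == '"') state = State.QUOTE_SEEN;
                    else buf.append(c);
                    break;
                case QUOTE_SEEN: // either an escaped quote ("") or end of quoting
                    if (c == '"') { buf.append('"'); state = State.QUOTED; }
                    else if (c == ',') { fields.add(buf.toString()); buf.setLength(0); state = State.FIELD; }
                    break;
            }
        }
        fields.add(buf.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("a,\"b,c\",d")); // [a, b,c, d]
    }
}
```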

I picked Table<row:Integer, columnName:String, cellValue:String> because it represents pretty much any tabular data structure read from disk or from a form post -- as long as it is small enough!

I think Table<row:Integer, columnName:String, cellValue:**(Long or Double or String)**> is the way to go, because if the entire column's values are all Longs, all Doubles, or all Strings, it maps well to how you need to transform the column (or, if the column is the target, to deciding whether it is a Classification or Regression problem).

The code I wrote tries to parse the String values to longs or doubles, decides on the "worst" type per column and makes sure they are all the same type, then builds up a lookup table for the columns that need it, and outputs it all into a DataSet. But ya... I think if I made better use of the various data row/point constructors, it could be half as much code.

Hmm, do you really need Long as an option? For all but the largest values a double can store them losslessly. JSAT is going to save it as a double in the end anyway. That would simplify your code too.
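The claim above is easy to check: every integer with magnitude up to 2^53 round-trips through a double exactly, because an IEEE 754 double has a 53-bit significand; one past that, precision is lost.

```java
public class DoublePrecision {
    public static void main(String[] args) {
        long max = 1L << 53; // 9007199254740992

        // Integers with magnitude <= 2^53 survive the round trip losslessly:
        System.out.println((long) (double) max == max);           // true
        System.out.println((long) (double) (max - 1) == max - 1); // true

        // One past 2^53, the nearest representable double is 2^53 itself,
        // so two distinct longs collapse to the same double:
        System.out.println((double) (max + 1) == (double) max);   // true
    }
}
```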

How many use cases would an abstract class for this be helpful for? Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

I was using Long (or Integer would be fine) for class lookup columns, with some threshold like "if the column is a String, or an Int with fewer than 20 unique values, then treat it as a catFeats column."
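That heuristic could look something like the following sketch (the class name, the exact threshold of 20, and the treatment of real-valued columns as always numeric are all assumptions for illustration):

```java
import java.util.*;

// Sketch of the heuristic described above: a column becomes a categorical
// feature if it holds Strings, or integers with fewer than 20 distinct
// values; real-valued columns stay numeric. Names/threshold are made up.
public class CategoricalHeuristic {
    static final int MAX_CATEGORIES = 20; // assumed threshold

    static boolean isInteger(String s) {
        try { Long.parseLong(s); return true; }
        catch (NumberFormatException e) { return false; }
    }

    static boolean isNumeric(String s) {
        try { Double.parseDouble(s); return true; }
        catch (NumberFormatException e) { return false; }
    }

    static boolean isCategorical(Collection<String> column) {
        if (column.stream().allMatch(CategoricalHeuristic::isInteger))
            return new HashSet<>(column).size() < MAX_CATEGORIES; // small int domain
        if (column.stream().allMatch(CategoricalHeuristic::isNumeric))
            return false; // real-valued -> numeric feature
        return true;      // Strings -> categorical
    }

    public static void main(String[] args) {
        System.out.println(isCategorical(List.of("red", "green", "red"))); // true
        System.out.println(isCategorical(List.of("1.5", "2.7", "3.0")));   // false
        System.out.println(isCategorical(List.of("1", "2", "1")));         // true
    }
}
```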

Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

Exactly.

Here is the code. I warned you - ugly. But maybe I could hack half of it out using better constructors?

https://gist.github.com/salamanders/cd42f99b8483e8d0d89f6edfa5b43a10

tableToDataSet_Classification and tableToDataSet_Regression have the interesting code, the rest is support material.

Fixed a bug in the number of cats and simplified the "int, double, or string" logic — now working pretty fast!