EdwardRaff / JSAT

Java Statistical Analysis Tool, a Java library for Machine Learning


Guava Table<Integer, String, String> to JSAT DataSet

salamanders opened this issue · comments

I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.

  1. Parsing the Strings into Longs, Doubles, or Strings
  2. Finding the "worst" type for each column and normalizing across the column
  3. Making lookup tables for each column that needs one (a small number of ints, or Strings)
  4. Generating a DataSet based on the output column name
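Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration (the class and method names are made up, not from the gist): parse each cell to the narrowest type that accepts it, then widen to the "worst" type any cell in the column needs.

```java
import java.util.*;

// Hypothetical sketch of steps 1-2: parse each cell to the narrowest type
// (Long, then Double, then String), and take a column's type as the widest
// type any of its cells requires.
public class ColumnTypeSketch {
    enum CellType { LONG, DOUBLE, STRING } // ordered narrowest to widest

    // Narrowest type that can hold this one cell.
    static CellType narrowestType(String cell) {
        try { Long.parseLong(cell); return CellType.LONG; }
        catch (NumberFormatException ignored) { }
        try { Double.parseDouble(cell); return CellType.DOUBLE; }
        catch (NumberFormatException ignored) { }
        return CellType.STRING;
    }

    // The "worst" type for a column is the widest needed by any cell.
    static CellType columnType(Collection<String> column) {
        CellType worst = CellType.LONG;
        for (String cell : column) {
            CellType t = narrowestType(cell);
            if (t.ordinal() > worst.ordinal())
                worst = t;
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println(columnType(List.of("1", "2", "3")));     // LONG
        System.out.println(columnType(List.of("1", "2.5", "3")));   // DOUBLE
        System.out.println(columnType(List.of("1", "2.5", "red"))); // STRING
    }
}
```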

Is there an easier way to do this?
Can it be part of the library?

class TableDataLoader

  • TableDataLoader(Table<Long, String, String>)
  • getDataSet(String)
  • tableToDataSet_Classification(ColumnInfo, List, SortedSet, int, int)
  • tableToDataSet_Regression(ColumnInfo, List, SortedSet, int, int)

class ColumnInfo

  • ColumnInfo(String, Map<Long, String>)
  • collectionToSortedUniqueStringList(Collection)
  • parseColumn(Map<Long, String>)
  • parseToLowestObject(String, Class<?>)
  • constructJSATCategoricalData()
  • constructLabelLookups()
  • getCategoricalData()
  • getName()
  • getType()
  • isLookup()
  • getRowValue(Number)
  • getKeyFromLookupId(int)
  • getAllRowKeys()

I'm a little confused here. What is the Guava Table object representing with strings and 3 generic types? The super lazy thing would be to convert your table to a CSV and then use the CSV reader.... though I feel a little dirty just typing that out.

The CSV parser code isn't necessarily the best to read for understanding how to do something. That code (and the LIBSVM parser) is written to have a low GC impact by using a small state machine. This was done because for work I have some 100GB-500GB CSV and LIBSVM files that will fail with a JVM GC overhead exception if implemented any other way.
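The state-machine idea is roughly this: walk the characters once, tracking a small state, and reuse one buffer instead of allocating via String.split. A toy sketch of the technique (this is not JSAT's actual parser, just an illustration, and it only handles a single line with simple quoting):

```java
import java.util.*;

// Toy state-machine CSV line splitter that reuses one StringBuilder across
// fields rather than allocating intermediate arrays -- an illustration of
// the low-allocation idea, not JSAT's actual parser.
public class CsvStateMachine {
    private enum State { FIELD, QUOTED, QUOTE_SEEN }

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder buf = new StringBuilder(); // reused for every field
        State state = State.FIELD;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD:
                    if (c == ',') { fields.add(buf.toString()); buf.setLength(0); }
                    else if (c == '"' && buf.length() == 0) state = State.QUOTED;
                    else buf.append(c);
                    break;
                case QUOTED:
                    if (c == '"') state = State.QUOTE_SEEN;
                    else buf.append(c);
                    break;
                case QUOTE_SEEN: // either an escaped quote ("") or end of quoting
                    if (c == '"') { buf.append('"'); state = State.QUOTED; }
                    else if (c == ',') { fields.add(buf.toString()); buf.setLength(0); state = State.FIELD; }
                    break;
            }
        }
        fields.add(buf.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("a,\"b,c\",d")); // [a, b,c, d]
    }
}
```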

I picked Table<row:Integer, columnName:String, cellValue:String> because it represents pretty much any tabular data structure read from disk or from a form post -- as long as it is small enough!

I think Table<row:Integer, columnName:String, cellValue:**(Long or Double or String)**> is the way to go, because if the entire column's values are all Longs, all Doubles, or all Strings, it maps well to how you need to transform the column (or, if the column is the target, to deciding whether it is a Classification or Regression problem).

The code I wrote tries to parse the String values to longs or doubles, decides on the "worst" type per column and makes sure they are all the same type, then builds up a lookup table for the columns that need it, and outputs it all into a DataSet. But ya... I think if I made better use of the various data row/point constructors, it could be half as much code.

Hmm, do you really need Long as an option? For all but the largest values a double can store them losslessly. JSAT is going to save it as a double in the end anyway. That would simplify your code too.
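The claim above is easy to check: every integer with magnitude up to 2^53 round-trips through a double exactly, because an IEEE 754 double has a 53-bit significand; one past that, precision is lost.

```java
public class DoublePrecision {
    public static void main(String[] args) {
        long max = 1L << 53; // 9007199254740992

        // Integers with magnitude <= 2^53 survive the round trip losslessly:
        System.out.println((long) (double) max == max);           // true
        System.out.println((long) (double) (max - 1) == max - 1); // true

        // One past 2^53, the nearest representable double is 2^53 itself,
        // so two distinct longs collapse to the same double:
        System.out.println((double) (max + 1) == (double) max);   // true
    }
}
```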

How many use cases would an abstract class for this be helpful for? Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

I was using Long (or Integer would be fine) for class lookup columns, with some threshold like "if the column is a String, or an Int with fewer than 20 unique values, then treat it as a catFeats column."
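That heuristic could look something like the following sketch (the class name, the exact threshold of 20, and the treatment of real-valued columns as always numeric are all assumptions for illustration):

```java
import java.util.*;

// Sketch of the heuristic described above: a column becomes a categorical
// feature if it holds Strings, or integers with fewer than 20 distinct
// values; real-valued columns stay numeric. Names/threshold are made up.
public class CategoricalHeuristic {
    static final int MAX_CATEGORIES = 20; // assumed threshold

    static boolean isInteger(String s) {
        try { Long.parseLong(s); return true; }
        catch (NumberFormatException e) { return false; }
    }

    static boolean isNumeric(String s) {
        try { Double.parseDouble(s); return true; }
        catch (NumberFormatException e) { return false; }
    }

    static boolean isCategorical(Collection<String> column) {
        if (column.stream().allMatch(CategoricalHeuristic::isInteger))
            return new HashSet<>(column).size() < MAX_CATEGORIES; // small int domain
        if (column.stream().allMatch(CategoricalHeuristic::isNumeric))
            return false; // real-valued -> numeric feature
        return true;      // Strings -> categorical
    }

    public static void main(String[] args) {
        System.out.println(isCategorical(List.of("red", "green", "red"))); // true
        System.out.println(isCategorical(List.of("1.5", "2.7", "3.0")));   // false
        System.out.println(isCategorical(List.of("1", "2", "1")));         // true
    }
}
```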

Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

Exactly.

Here is the code. I warned you - ugly. But maybe I could hack half of it out using better constructors?

https://gist.github.com/salamanders/cd42f99b8483e8d0d89f6edfa5b43a10

tableToDataSet_Classification and tableToDataSet_Regression have the interesting code, the rest is support material.

Fixed a bug in the number of cats and simplified the "int, double, or string" logic — now working pretty fast!