Table of Contents
- Maven Dependency
- Features
- Illustrations:
- Merging with a fuzzy match
- Joins
- Inner join
- Left join
- Right join
- Full join
- Join with custom functions
- Record functions:
- Doing a Select with a calculated field
- Iterating over records/tables
- Get Cell Value
- Delete Column
- CSV To MapList
- Sql To CSV
- Add Column
- Filter Records
- Sorting
- Ranges
- Transform each cell record
- Transposing
- Simplistic Aggregations
- Custom Aggregation
- Illustrations:
- Note:
<dependency>
<groupId>fuzzy-csv</groupId>
<artifactId>fuzzycsv</artifactId>
<version>${version}</version>
</dependency>
compile 'fuzzy-csv:fuzzycsv:1.6.11'
Repositories
<repositories>
<repository>
<id>kayr.repo.releases</id>
<url>http://omnitech.co.ug/m2/releases</url>
</repository>
</repositories>
- Merging using Fuzzy Matching with the help of the SecondString project(useful when you have misspelled column names in the different CSV files)
- Inner Join
- Right Join
- Left join
- Full Join
- Record functions
- Transposing
- Grouping
- Sum and Count Aggregate functions
- Lenient arithmetic operations i.e strings are coerced to numbers
Using the following as examples:
- Set the accuracy threshold to 75%
- Merge using code below
import static fuzzycsv.FuzzyCSVTable.tbl
def csv1 = [
['first name', 'sex'],
['alex', 'male'],
['sara', 'female']
]
def csv2 = [
['ferts nama', 'age', 'sex'],
['alex', '21', 'male'],
['peter', '21', 'male']
]
FuzzyCSV.ACCURACY_THRESHOLD.set(75) //set accuracy threshold
tbl(csv1).mergeByColumn(csv2).printTable()
This will output (Notice how it merged [first name] and [ferts nama])
first name sex age
---------- --- ---
alex male -
sara female -
alex male 21
peter male 21
_________
4 Rows
For now joins do not use fuzzy matching simply because in my use case it was not necessary
package fuzzycsv
import static fuzzycsv.FuzzyCSVTable.tbl
import static fuzzycsv.RecordFx.fn
def csv1 = [
['name', 'sex'],
['alex', 'male'],
['sara', 'female']
]
def csv2 = [
['name', 'age','hobby'],
['alex', '21','biking'],
['peter', '21','swimming']
]
def csv = tbl(csv1).join(csv2, 'name')
println csv
/*output
[name, sex, age, hobby]
[alex, male, 21, biking]*/
csv = tbl(csv1).leftJoin(csv2, 'name')
println csv
/*output
[name, sex, age, hobby]
[alex, male, 21, biking]
[sara, female, null, null]*/
csv = tbl(csv1).rightJoin(csv2, 'name')
println csv
/*output
[name, sex, age, hobby]
[alex, male, 21, biking]
[peter, null, 21, swimming]*/
csv = tbl(csv1).fullJoin(csv2, 'name')
println csv
/*output
[name, sex, age, hobby]
[alex, male, 21, biking]
[sara, female, null, null]
[peter, null, 21, swimming]*/
def csv = tbl(csv1).fullJoin(csv2){it.left('name') == it.right('name')}
println csv.toStringFormatted()
/*output
name sex name age hobby
---- --- ---- --- -----
alex male alex 21 biking
sara female sara - -
peter - peter 21 swimming
_________
3 Rows
*/
These Help you write expression or functions for a record. E.g A function multiplying price by quantity. The record function run in two modes:
- One with type coercion which can be created using
RecordFx.fn{}
.This mode is lenient and does not throw most exceptions. This mode supports division of nulls(null/2
), zero(2/0
) division and type coercion("2"/2 or Object/null
) . This mode adds extra overhead and is much slower if your are dealing with lots of records. - Another mode is
RecordFx.fx{}
which uses the default groovy evaluator. This mode is much faster if you are working with lots of records. However this mode is not lenient and hence can throwjava.lang.ArithmeticException: Division by zero
. If you want to enable leniency but still want to use the fasterRecordFx.fx{}
you can wrap your code in thefuzzycsv.FxExtensions
category(e.guse(FxExtensions){ ...code here.. }
) So the category is registered only once as compared to the former where the category is reqistered on each and every evaluation.
import static fuzzycsv.FuzzyCSVTable.tbl
import static fuzzycsv.RecordFx.fn
def csv2 = [
['price', 'quantity'],
['2', '40'],
['3', '20']
]
def csv = tbl(csv2).select('price', 'quantity', fn('total') { it.price * it.quantity })
println csv
//output
//[price, quantity, total]
//[2, 40, 80]
//[3, 20, 60]
Consider we have the following csv
def csv2 = [
['name', 'age','hobby'],
['alex', '21','biking'],
['peter', '21','swimming']
]
tbl(csv).each{println(r.name)}
/*output
alex
peter*/
println tbl(csv2)['name'][2]
//output
peter
println tbl(csv2).delete('name','age')
//output
//[hobby]
//[biking]
//[swimming]
println tbl(csv2).toMapList()
/*output
[[name:alex, age:21, hobby:biking], [name:peter, age:21, hobby:swimming]]
*/
tbl(FuzzyCSV.toCSV(sql, 'select * from PERSON'))
println tbl(csv2).addColumn(fn('Double Age') {it.age * 2})
//output
//[name, age, hobby, Double Age]
//[alex, 21, biking, 42]
//[peter, 21, swimming, 42]
println tbl(csv2).filter { it.name == 'alex' }.toStringFormatted()
/*output
name age hobby
---- --- -----
alex 21 biking
_________
1 Rows
*/
import static fuzzycsv.FuzzyCSVTable.tbl
def csv2 = [
['name', 'age','hobby'],
['alex', '21','biking'],
['martin', '40','swimming'],
['dan', '25','swimming'],
['peter', '21','swimming'],
]
tbl(csv2).sort('age','name').printTable()
//or sort using closure
tbl(csv2).sort{"$it.age $it.name"}.printTable()
/*Output for both
name age hobby
---- --- -----
alex 21 biking
peter 21 swimming
dan 25 swimming
martin 40 swimming
_________
4 Rows
*/
Ranges help slice the csv record..e.g selecting last 2, top 2, 3rd to 2nd last record
import static fuzzycsv.FuzzyCSVTable.tbl
def table = tbl([
['name', 'age','hobby'],
['alex', '21','biking'],
['martin', '40','swimming'],
['dan', '25','swimming'],
['peter', '21','swimming'],
])
//top 2
table[1..2].printTable()
/*Output
name age hobby
---- --- -----
alex 21 biking
martin 40 swimming
_________
2 Rows
*/
//last 2
table[-1..-2].printTable()
/*
name age hobby
---- --- -----
peter 21 swimming
dan 25 swimming
_________
2 Rows
*/
import static fuzzycsv.FuzzyCSVTable.tbl
def table = tbl([
['name', 'age','hobby'],
['alex', '21','biking'],
['martin', '40','swimming'],
['dan', '25','swimming'],
['peter', '21','swimming'],
])
table.transform {it.padRight(10,'-')}.printTable()
/*
name age hobby
---- --- -----
alex------ 21-------- biking----
martin---- 40-------- swimming--
dan------- 25-------- swimming--
peter----- 21-------- swimming--
_________
4 Rows
*/
println tbl(csv2).transpose()
/*output
[name, alex, peter]
[age, 21, 21]
[hobby, biking, swimming]
*/
Example 2:
def csv2 = [
['name', 'age','hobby','id'],
['alex', '21','biking',1],
['peter', '21','swimming',2]
]
//name = Column To Become Header
//age = Column Needed in Cells
//id and hobby = Columns that uniquely identify a record/row
println tbl(csv2).transpose('name','age','id','hobby')
/*output
[id, hobby, alex, peter]
[1, biking, 21, null]
[2, swimming, null, 21]
*/
In the example below we find the average age in each hobby by making use of sum count and group by functions
def csv2 = [
['name', 'age', 'Hobby'],
['alex', '21', 'biking'],
['peter', '21', 'swimming'],
['davie', '15', 'swimming'],
['sam', '16', 'biking'],
]
println tbl(csv2)
.autoAggregate(
'Hobby',
sum('age').az('TT.Age'),
count('name').az('TT.Count'),
).toStringFormatted()
/*output
Hobby TT.Age TT.Count
----- ------ --------
biking 37 2
swimming 36 2
_________
2 Rows
*/
tbl(csv2).autoAggregate(
'Hobby',
reduce { FuzzyCSVTable group -> group['age'] }.az('AgeList')
).printTable()
/*output
Hobby AgeList
----- -------
biking [21, 16]
swimming [21, 15]
_________
2 Rows
*/
This library has not been tested with very large CSV files. So performance might be a concern
More example can be seen here
https://github.com/kayr/fuzzy-csv/blob/master/src/test/groovy/fuzzycsv/FuzzyCSVTest.groovy
and
https://github.com/kayr/fuzzy-csv/blob/master/src/test/groovy/fuzzycsv/FuzzyCSVTableTest.groovy