benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support

Home Page: https://benhoyt.com/writings/goawk/

CSV - loss of double quote when dataset is updated

patrickToca opened this issue · comments

Using goawk version 1.23.1 on macOS Ventura.

The intended change is applied correctly, but the output also shows other fields modified unexpectedly.

The command used:
goawk -i csv 'BEGIN {FS=OFS=","}{if ($41==9999) {$41="NULL"}};{$33="""$33"""; print $0}' source.csv > target.csv

The contents of source.csv and target.csv are shown below.

In source, the field $3 is double quoted.
In target, the field $3 loses the double quotes. As a consequence, the number of fields changes on certain records.

The $33 fields are correctly modified.
The $41 fields are correctly modified.

Could it be an error in how I wrote the command,
or is it a real goawk issue?

----source.csv----
99190867052015021115470500009697,,"13a Providence Street",,WF1 3BG,672570090000,170,G,B1 Offices and Workshop businesses,2015-02-06,E08000036,E14001009,,2015-02-11,Mandatory issue (Marketed sale).,31,83,3,Grid Supplied Electricity,,,,31,33.99,21.24,56.64,115.71,No,,,4,Heating and Natural Ventilation,"13a Providence Street",Wakefield,Wakefield,WAKEFIELD,2015-02-11 15:47:05,,,13a
99206150022015021213585120790090,,"The Pizza Shop","55 Lake Lock Road",WF3 4HP,927051080000,156,G,A3/A4/A5 Restaurant and Cafes/Drinking Establishments and Hot Food takeaways,2015-01-30,E08000036,E14000826,,2015-02-12,Mandatory issue (Non-marketed sale).,34,99,3,Grid Supplied Electricity,,,,64,102.08,68.93,201.99,317.54,No,,,4,Heating and Natural Ventilation,"The Pizza Shop, 55 Lake Lock Road",Wakefield,Morley and Outwood,WAKEFIELD,2015-02-12 13:58:51,,63063364,Address Matched,55
99208510022015021215584572990411,UNITS 1-12 AND S1-S10,Bizspace,"Headway Business Park, Denby Dale Road",WF2 7AZ,179615280001,93,D,B8 Storage or Distribution,2015-01-14,E08000036,E14001009,,2015-02-12,Mandatory issue (Marketed sale).,26,76,3,Natural Gas,,,,25906,44.99,23.19,67.95,83.27,No,,,4,Heating and Natural Ventilation,"UNITS 1-12 AND S1-S10, Bizspace, Headway Business Park, Denby Dale Road",Wakefield,Wakefield,WAKEFIELD,2015-02-12 15:58:45,,,9999

----target.csv-----
99190867052015021115470500009697,,13a Providence Street,,WF1 3BG,672570090000,170,G,B1 Offices and Workshop businesses,2015-02-06,E08000036,E14001009,,2015-02-11,Mandatory issue (Marketed sale).,31,83,3,Grid Supplied Electricity,,,,31,33.99,21.24,56.64,115.71,No,,,4,Heating and Natural Ventilation,"13a Providence Street",Wakefield,Wakefield,WAKEFIELD,2015-02-11 15:47:05,,,13a
99206150022015021213585120790090,,The Pizza Shop,55 Lake Lock Road,WF3 4HP,927051080000,156,G,A3/A4/A5 Restaurant and Cafes/Drinking Establishments and Hot Food takeaways,2015-01-30,E08000036,E14000826,,2015-02-12,Mandatory issue (Non-marketed sale).,34,99,3,Grid Supplied Electricity,,,,64,102.08,68.93,201.99,317.54,No,,,4,Heating and Natural Ventilation,"The Pizza Shop, 55 Lake Lock Road",Wakefield,Morley and Outwood,WAKEFIELD,2015-02-12 13:58:51,,63063364,Address Matched,55
99208510022015021215584572990411,UNITS 1-12 AND S1-S10,Bizspace,Headway Business Park, Denby Dale Road,WF2 7AZ,179615280001,93,D,B8 Storage or Distribution,2015-01-14,E08000036,E14001009,,2015-02-12,Mandatory issue (Marketed sale).,26,76,3,Natural Gas,,,,25906,44.99,23.19,67.95,83.27,No,,,4,Heating and Natural Ventilation,"UNITS 1-12 AND S1-S10, Bizspace, Headway Business Park, Denby Dale Road",Wakefield,Wakefield,WAKEFIELD,2015-02-12 15:58:45,,,NULL

Hi @patrickToca, this is actually not a bug; it's happening because you're not outputting in CSV mode. You're using -i csv to set "CSV input mode", but you need -o csv as well, to set "CSV output mode". This will properly quote fields in the output that have commas in them. Note that in CSV input mode FS is ignored, and in CSV output mode OFS is ignored, so you don't need to set those.

The particular field that's tripping you up is field $4, which on line 3 has a comma in it: "Headway Business Park, Denby Dale Road". That's becoming two fields in the output, due to the comma. But in CSV output mode that is properly quoted.
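To illustrate why the field count changes, here is a minimal sketch that uses plain POSIX awk rather than goawk's CSV mode. The csv_field_count helper and the shortened three-field sample row are hypothetical, made up for this example; the helper simply counts commas that fall outside double quotes, which approximates what a CSV-aware reader would see.

```shell
# Hypothetical helper: count CSV fields by counting commas outside double quotes.
# This is a simplified approximation of a CSV reader, not goawk's actual parser.
csv_field_count() {
  printf '%s\n' "$1" | awk '{
    n = 1; inq = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq          # toggle "inside quotes" state
      else if (c == "," && !inq) n++     # only unquoted commas separate fields
    }
    print n
  }'
}

# Shortened sample row (assumed for illustration), with the comma-containing field quoted.
quoted='Bizspace,"Headway Business Park, Denby Dale Road",WF2 7AZ'
# The same row after the quotes are lost, as in target.csv.
unquoted='Bizspace,Headway Business Park, Denby Dale Road,WF2 7AZ'

quoted_count=$(csv_field_count "$quoted")
unquoted_count=$(csv_field_count "$unquoted")
echo "quoted: $quoted_count fields, unquoted: $unquoted_count fields"
```

With the quotes in place the embedded comma stays inside one field; once they are dropped, the same text reads as one extra field.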

See also the "NOTE" in the docs for CSV output mode -- you'll need to use a bare print rather than print $0 in CSV output mode.

Also, not that it's causing a problem, but you can shorten {if ($41==9999) {$41="NULL"}} to use an AWK pattern-action construct, instead of an if statement -- so it becomes $41==9999 {$41="NULL"}.
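As a small sketch of that equivalence, the following compares the two forms using plain POSIX awk on a made-up three-field sample (field $3 stands in for $41, and FS/OFS are set by hand because CSV mode is not used here). Both forms should produce identical output.

```shell
# Hypothetical sample data; the real file uses field 41 rather than field 3.
data='a,b,9999
c,d,55'

# Form 1: an if statement inside an unconditional action.
with_if=$(printf '%s\n' "$data" |
  awk 'BEGIN { FS = OFS = "," } { if ($3 == 9999) { $3 = "NULL" } } { print }')

# Form 2: the condition moved into the pattern of a pattern-action rule.
with_pattern=$(printf '%s\n' "$data" |
  awk 'BEGIN { FS = OFS = "," } $3 == 9999 { $3 = "NULL" } { print }')

printf '%s\n' "$with_pattern"
```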

Overall, I believe the equivalent script to what you have, but handling quoting correctly, is as follows:

goawk -i csv -o csv '$41==9999 {$41="NULL"} { print }' source.csv > target2.csv

Thanks.