benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support

Home Page: https://benhoyt.com/writings/goawk/

Dynamically changing RS

microo8 opened this issue

I've got a file where the first few bytes define some of the attributes of the file. The 9th byte is the record separator.

I need to read this file, set RS and then read the file "again" but now separated by this new record separator.

Input file (here the record separator is '):

UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303

This works on GNU awk:

BEGIN { RS=".{9}" }
NR==1 { $0=substr(RT,1,8); RS=substr(RT,9,1) }
{ print $0 }

output:

UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

but not on goawk:

UNA:+,? 








Interesting, thanks for the report! This is a tricky one. It seems that Gawk (and other AWKs) allow you to set RS at any time while reading an input file: they update RS and then parse the rest (the unread part) of the file with the new value. However, GoAWK uses bufio.Scanner on each input file, which doesn't have an API for changing the separator partway through a read (some of the data already read would still be sitting in its buffer).
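To illustrate the buffering problem, here is a minimal Go sketch. It is not GoAWK's actual code, and the helper name leftoverAfterFirstToken is made up for this example. It scans one line with a bufio.Scanner and then checks what is still available from the underlying reader:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// leftoverAfterFirstToken (hypothetical helper) scans a single line from
// input with bufio.Scanner, then returns that line together with whatever
// the underlying reader still has to offer.
func leftoverAfterFirstToken(input string) (token, leftover string) {
	src := strings.NewReader(input)
	sc := bufio.NewScanner(src) // default split function is bufio.ScanLines
	if sc.Scan() {
		token = sc.Text()
	}
	// Anything the scanner read ahead is no longer available from src.
	rest, _ := io.ReadAll(src) // a strings.Reader does not fail here
	return token, string(rest)
}

func main() {
	token, leftover := leftoverAfterFirstToken("header\nA'B'C\n")
	fmt.Printf("token=%q leftover=%q\n", token, leftover)
}
```

For a small input like this, the scanner pulls everything into its own buffer on the first Scan, so leftover comes back empty. Wrapping the same reader in a fresh scanner with a different separator would therefore see none of the remaining records.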

I can reproduce your case if I save your input file to rstest.in and the program to rstest.awk:

$ gawk -f rstest.awk rstest.in 
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ goawk -f rstest.awk rstest.in 
UNA:+,? 

... lots more blank lines ...

303

$

However, that program doesn't work in original-awk or mawk either, I guess because of the use of the Gawk-only RT variable. Here's a more portable program that shows the same "dynamic setting of RS" issue:

$ cat rstest2.awk
NR==1 { RS=substr($0,9,1) }
NR>1  { print $0 }
$ cat rstest.in rstest.in >rstest2.in
$ gawk -f rstest2.awk rstest2.in  # original-awk and mawk have the same output now
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$ goawk -f rstest2.awk rstest2.in 
UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303
$ 

To work around this in GoAWK for now, I'd recommend actually reading (part of) the file twice. Note how rstest.in is specified twice on the command line. This works in GoAWK and other AWKs:

$ cat rstest3.awk 
NR==1   { RS=substr($0,9,1); next }
NR!=FNR { print $0 }
$ goawk -f rstest3.awk rstest.in rstest.in
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ 

That said, I think this is a bug (or at least a quirk) of GoAWK, so I'm going to leave it open. I'm not sure of the best way to fix it without revamping the use of bufio.Scanner. I think I'd need a scanner variant that can transfer the remaining/buffered bytes to a new scanner when dynamically changing RS.

@arnoldrobbins, any thoughts on this? Where is this behaviour (that one can change RS part way through a file) documented, or is it just assumed that this will work? I couldn't find it explicitly documented from a scan of RS in the Gawk manual, though I may have missed it.

It's just assumed it will work. RS is like any other variable that you can change at any time you like. I agree with your assessment, that this is a bug in GoAWK. In C this is handled fairly naturally; there's a buffer, RS matches the end of the text, and then you start again with whatever is in the current value of RS to find the next end of the buffer (with appropriate buffer management and filling from the file). HTH.
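The "one buffer, consult the current RS for each record" approach described above can be sketched in Go with bufio.Reader, whose ReadString takes the delimiter on every call. This is only an illustration under simplifying assumptions: the separator is a single byte (no regex RS, no paragraph mode), and the function name readRecords is invented for the example. It mimics rstest2.awk by switching the separator to the 9th byte of the first record:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readRecords (hypothetical helper) splits input into records, looking up the
// current separator before each read. After the first record it changes the
// separator to that record's 9th byte, like `NR==1 { RS=substr($0,9,1) }`.
func readRecords(input string) []string {
	r := bufio.NewReader(strings.NewReader(input))
	rs := byte('\n') // initial separator, like AWK's default RS
	var records []string
	for {
		// The delimiter is passed per call, so a change to rs takes effect
		// on the next record while the reader keeps its buffered data.
		rec, err := r.ReadString(rs)
		if err == nil {
			rec = rec[:len(rec)-1] // ReadString includes the delimiter
		}
		if err == nil || rec != "" {
			records = append(records, rec)
		}
		if err != nil {
			break // io.EOF for a strings.Reader
		}
		if len(records) == 1 && len(rec) >= 9 {
			rs = rec[8] // 9th byte of the first record
		}
	}
	return records
}

func main() {
	in := "UNA:+,? 'UNB+1'\nUNA:+,? 'UNH+1'BGM+7'"
	for _, rec := range readRecords(in) {
		fmt.Printf("%q\n", rec)
	}
}
```

Because the buffer lives in one reader for the whole file, nothing is lost when the separator changes; the next call simply searches the same buffered bytes for the new delimiter.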

I think what I'll do here (at some point) is copy the bufio.Scanner implementation into the GoAWK codebase, add a Buffered() io.Reader method (similar to encoding/json's Decoder.Buffered), and then use that if RS changes in the middle of reading a file. If Buffered() works out well, I'll propose adding Buffered to Go's bufio.Scanner.
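A rough sketch of how such a Buffered() hand-off might look follows. The recordScanner type, its methods, and changeSeparator are all hypothetical stand-ins written for this illustration, not the copied bufio.Scanner and not an existing Go API. It assumes single-byte separators and treats any read error as end of input:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// recordScanner is a hypothetical, stripped-down scanner that splits on one byte.
type recordScanner struct {
	src io.Reader
	sep byte
	buf []byte // bytes read from src but not yet returned as a record
	eof bool
}

// Next returns the next record, or false once the input is exhausted.
func (s *recordScanner) Next() (string, bool) {
	for {
		if i := bytes.IndexByte(s.buf, s.sep); i >= 0 {
			rec := string(s.buf[:i])
			s.buf = s.buf[i+1:]
			return rec, true
		}
		if s.eof {
			if len(s.buf) == 0 {
				return "", false
			}
			rec := string(s.buf) // final record without a trailing separator
			s.buf = nil
			return rec, true
		}
		chunk := make([]byte, 4096)
		n, err := s.src.Read(chunk)
		s.buf = append(s.buf, chunk[:n]...)
		if err != nil {
			s.eof = true // sketch: any error ends the input
		}
	}
}

// Buffered exposes the read-ahead bytes, in the spirit of json.Decoder.Buffered.
func (s *recordScanner) Buffered() io.Reader { return bytes.NewReader(s.buf) }

// changeSeparator builds a new scanner that first drains the old scanner's
// read-ahead and then continues with the original source.
func changeSeparator(old *recordScanner, sep byte) *recordScanner {
	return &recordScanner{src: io.MultiReader(old.Buffered(), old.src), sep: sep}
}

// splitWithSwitch mimics rstest2.awk: RS becomes the 9th byte of record 1.
func splitWithSwitch(input string) []string {
	sc := &recordScanner{src: strings.NewReader(input), sep: '\n'}
	var records []string
	for {
		rec, ok := sc.Next()
		if !ok {
			return records
		}
		records = append(records, rec)
		if len(records) == 1 && len(rec) >= 9 {
			sc = changeSeparator(sc, rec[8])
		}
	}
}

func main() {
	fmt.Printf("%q\n", splitWithSwitch("UNA:+,? 'UNB+1'\nUNA:+,? 'UNH+1'BGM+7'"))
}
```

The key step is io.MultiReader(old.Buffered(), old.src): the replacement scanner sees the unconsumed bytes first, so no input is dropped when the separator changes.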