halpo / parser

R parser package

Non-ASCII characters

daroczig opened this issue

I encountered a strange issue with non-ASCII characters while using (the really awesome) parser. For example, here is how a formula is parsed:

> parser::parser(text='hp ~ wt')
expression(hp ~ wt)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    77  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    77  8      9         0       expr
6     1    0     0     1    7     7    77  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE     
4     TRUE   wt
5    FALSE     
6    FALSE     
attr(,"file")
[1] "/tmp/RtmpHBZpw0/file62187b7f21c0"
attr(,"encoding")
[1] "unknown"
attr(,"class")
[1] "parser"

This is pretty cool, but when I try to parse a formula containing an accented character, I get this:

> parser::parser(text='hp ~ é')
Error in parser::parser(text = "hp ~ é") : 
/tmp/RtmpHBZpw0/file6218234db34e:1:5
        unexpected input

This does not happen with base::parse:

> parse(text='hp ~ é')
expression(hp ~ é)

Does it only happen with text, or does it also happen with files? Are you on Windows or Unix, and what is your locale?
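The locale details asked for here can be collected with base R alone. A minimal sketch, assuming nothing beyond base R; `locale_report` is a hypothetical helper name and not part of parser:

```r
# Hypothetical helper: summarise the locale facts relevant to this report.
locale_report <- function() {
  info <- l10n_info()  # base R: list with MBCS, UTF-8 and Latin-1 flags
  list(
    ctype = Sys.getlocale("LC_CTYPE"),  # character-type locale in effect
    utf8  = isTRUE(info[["UTF-8"]]),    # TRUE when running in a UTF-8 locale
    mbcs  = isTRUE(info[["MBCS"]])      # TRUE when the encoding is multi-byte
  )
}

str(locale_report())
```

Pasting this output next to `sessionInfo()` makes it clear at a glance whether the session is running in a multi-byte locale.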

Sorry, I am using Arch/Ubuntu Linux with various UTF-8 locales. Here is my sessionInfo():

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=C                LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] parser_0.0-16 Rcpp_0.9.13  

It also happens with files.

It seems that this issue only happens with UTF-8 locales, which is quite strange. I tested this on Windows with CP1250 on a vanilla, recent R install, and also on Linux:

> Sys.setlocale(category='LC_ALL','hungarian')
[1] "LC_CTYPE=hungarian;LC_NUMERIC=C;LC_TIME=hungarian;LC_COLLATE=hungarian;LC_MONETARY=hungarian;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
expression(hp ~ é)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    78  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    78  8      9         0       expr
6     1    0     0     1    7     7    78  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE     
4     TRUE   é
5    FALSE     
6    FALSE     
attr(,"file")
[1] "/tmp/RtmprRA3Jv/file427106a9c70"
attr(,"encoding")
[1] "unknown"
attr(,"class")
[1] "parser"
> Sys.setlocale(category='LC_ALL','hu_HU.iso88592')
[1] "LC_CTYPE=hu_HU.iso88592;LC_NUMERIC=C;LC_TIME=hu_HU.iso88592;LC_COLLATE=hu_HU.iso88592;LC_MONETARY=hu_HU.iso88592;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
expression(hp ~ é)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    78  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    78  8      9         0       expr
6     1    0     0     1    7     7    78  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE     
4     TRUE   é
5    FALSE     
6    FALSE     
attr(,"file")
[1] "/tmp/RtmprRA3Jv/file4277ffb5b0c"
attr(,"encoding")
[1] "unknown"
attr(,"class")
[1] "parser"
> Sys.setlocale(category='LC_ALL','hu_HU.UTF-8')
[1] "LC_CTYPE=hu_HU.UTF-8;LC_NUMERIC=C;LC_TIME=hu_HU.UTF-8;LC_COLLATE=hu_HU.UTF-8;LC_MONETARY=hu_HU.UTF-8;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
Error in parser(text = "hp ~ é") : 
/tmp/RtmprRA3Jv/file42716da8122:1:5
    unexpected input

So ISO-8859-2 and the special "hungarian" locale are OK, but whenever I switch to UTF-8 (whether Hungarian or GB/US English), parsing the formula fails with parser, while it works fine with base::parse.

Unfortunately I have no idea how to debug this further, but I would really love to help track down this strange behavior. Please let me know if I can provide more details, or if there is a workaround.
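One visible difference between the working and failing locales is how `é` is stored: it is a single byte in ISO-8859-2 but two bytes in UTF-8, and the reported error position (`1:5`) is where that character begins. Whether the multi-byte sequence is really what trips parser is an assumption, not something confirmed in this thread. A minimal base-R sketch for comparing the two byte sequences:

```r
# "\u00e9" is e-acute; the escape keeps this snippet ASCII-only.
x <- "hp ~ \u00e9"

# Bytes of the string as UTF-8: 68 70 20 7e 20 c3 a9 (7 bytes).
utf8_bytes <- charToRaw(enc2utf8(x))

# Bytes after converting to ISO-8859-2: 68 70 20 7e 20 e9 (6 bytes).
# With toRaw = TRUE, iconv() returns a list holding one raw vector per input string.
latin2_bytes <- iconv(x, from = "UTF-8", to = "ISO-8859-2", toRaw = TRUE)[[1]]

print(utf8_bytes)
print(latin2_bytes)
```

If the failure tracks the multi-byte case, the same error would be expected for any non-ASCII symbol in a UTF-8 locale, not just `é`.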

This seems to be fixed in R 3.0.0:

> getParseData(parse(text='hp ~ é'))
  line1 col1 line2 col2 id parent  token terminal text
7     1    1     1    5  7      0   expr    FALSE     
1     1    1     1    2  1      3 SYMBOL     TRUE   hp
3     1    1     1    2  3      7   expr    FALSE     
2     1    4     1    4  2      7    '~'     TRUE    ~
4     1    5     1    5  4      6 SYMBOL     TRUE    é
6     1    5     1    5  6      7   expr    FALSE
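For anyone who only needs the token table, the base-R route shown above can be wrapped up as follows. This is a sketch assuming R 3.0.0 or later, where `utils::getParseData` is available; `terminal_tokens` is a hypothetical helper name, and the example input is kept ASCII so that it runs in any locale:

```r
# Hypothetical helper: return the terminal tokens of a piece of R code.
terminal_tokens <- function(code) {
  # keep.source = TRUE is required; without source references
  # getParseData() has nothing to report (Rscript defaults to FALSE).
  exprs <- parse(text = code, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  # Keep only the leaves of the parse tree (symbols, operators, constants).
  leaves <- pd[pd$terminal, c("token", "text")]
  leaves$text <- as.character(leaves$text)
  rownames(leaves) <- NULL
  leaves
}

tokens <- terminal_tokens("hp ~ wt")
print(tokens)
```

In a UTF-8 locale the same call with `'hp ~ é'` should behave like the `getParseData` output above, going by what is reported in this thread.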