Non-ASCII characters
daroczig opened this issue
I encountered a strange issue with non-ASCII characters while using (the really awesome) parser. E.g. let us check out how a formula is parsed:
> parser::parser(text='hp ~ wt')
expression(hp ~ wt)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    77  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    77  8      9         0       expr
6     1    0     0     1    7     7    77  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE
4     TRUE   wt
5    FALSE
6    FALSE
attr(,"file")
[1] "/tmp/RtmpHBZpw0/file62187b7f21c0"
attr(,"encoding")
[1] "unknown"
attr(,"class")
This is pretty cool, but when I try to parse a formula with some accented characters, I get this:
> parser::parser(text='hp ~ é')
Error in parser::parser(text = "hp ~ é") :
/tmp/RtmpHBZpw0/file6218234db34e:1:5
unexpected input
This does not happen with base::parse:
> parse(text='hp ~ é')
expression(hp ~ é)
Does it only happen with text, or does it also happen with files? Are you on Windows or Unix, and what is your locale?
Sorry, I am using Arch/Ubuntu Linux with various UTF-8 locales, but here goes my sessionInfo():
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parser_0.0-16 Rcpp_0.9.13
Also happens with files.
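For anyone who wants to check the file-based case without the parser package installed, here is a minimal sketch using base R only. It assumes a UTF-8 session locale (in a C locale base::parse may itself reject é), and the temporary file is purely illustrative.

```r
# Sketch: write the formula to a UTF-8 encoded file and parse it with base R.
# "\u00e9" stands for é so the script itself stays ASCII.
src <- tempfile(fileext = ".R")
con <- file(src, open = "w", encoding = "UTF-8")
writeLines("hp ~ \u00e9", con)
close(con)

# Inspect the raw bytes actually written: é should appear as c3 a9.
file_bytes <- readBin(src, what = "raw", n = file.size(src))
print(file_bytes)

# Parse the file, declaring its encoding explicitly.
exprs <- parse(file = src, encoding = "UTF-8", keep.source = TRUE)
print(exprs[[1]])
```

If this succeeds while parser::parser fails on the same file, the difference lies in how the two read the bytes, not in the file.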
It seems that this issue is only happening with UTF-8 locales, which is quite strange. I have tested this on Windows with CP1250 with a vanilla, recent R install, also on Linux:
> Sys.setlocale(category='LC_ALL','hungarian')
[1] "LC_CTYPE=hungarian;LC_NUMERIC=C;LC_TIME=hungarian;LC_COLLATE=hungarian;LC_MONETARY=hungarian;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
expression(hp ~ é)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    78  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    78  8      9         0       expr
6     1    0     0     1    7     7    78  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE
4     TRUE    é
5    FALSE
6    FALSE
attr(,"file")
[1] "/tmp/RtmprRA3Jv/file427106a9c70"
attr(,"encoding")
[1] "unknown"
attr(,"class")
[1] "parser"
> Sys.setlocale(category='LC_ALL','hu_HU.iso88592')
[1] "LC_CTYPE=hu_HU.iso88592;LC_NUMERIC=C;LC_TIME=hu_HU.iso88592;LC_COLLATE=hu_HU.iso88592;LC_MONETARY=hu_HU.iso88592;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
expression(hp ~ é)
attr(,"data")
  line1 col1 byte1 line2 col2 byte2 token id parent top_level token.desc
1     1    0     0     1    2     2   263  1      4         0     SYMBOL
2     1    3     3     1    4     4   126  3      9         0        '~'
3     1    0     0     1    2     2    78  4      9         0       expr
4     1    5     5     1    7     7   263  6      8         0     SYMBOL
5     1    5     5     1    7     7    78  8      9         0       expr
6     1    0     0     1    7     7    78  9      0         0       expr
  terminal text
1     TRUE   hp
2     TRUE    ~
3    FALSE
4     TRUE    é
5    FALSE
6    FALSE
attr(,"file")
[1] "/tmp/RtmprRA3Jv/file4277ffb5b0c"
attr(,"encoding")
[1] "unknown"
attr(,"class")
[1] "parser"
> Sys.setlocale(category='LC_ALL','hu_HU.UTF-8')
[1] "LC_CTYPE=hu_HU.UTF-8;LC_NUMERIC=C;LC_TIME=hu_HU.UTF-8;LC_COLLATE=hu_HU.UTF-8;LC_MONETARY=hu_HU.UTF-8;LC_MESSAGES=hu_HU.utf8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=hu_HU.utf8;LC_IDENTIFICATION=C"
> parser(text='hp ~ é')
Error in parser(text = "hp ~ é") :
/tmp/RtmprRA3Jv/file42716da8122:1:5
unexpected input
So ISO-8859-2 and the special hungarian locale are OK, but whenever I change to UTF-8 (be it Hungarian or GB/US English), parsing the formula fails with parser, but works OK with base::parse.
Unfortunately I have no idea how to debug this further, but I would really love to help track down this strange behavior. Please let me know if I can provide more details or if there is any workaround.
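One possible starting point for debugging, offered as a sketch and not as a confirmed diagnosis: compare the raw bytes of the input under the two encodings. In UTF-8, é is a two-byte sequence, while in ISO-8859-2 it is a single byte, so a lexer that walks the input byte by byte could plausibly behave differently. That explanation is an assumption; the snippet below only shows the byte-level difference and does not call the parser package.

```r
# Diagnostic sketch: byte-level view of the same formula in two encodings.
# "\u00e9" stands for é so the script itself stays ASCII.
x <- "hp ~ \u00e9"

# What the current session reports about its locale.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])  # TRUE when running in a UTF-8 locale

# Bytes of the string in UTF-8: é occupies two bytes (c3 a9).
utf8_bytes <- charToRaw(enc2utf8(x))
print(utf8_bytes)

# Bytes after conversion to ISO-8859-2: é occupies one byte (e9).
# Assumes the platform's iconv supports the name "ISO-8859-2".
latin2_bytes <- iconv(x, from = "UTF-8", to = "ISO-8859-2", toRaw = TRUE)[[1]]
print(latin2_bytes)
```

The error position reported above (1:5) is where é starts, which is at least consistent with the failure being tied to the multibyte character.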
This seems to be fixed in R 3.0.0:
> getParseData(parse(text='hp ~ é'))
  line1 col1 line2 col2 id parent  token terminal text
7     1    1     1    5  7      0   expr    FALSE
1     1    1     1    2  1      3 SYMBOL     TRUE   hp
3     1    1     1    2  3      7   expr    FALSE
2     1    4     1    4  2      7    '~'     TRUE    ~
4     1    5     1    5  4      6 SYMBOL     TRUE    é
6     1    5     1    5  6      7   expr    FALSE
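As a workaround sketch for R 3.0.0 or later, the same token information can be pulled from base R alone. This assumes a UTF-8 session locale; keep.source = TRUE is passed explicitly because non-interactive sessions (e.g. Rscript) do not keep source references by default, in which case getParseData() returns NULL.

```r
# Sketch: token table from base R, without the parser package.
# "\u00e9" stands for é so the script itself stays ASCII.
code <- "hp ~ \u00e9"

# keep.source = TRUE is needed so that parse data is attached to the result.
exprs <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(exprs)

# Keep only the terminal tokens (the actual symbols and operators).
terminals <- pd[pd$terminal, c("token", "text")]
print(terminals)
```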