qinwf / jiebaR

Chinese text segmentation with R. (Documentation updated 🎉: https://qinwenfeng.com/jiebaR/ )

jiebaR throws an error on Linux

YuanboXu opened this issue

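It runs fine on both Mac and Windows, but fails on Linux, where even the simplest segmentation statement throws an error. I hope this can be resolved, thanks!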

Environment information

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.5 (Final)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Rcpp_0.12.12 stringr_1.2.0 jiebaR_0.9.1 jiebaRD_0.1 lda_1.4.2
[6] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1 tidyr_0.7.0 tibble_1.3.4
[11] ggplot2_2.2.1 tidyverse_1.1.1

loaded via a namespace (and not attached):
[1] cellranger_1.1.0 plyr_1.8.4 bindr_0.1 forcats_0.2.0
[5] tools_3.3.1 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-128
[9] gtable_0.2.0 lattice_0.20-33 pkgconfig_2.0.1 rlang_0.1.2
[13] psych_1.7.5 parallel_3.3.1 haven_1.1.0 bindrcpp_0.2
[17] xml2_1.1.1 httr_1.3.1 hms_0.3 grid_3.3.1
[21] glue_1.1.1 R6_2.2.2 readxl_1.0.0 foreign_0.8-66
[25] reshape2_1.4.2 modelr_0.1.1 magrittr_1.5 scales_0.5.0
[29] rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[33] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2

Full error message

Error in grep("(UCP)^[^⺀- 〡-﹏a-zA-Z0-9]$", result, perl = TRUE, :
invalid regular expression '(UCP)^[^⺀- 〡-﹏a-zA-Z0-9]$'
In addition: Warning message:
In grep("(UCP)^[^⺀- 〡-﹏a-zA-Z0-9]$", result, perl = TRUE, :
PCRE pattern compilation error
'this version of PCRE is not compiled with Unicode property support'
at '(UCP)^[^⺀- 〡-﹏a-zA-Z0-9]$'

Minimal reproducible code and data files, and which step of the code fails

library(jiebaR)
ct <- worker()
a <- c("我爱你")
wd <- segment(a, ct)

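This error occurs because PCRE was compiled and installed without the --enable-unicode-properties option. To fix it, recompile and reinstall PCRE: change into the downloaded PCRE source directory and run: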
./configure --enable-utf8 --enable-unicode-properties
make
make install
Once all of these have finished without errors, run:
pcretest -C
You should see output like the following:
[bdumodel@CDH2 ~]$ pcretest -C
PCRE version 8.41 2017-07-05
Compiled with
8-bit support
UTF-8 support
Unicode properties support
No just-in-time compiler support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Parentheses nest limit = 250
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
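
The check above can also be scripted rather than read by eye. The following is only a rough sketch: `has_unicode_properties` is a hypothetical helper name, and it assumes that a build without the option prints the line with a "No " prefix, in the same way the output above shows "No just-in-time compiler support".

```shell
#!/bin/sh
# Hypothetical helper (not part of PCRE or jiebaR): reads `pcretest -C`
# output on stdin and succeeds only if a line starts, after optional
# indentation, with "Unicode properties support".
# Assumption: builds without the option print "No Unicode properties support",
# which this pattern does not match because of the leading "No ".
has_unicode_properties() {
    grep -q '^[[:space:]]*Unicode properties support'
}

# Usage, assuming pcretest is on PATH:
#   pcretest -C | has_unicode_properties && echo "Unicode properties: OK"
```

The pattern is anchored at the start of the line so that a negated line is not mistaken for a positive one.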