str_detect() error when detecting Chinese characters on Windows

Asked by: Dennis Tseng, Asked: 9/19/2021, Updated: 9/19/2021, Views: 178

Q:

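I am using RStudio on a Windows machine and trying to do some string matching with Chinese characters. I am not familiar with encoding settings on Windows, so I went through some tutorials and made sure the result of Sys.getlocale() is what it should be.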

When I run str_detect() inside a data frame, the match fails, but it works at the vector level. In addition, df_edu_village %>% filter(str_detect(village, "糖")) gives a different result from df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape())).

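I tried to reproduce the result with reprex(), but that did not work, possibly because of a copy-and-paste problem, so I knitted the output below to HTML from an Rmd file myself. Thanks for your help.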

# devtools::install_github("ntupsc/pscdata")
library(pscdata)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

df_edu_village <- pscdata::edu_village_original %>% as_tibble() %>% distinct(village)
df_edu_village %>% filter(str_detect(village, "糖"))
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `str_detect(village, "糖")`.
## x Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`聶}`)
df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape()))
## # A tibble: 3 x 1
##   village
##   <chr>  
## 1 糖<U+5ECD>里 
## 2 糖友里 
## 3 大糖里
df_edu_village %>% filter(str_detect(village, "里"))
## # A tibble: 0 x 1
## # ... with 1 variable: village <chr>
str_detect(df_edu_village$village[1], "里")
## [1] FALSE
grepl(df_edu_village$village[1], "里")
## [1] FALSE
str_detect("留侯里", "里")
## [1] TRUE
Sys.getlocale()
## [1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950   
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C                                
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950    
## system code page: 1252
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
##  [5] readr_2.0.1     tidyr_1.1.3     tibble_3.1.4    ggplot2_3.3.5  
##  [9] tidyverse_1.3.1 pscdata_0.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1 xfun_0.26        haven_2.4.3      colorspace_2.0-2
##  [5] vctrs_0.3.8      generics_0.1.0   htmltools_0.5.2  yaml_2.2.1      
##  [9] utf8_1.2.2       rlang_0.4.11     jquerylib_0.1.4  pillar_1.6.2    
## [13] withr_2.4.2      glue_1.4.2       DBI_1.1.1        dbplyr_2.1.1    
## [17] modelr_0.1.8     readxl_1.3.1     lifecycle_1.0.0  munsell_0.5.0   
## [21] gtable_0.3.0     cellranger_1.1.0 rvest_1.0.1      evaluate_0.14   
## [25] knitr_1.34       tzdb_0.1.2       fastmap_1.1.0    curl_4.3.2      
## [29] fansi_0.5.0      broom_0.7.9      Rcpp_1.0.7       backports_1.2.1 
## [33] scales_1.1.1     jsonlite_1.7.2   fs_1.5.0         hms_1.1.0       
## [37] digest_0.6.27    stringi_1.7.4    grid_4.1.1       cli_3.0.1       
## [41] tools_4.1.1      magrittr_2.0.1   crayon_1.4.1     pkgconfig_2.0.3 
## [45] ellipsis_0.3.2   xml2_1.3.2       reprex_2.0.1     lubridate_1.7.10
## [49] rstudioapi_0.13  assertthat_0.2.1 rmarkdown_2.11   httr_1.4.2      
## [53] R6_2.5.1         compiler_4.1.1
Tags: r, regex, utf-8, character-encoding, string
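
The question reports that a literal pattern and the same pattern after a curl_escape()/curl_unescape() round trip behave differently. One way to investigate this, offered only as a sketch, is to look at the declared encoding and raw bytes of both the pattern and the column, then convert both to UTF-8 before matching. The names pattern and x below are hypothetical stand-ins for the literal "糖" and for df_edu_village$village. Whether an encoding mismatch is the actual cause here is an assumption; the post does not confirm it.

```r
library(stringr)

# Hypothetical stand-ins for the pattern and the village column.
# "\u7cd6" is 糖 and "\u91cc" is 里; Unicode escapes avoid any dependence
# on how the source file itself is encoded.
pattern <- "\u7cd6"
x <- c("\u7cd6\u53cb\u91cc", "\u5927\u7cd6\u91cc", "\u7559\u4faf\u91cc")

# Declared encoding of each string: "UTF-8", "latin1", "bytes" or "unknown"
# ("unknown" means native encoding or plain ASCII).
print(Encoding(pattern))
print(Encoding(x))

# The underlying bytes of the pattern, useful for comparing a typed literal
# with a value taken from the data.
print(charToRaw(pattern))

# Convert both sides to UTF-8 explicitly before matching.
pattern_utf8 <- enc2utf8(pattern)
x_utf8 <- enc2utf8(x)

# fixed() matches the pattern as a literal string, so nothing in it is
# interpreted as regular-expression syntax.
hits <- str_detect(x_utf8, fixed(pattern_utf8))
print(hits)
```

Running the same Encoding() and charToRaw() calls on the real column and on the typed literal would show whether the two sides carry the same bytes and the same encoding mark.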

Comments

0 upvotes, Mark, 7/28/2023
I can't reproduce this using df <- tibble(village = c("糖里", "糖友里", "大糖里"))
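
A reproduction built from typed literals may not capture one difference: the session in the question runs under a code page 950 locale, so strings there can be stored as Big5 bytes rather than UTF-8. The sketch below shows how the same text looks in both encodings. It is an assumption that this matters for the reported error, and it also assumes that the local iconv() supports the "BIG5" encoding name; the variable names are hypothetical.

```r
# The same text, first as UTF-8 (written with Unicode escapes: 糖友里).
utf8 <- "\u7cd6\u53cb\u91cc"

# Re-encode to Big5. Assumes the platform's iconv knows "BIG5";
# iconv() returns NA if the conversion is not possible.
big5 <- iconv(utf8, from = "UTF-8", to = "BIG5")

# Compare the byte sequences of the two representations.
print(charToRaw(utf8))
print(charToRaw(big5))

# Converting back to UTF-8 recovers the original string.
back <- iconv(big5, from = "BIG5", to = "UTF-8")
print(identical(back, utf8))
```

Comparing charToRaw() of a value from the package data with charToRaw() of the typed literal would show whether a literal-based reproduction is testing the same bytes as the original data.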

A: No answers yet