Asked by: Dennis Tseng Asked: 9/19/2021 Updated: 9/19/2021 Views: 178
str_detect() error when detecting Chinese characters on Windows
Q:
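I am using RStudio on a Windows machine and trying to do some string matching with Chinese characters. I am not familiar with encoding settings on Windows, so I went through some tutorials and made sure the output of Sys.getlocale() looks correct.
Matching fails when it runs inside a data frame, but it works at the vector level. In addition, the following two str_detect() calls give different results: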
df_edu_village %>% filter(str_detect(village, "糖"))
df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape()))
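Below I tried to reproduce the result with reprex(), but that did not work, possibly because of copy-and-paste, so I knitted the results to HTML from an Rmd file myself. Thanks for your help.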
# devtools::install_github("ntupsc/pscdata")
library(pscdata)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df_edu_village <- pscdata::edu_village_original %>% as_tibble() %>% distinct(village)
df_edu_village %>% filter(str_detect(village, "糖"))
## Error: Problem with `filter()` input `..1`.
## i Input `..1` is `str_detect(village, "糖")`.
## x Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`聶}`)
df_edu_village %>% filter(str_detect(village, curl::curl_escape("糖") %>% curl::curl_unescape()))
## # A tibble: 3 x 1
## village
## <chr>
## 1 糖<U+5ECD>里
## 2 糖友里
## 3 大糖里
df_edu_village %>% filter(str_detect(village, "里"))
## # A tibble: 0 x 1
## # ... with 1 variable: village <chr>
str_detect(df_edu_village$village[1], "里")
## [1] FALSE
grepl(df_edu_village$village[1], "里")
## [1] FALSE
str_detect("留侯里", "里")
## [1] TRUE
Sys.getlocale()
## [1] "LC_COLLATE=Chinese (Traditional)_Taiwan.950;LC_CTYPE=Chinese (Traditional)_Taiwan.950;LC_MONETARY=Chinese (Traditional)_Taiwan.950;LC_NUMERIC=C;LC_TIME=Chinese (Traditional)_Taiwan.950"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
## [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
## [3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Traditional)_Taiwan.950
## system code page: 1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.0.1 tidyr_1.1.3 tibble_3.1.4 ggplot2_3.3.5
## [9] tidyverse_1.3.1 pscdata_0.1.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.26 haven_2.4.3 colorspace_2.0-2
## [5] vctrs_0.3.8 generics_0.1.0 htmltools_0.5.2 yaml_2.2.1
## [9] utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4 pillar_1.6.2
## [13] withr_2.4.2 glue_1.4.2 DBI_1.1.1 dbplyr_2.1.1
## [17] modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0 munsell_0.5.0
## [21] gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1 evaluate_0.14
## [25] knitr_1.34 tzdb_0.1.2 fastmap_1.1.0 curl_4.3.2
## [29] fansi_0.5.0 broom_0.7.9 Rcpp_1.0.7 backports_1.2.1
## [33] scales_1.1.1 jsonlite_1.7.2 fs_1.5.0 hms_1.1.0
## [37] digest_0.6.27 stringi_1.7.4 grid_4.1.1 cli_3.0.1
## [41] tools_4.1.1 magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3
## [45] ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10
## [49] rstudioapi_0.13 assertthat_0.2.1 rmarkdown_2.11 httr_1.4.2
## [53] R6_2.5.1 compiler_4.1.1
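The session info above reports a Chinese (Traditional)_Taiwan.950 locale alongside system code page 1252, so one possible but unconfirmed explanation is that the pattern and the column are not in the same encoding. The sketch below, using base R only, shows how that could be checked. The `village` vector is a hypothetical stand-in for the real column, written with `\u` escapes so it does not depend on how the script file is saved.

```r
# Hypothetical stand-in for df_edu_village$village (not the real data).
# "\u7cd6" is 糖, "\u53cb" is 友, "\u5927" is 大, "\u91cc" is 里.
village <- c("\u7cd6\u53cb\u91cc", "\u5927\u7cd6\u91cc")
pattern <- "\u91cc"

# Declared encoding of each string: "UTF-8", "latin1", "bytes" or "unknown".
Encoding(village)
Encoding(pattern)

# Check whether the column values are valid UTF-8 byte sequences.
validUTF8(village)

# Look at the raw bytes of the pattern to see what is actually stored.
charToRaw(pattern)

# Convert both sides to UTF-8 before matching; fixed = TRUE compares
# the strings literally instead of treating the pattern as a regex.
village_utf8 <- enc2utf8(village)
pattern_utf8 <- enc2utf8(pattern)
matched <- grepl(pattern_utf8, village_utf8, fixed = TRUE)
matched
```

If `Encoding()` reports different values for the real column and for a pattern typed into the script, that would point toward an encoding mismatch rather than a problem in str_detect() itself. Note also that grepl() takes the pattern as its first argument and the text as its second.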
A: No answers yet
Comments
df <- tibble(village = c("糖里", "糖友里", "大糖里"))
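Building on the small tibble in the comment above, here is a minimal sketch of one thing that might be worth trying: writing the pattern as a Unicode escape and wrapping it in stringr's fixed(). This is an assumption about what could help, not a confirmed fix for the error in the question. The name `hits` is made up for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

# The same three names as in the comment, written with \u escapes so the
# strings do not depend on the encoding of the script file.
df <- tibble(village = c("\u7cd6\u91cc", "\u7cd6\u53cb\u91cc", "\u5927\u7cd6\u91cc"))

# fixed() requests a literal comparison, so the pattern is not compiled
# as an ICU regular expression. "\u7cd6" is 糖.
hits <- df %>% filter(str_detect(village, fixed("\u7cd6")))
nrow(hits)
```

The regex error in the question shows a context of `聶}` rather than the character that was typed, which is consistent with the pattern bytes being decoded in a different encoding before reaching the regex engine. An escape sidesteps the script-file encoding, but it does not help if the column itself is mis-encoded.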