R stringr Parse Cases Letters

Asked by: bvowe  Asked: 11/9/2023  Last edited by: thelatemail, bvowe  Updated: 11/11/2023  Views: 47

Q:

HAVE            WANT1   WANT2
CLStephen Five  CL      Stephen Five
RTQQuent Lou X  RTQ     Quent Lou X

There are data entry errors in our school system. I have the column "HAVE" and want to split it into "WANT1" and "WANT2".

WANT1 = take the first n-1 CAPITAL letters
WANT2 = take the remaining letters
r stringr

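Comments

1 upvote bvowe 11/9/2023
@Friede The data is shown in the table, and the description is below it.
3 upvotes thelatemail 11/9/2023
Looking at the raw text in the edit window, there are tabs separating the columns. Unfortunately, Stack Overflow does not respect them in the displayed output. I have now aligned the text based on the tabs.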

A:

2 upvotes thelatemail 11/9/2023 #1

Try this in stringr and base R:

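x <- c("CLStephen Five","RTQQuent Lou X")

library(stringr)
# The pattern matches from the first capital that is followed by a
# non-capital character through to the end of the string.
# Removing that match leaves WANT1; extracting it gives WANT2.
str_remove(x, "[A-Z][^A-Z].+")
#[1] "CL"  "RTQ"
str_extract(x, "[A-Z][^A-Z].+")
#[1] "Stephen Five" "Quent Lou X"

# Base R equivalents
sub("[A-Z][^A-Z].+", "", x)
#[1] "CL"  "RTQ"
sub("[A-Z]+([A-Z][^A-Z].+)", "\\1", x)
#[1] "Stephen Five" "Quent Lou X"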
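
One edge case worth checking before running this on real data is a value with only a single leading capital, i.e. no prefix to strip. The sketch below reuses the two base R patterns from above on a hypothetical third input, "Stephen Five", which is not in the original question. Under that assumption, WANT1 should come back empty and WANT2 should be left unchanged.

```r
# Hypothetical extra input: "Stephen Five" has no capital-letter prefix.
x <- c("CLStephen Five", "RTQQuent Lou X", "Stephen Five")

# Same patterns as in the answer above.
want1 <- sub("[A-Z][^A-Z].+", "", x)             # drop from the name's first letter onward
want2 <- sub("[A-Z]+([A-Z][^A-Z].+)", "\\1", x)  # no match -> string returned as-is

result <- data.frame(HAVE = x, WANT1 = want1, WANT2 = want2,
                     stringsAsFactors = FALSE)
result
```

If an empty WANT1 is not what you want for such rows, they can be found afterwards with `result$WANT1 == ""`.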

1 upvote Adriano Mello 11/11/2023 #2

Another, newer solution: tidyr::separate_wider_regex

library(dplyr)
library(tidyr)

df <- tibble(have = c("CLStephen Five","RTQQuent Lou X"))

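# want1: the leading capitals, stopping before the last one
#        (the lookahead checks what follows without consuming it)
# want2: that last capital, one non-capital character, and the rest
separate_wider_regex(
  df,
  cols = have,
  patterns = c(
    want1 = "[A-Z]+(?=[A-Z][^A-Z])",
    want2 = "[A-Z][^A-Z].*"))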

# A tibble: 2 × 2
#   want1 want2
#   <chr> <chr>
# 1 CL    Stephen Five
# 2 RTQ   Quent Lou X
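
For comparison, the same one-step split into two columns can be sketched in base R with two capture groups. This is only an illustration, not part of the answer above: the names `pattern`, `parts` and `result` are made up here, `perl = TRUE` is used so that `[A-Z]` does not depend on the locale, and the code assumes every input string matches the pattern (a non-matching string would need separate handling).

```r
x <- c("CLStephen Five", "RTQQuent Lou X")

# Group 1: the leading capitals except the last one.
# Group 2: the last leading capital plus everything after it.
pattern <- "^([A-Z]+)([A-Z][^A-Z].*)$"

# regexec() reports the full match and each capture group;
# regmatches() turns those positions into character vectors.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# One row per input string; column 1 is the full match.
parts <- do.call(rbind, m)

result <- data.frame(
  HAVE  = x,
  WANT1 = parts[, 2],
  WANT2 = parts[, 3],
  stringsAsFactors = FALSE
)
result
```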