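Remove words from string if duplicated in a group of rows

Q:

I have a table in which the first cell of each row (V1) holds a long string containing at least two shorter strings. I need to choose among them and keep only one of the shorter strings in the longer string.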

  library(qdap)
  library(magrittr)
  t<-  read.table(text="
V1,V2
  Video of all presentations and discussionPart 1Part 2Video,Part 1
  Video of all presentations and discussionPart 1Part 2Video,Part 2
  Video of all presentations and discussionPart 1Part 2Video,Video
  Background - PDFVideo Soil management and update (pdf)Video,PDF
  Background - PDFVideo Soil management and update (pdf)Video,Video
  Background - PDFVideo Soil management and update (pdf)Video,Soil management and update (pdf)
  Background - PDFVideo Soil management and update (pdf)Video,Video
",
                  header=T,sep = ",")

So, for this example, I want to omit "Part 2" from V1 in the first row, and omit "Part 1" from V1 in the second row.

Here is what I tried:

t%>%  
  split(.,.$V1)%>% 
    lapply(.,function(x){(unique(x$V2))})%>%
      lapply(.,function(y){mgsub(pattern=y[[1]],replacement="",names(y))})

This attempt neither changes the longer string nor keeps the unique shorter string.

The answer should look like this:

t<-  read.table(text="
V1,V2
Video of all presentations and discussionPart 1,Part 1
Video of all presentations and discussionPart 2,Part 2
Video of all presentations and discussionVideo,Video
Background - PDF,PDF
Background - Video,Video
Background - Soil management and update (pdf),Soil management and update (pdf)
Background - Video,Video
",
header=T,sep = ",")

r lapply gsub

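Without more explicit exclusion criteria, there is no good way to determine what to exclude, because V2 only defines the target string to keep. Sometimes the V2 value appears twice in V1 and you want to keep both occurrences (e.g. "Video of all ... Video"), but other times it appears twice and only one occurrence is wanted (e.g. "Video" in "PDFVideo ... (pdf)Video"). Perhaps consider adding a "safe prefix" or "safe segment" column that identifies the parts that should never be excluded? Basically, you need a process that encodes any known boundaries or exceptions in the data. That is the part other people will find hard to help with.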
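To make that ambiguity concrete, here is a minimal base R sketch that counts literal occurrences of a V2 value inside a V1 string. The helper name `count_fixed` and the variables `v1_a` and `v1_b` are hypothetical, introduced only for illustration; the two strings are copied from the question's data.

```r
# Count non-overlapping literal (non-regex) occurrences of `needle` in `haystack`.
# `count_fixed` is a hypothetical helper name, not something from the question.
count_fixed <- function(needle, haystack) {
  hits <- gregexpr(needle, haystack, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

v1_a <- "Video of all presentations and discussionPart 1Part 2Video"
v1_b <- "Background - PDFVideo Soil management and update (pdf)Video"

count_fixed("Video", v1_a)  # 2
count_fixed("Video", v1_b)  # 2
```

Both strings contain "Video" twice, so the count alone cannot tell which occurrence belongs to the text that should always be kept.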

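A:

1 upvote andrew_reece 9/2/2023 #1

If everything before the "-" is correct, and the only thing you want after the dash is the string in V2, then you can grab the first part of the string and concatenate it with V2: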

library(tidyverse)

t |> 
  mutate(str_segment = str_split(V1, "-", n = 2)) |> 
  unnest_wider(str_segment, names_sep = "_") |> 
  mutate(new_v1 = paste0(str_segment_1, "-", V2)) |> 
  select(new_v1, V2)

# A tibble: 5 × 2
  new_v1                            V2      
  <chr>                             <chr>   
1 "  Video of Animals-Elephant"     Elephant
2 "  Video of Animals-Rhino"        Rhino   
3 "  Audio at loud volume-Sirens"   Sirens  
4 "  Audio at loud volume-Horns"    Horns   
5 "  Audio at loud volume-Crickets" Crickets

Alternatively:

t |> 
  mutate(prefix = map_chr(str_split(V1, "-", n = 2), \(x) pluck(x, 1)),
         new_v1 = paste0(prefix, "-", V2)) |> 
  select(new_v1, V2)
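The first group in the question's data has no "-" separator, so splitting on the dash would probably not isolate a prefix there. As a rough alternative sketch in base R, one could strip the labels from the end of V1 instead. This assumes, without confirmation from the question, that the unwanted labels always form a contiguous run at the end of V1 (optionally separated by whitespace) and that every label in that run appears in V2 for the same V1 group. The names `df`, `strip_trailing_labels`, and `new_V1` are hypothetical.

```r
# Question data rebuilt inline (without the leading spaces read.table would keep).
df <- data.frame(
  V1 = c(rep("Video of all presentations and discussionPart 1Part 2Video", 3),
         rep("Background - PDFVideo Soil management and update (pdf)Video", 4)),
  V2 = c("Part 1", "Part 2", "Video",
         "PDF", "Video", "Soil management and update (pdf)", "Video"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove a trailing run of known labels from `v1`.
# \Q...\E quotes each label literally (PCRE), so "(pdf)" is not treated as a group.
# Assumption: no label contains the sequence "\E".
strip_trailing_labels <- function(v1, labels) {
  quoted <- paste0("\\Q", unique(labels), "\\E", collapse = "|")
  pattern <- paste0("(?:(?:", quoted, ")\\s*)+$")
  sub(pattern, "", v1, perl = TRUE)
}

# For each distinct V1, compute the prefix once, then append each row's V2.
df$new_V1 <- NA_character_
for (key in unique(df$V1)) {
  rows <- df$V1 == key
  prefix <- strip_trailing_labels(key, df$V2[rows])
  df$new_V1[rows] <- paste0(prefix, df$V2[rows])
}

df[, c("new_V1", "V2")]
```

Because the pattern is anchored at the end of the string, the leading "Video" in the first group is left alone while the trailing one is removed. If the labels can appear anywhere other than the end, this sketch does not apply and the boundaries would need to be encoded some other way.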
