R: Removing all quoted values in a string

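Question:

I'm starting my first text analysis project in R using Twitter data, and in the preprocessing stage I'm trying to remove all values that appear inside quotation marks. I've found code that removes the quotation marks themselves but not the values inside them (e.g., “Hello World” becomes Hello World), but nothing that consistently removes both the values and the quotation marks (e.g., This is a “quoted text” becomes This is a).

I have anonymized a sample data frame from what I'm working with (keeping the exact formatting of these particular tweets; only the content has been changed):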


    df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                              "Text \"this is a quote.\" More text. https://t.co/"))

For this data frame, the goal is to end up with:

Example: https://t.co/ -  MORE TEXT - example: 

Text More text. https://t.co/

I have tried these:

df$text <- gsub('"[^"]+"', '', df$text)

df$text <- gsub('".*"', '', df$text)

df$text <- gsub("[\"'].*['\"]","", df$text)

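But I found that they only succeed in removing the quote in the second observation, not the first. I suspect this may have to do with how the quotes in the second observation were imported from Twitter, escaped with \. I'm not sure whether this assumption is correct, and if it is, I'm not sure how to overcome it. Any help would be appreciated!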

r regex gsub quotes

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))

df$text |>
  gsub('(“|")[^”"“]*(”|")', '', x = _) |>
  gsub('(“|")[^”"]*(”|")', '', x = _)
#> [1] "Example:  https://t.co/ -  MORE TEXT - example: "
#> [2] "Text  More text. https://t.co/"
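For context on why the attempts in the question matched only the second row, here is a minimal check on shortened copies of the example strings (the vector `x` is an illustration). It suggests the first tweet uses curly quotes (U+201C and U+201D) rather than the ASCII `"`, and that `\"` is only R's escape syntax for writing a quote inside a string literal, not a backslash stored in the data.

```r
# Shortened copies of the two example tweets; \u201c and \u201d are the curly quotes
x <- c("Example: \u201cThis is a quote!\u201d https://t.co/",
       "Text \"this is a quote.\" More text. https://t.co/")

# The ASCII double quote occurs only in the second string
has_ascii_quote <- grepl('"', x, fixed = TRUE)
has_ascii_quote
#> [1] FALSE  TRUE

# The curly opening quote occurs only in the first string
has_curly_quote <- grepl("\u201c", x, fixed = TRUE)
has_curly_quote
#> [1]  TRUE FALSE

# No backslash is stored in either string; "\\" is a single literal backslash
has_backslash <- grepl("\\", x, fixed = TRUE)
has_backslash
#> [1] FALSE FALSE
```

A pattern that mentions only `"` therefore never sees the quotes in the first row, which is why the patterns in this answer list both kinds of quote characters.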

Tidyverse:

df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”", 
                          "Text \"this is a quote.\" More text. https://t.co/"))
df$text
#> [1] "Example: “This is a quote!” https://t.co/ -  MORE TEXT - example: “more text... “quote inside a quote” finished.”"
#> [2] "Text \"this is a quote.\" More text. https://t.co/"

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df %>% 
  mutate(text = str_remove_all(text, '(“|")[^”"“]*(”|")'),
         text = str_remove_all(text, '(“|")[^”"]*(”|")'))
#>                                               text
#> 1 Example:  https://t.co/ -  MORE TEXT - example: 
#> 2                   Text  More text. https://t.co/
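The two passes above work because each pass strips the innermost quoted spans, so two passes cover one level of nesting. As a sketch for data with unknown nesting depth, the same idea can be repeated until the text stops changing. The helper name `remove_quoted` and the `max_iter` cap are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: strip innermost quoted spans repeatedly until nothing changes
remove_quoted <- function(x, max_iter = 10L) {
  # Innermost span: an opening quote, no quote characters inside, a closing quote
  innermost <- '(\u201c|")[^\u201c\u201d"]*(\u201d|")'
  for (i in seq_len(max_iter)) {
    y <- gsub(innermost, "", x)
    if (identical(y, x)) break  # no quoted spans left to remove
    x <- y
  }
  x
}

# Same content as the question's example data, with curly quotes written as escapes
tweets <- c("Example: \u201cThis is a quote!\u201d https://t.co/ -  MORE TEXT - example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d",
            "Text \"this is a quote.\" More text. https://t.co/")

cleaned <- remove_quoted(tweets)
cleaned
#> [1] "Example:  https://t.co/ -  MORE TEXT - example: "
#> [2] "Text  More text. https://t.co/"
```

The cap on iterations is only a safeguard; on the example data the loop stops after the third pass, when no further change occurs.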

gsub('("([^"]|(?R))*")|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"

gsub('"[^"]*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"                  

gsub('".*"|(“([^“”]|(?R))*”)', '', df$text, perl=TRUE)
#[1] "Example:  https://t.co/ -  MORE TEXT - example: "
#[2] "Text  More text. https://t.co/"
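The `(?R)` token in these patterns is PCRE's recursion construct, which is why `perl = TRUE` is needed: it re-enters the whole pattern at that point, so a quoted span may itself contain quoted spans. A small sketch on a made-up string with three levels of curly-quote nesting (the string is an illustration, not taken from the question's data):

```r
# Three levels of nested curly quotes between "a" and "b"
nested <- "a \u201cone \u201ctwo \u201cthree\u201d two\u201d one\u201d b"

# Opening quote, then any mix of non-quote characters or nested quoted spans, then closing quote
recursive_pattern <- "\u201c([^\u201c\u201d]|(?R))*\u201d"

flattened <- gsub(recursive_pattern, "", nested, perl = TRUE)
flattened
#> [1] "a  b"
```

Unlike the multi-pass approach, this removes balanced nesting of any depth in a single call, at the cost of requiring properly paired opening and closing quotes.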

2 upvotes Chris Ruehlemann 5/4/2023 #3

Here is a solution using a single pattern:

library(tidyverse)
df %>%
  mutate(text = str_remove_all(text, '"[^"]+"|“[^“”]+”|“.+”'))
                                              text
1 Example:  https://t.co/ -  MORE TEXT - example: 
2                   Text  More text. https://t.co/

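The pattern handles the variability shown in the text column by using three alternative patterns:

"[^"]+": first alternative: removes whatever is wrapped in "
“[^“”]+”: second alternative: removes whatever is wrapped in “ and ”
“.+”: third alternative: removes nested quotes wrapped in “ and ”

If your actual data also has nested quotes inside " ", that can be accounted for with yet another alternative.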
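To see which alternative fires where, the matches can be listed instead of removed. This is a sketch in base R with `perl = TRUE` rather than stringr (an assumption made to keep it dependency-free; stringr uses a different regex engine, though the alternation should behave the same way here).

```r
# First example tweet, with curly quotes written as escapes
tweet <- "Example: \u201cThis is a quote!\u201d https://t.co/ -  MORE TEXT - example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d"

# The three alternatives from the answer above
three_way <- '"[^"]+"|\u201c[^\u201c\u201d]+\u201d|\u201c.+\u201d'

# List every span the pattern would remove
matched <- regmatches(tweet, gregexpr(three_way, tweet, perl = TRUE))[[1]]
matched
#> [1] "“This is a quote!”"
#> [2] "“more text... “quote inside a quote” finished.”"
```

The first span is caught by the second alternative. The second span is caught by the third, greedy alternative, which runs from its opening “ to the last ” in the string. One consequence worth checking against real data: if a nested quote is followed later in the same string by another, separate quote, the greedy alternative would also remove the unquoted text between them.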
