Asked by: user13893942 Asked: 11/8/2023 Updated: 11/8/2023 Views: 44
How to slice a text string so it takes X number of words after a number?
Q:
I am using R to work with some inconsistently formatted text data. Below is a sample of my address data, which is stored in a column named "address".
dispatchid <- c(1,2,3,4,5,6)
address <- c("123 test st", "in front of 456 second st", "across the parking lot of 123 fourty one ave east","678 fourth ave w", "hospital, back entrance, 890 fifth road east", "123 sixth blvd apartment 111")
data <- data.frame(dispatchid, address)
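As you can see, some rows store only the address, like rows 1 and 4. Other rows have extra information in front of the address.
What I want to do is extract only the address portion, so that my result looks like this: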
dispatchid <- c(1,2,3,4,5,6)
address <- c("123 test st", "456 second st", "123 fourty one ave east","678 fourth ave w", "890 fifth road east", "123 sixth blvd")
data <- data.frame(dispatchid, address)
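I thought I could use word() with the code below, but it only gives me the first 4 words. That does not work for addresses that have a lot of leading text in front of the actual address (i.e. row 5).
library(stringr)
word(data$address, start = 1, end = 4, sep = " ")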
A:
2 votes
Jonathan V. Solórzano
11/8/2023
#1
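You can use a regular expression to extract the part of the string you need. In this case, the regex says: extract the part of the string that starts with digits (\\d+), is followed by anything (.+), and ends at the end of the string ($).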
library(dplyr)
library(stringr)
data |>
mutate(newaddress = str_extract(address, "\\d+.+$"))
#    dispatchid  address                                            newaddress
# 1  1           123 test st                                        123 test st
# 2  2           in front of 456 second st                          456 second st
# 3  3           across the parking lot of 123 fourty one ave east  123 fourty one ave east
# 4  4           678 fourth ave w                                   678 fourth ave w
# 5  5           hospital, back entrance, 890 fifth road east       890 fifth road east
# 6  6           123 sixth blvd apartment 111                       123 sixth blvd apartment 111
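Row 6 still carries "apartment 111", which the question wanted dropped. One possible way to trim it, as a minimal sketch only: stop the match at a street-type word, optionally followed by a direction. The suffix and direction lists below are assumptions made up for this sample data, not part of the answer above, and real addresses would need a much longer list.

```r
# Sample addresses from the question
address <- c("123 test st", "in front of 456 second st",
             "across the parking lot of 123 fourty one ave east",
             "678 fourth ave w",
             "hospital, back entrance, 890 fifth road east",
             "123 sixth blvd apartment 111")

# Assumed (hypothetical) vocabularies; extend them for real data
suffixes   <- "st|ave|blvd|road"
directions <- "east|west|north|south|e|w|n|s"

# Digits, then as little text as possible up to a street suffix,
# then an optional direction word
pattern <- paste0("\\d+.*?\\b(?:", suffixes, ")\\b",
                  "(?: (?:", directions, ")\\b)?")

m <- regexpr(pattern, address, perl = TRUE)
trimmed <- rep(NA_character_, length(address))  # stays NA where nothing matches
trimmed[m > 0] <- regmatches(address, m)
trimmed
```

The lazy quantifier .*? is what keeps the match from running on to the end of the string, and perl = TRUE is used so that the lazy quantifier and \\b behave as in PCRE.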
4 votes
Onyambu
11/8/2023
#2
Use sub to remove all non-digits at the start of the string:
data$address <- sub("^\\D+", "", data$address)
data
  dispatchid  address
1 1           123 test st
2 2           456 second st
3 3           123 fourty one ave east
4 4           678 fourth ave w
5 5           890 fifth road east
6 6           123 sixth blvd apartment 111
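One edge case worth checking, sketched here in base R: if a string contains no digit at all, ^\\D+ matches the whole string, so sub() returns an empty string rather than NA. The input vector and the guard below are illustrative assumptions, not part of the answer above.

```r
# Second entry is a made-up address with no number in it
x <- c("in front of 456 second st", "no number here")

# Same call as in the answer: drop leading non-digits
stripped <- sub("^\\D+", "", x)

# Optional guard: report NA when there is no digit to anchor on
guarded <- ifelse(grepl("\\d", x), stripped, NA_character_)

stripped
guarded
```

Whether an empty string or NA is the better result depends on how the column is used downstream.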
Comments
0 votes
Rui Barradas
11/8/2023
Much simpler than mine, upvoted.
1 vote
Rui Barradas
11/8/2023
#3
Here is a base R solution.
dispatchid <- c(1,2,3,4,5,6)
address <- c("123 test st", "in front of 456 second st", "across the parking lot of 123 fourty one ave east","678 fourth ave w", "hospital, back entrance, 890 fifth road east", "123 sixth blvd apartment 111")
data <- data.frame(dispatchid, address)
sub("^[^\\d]* (\\d+.*$)", "\\1", data$address)
#> [1] "123 test st" "456 second st"
#> [3] "123 fourty one ave east" "678 fourth ave w"
#> [5] "890 fifth road east" "123 sixth blvd apartment 111"
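Created on 2023-11-07 with reprex v2.0.2
Regex explanation:
"^" is the beginning of the string;
"^[^\\d]*" at the beginning of the string, finds any character other than a digit, zero or more times;
"\\d+.*$" is a digit one or more times, followed by any character zero or more times, up to the end of the string;
"^[^\\d]* (\\d+.*$)" is the two patterns above separated by a space character. The second pattern is the first capture group, since it is between parentheses.
If the regex matches, the match is replaced by the first capture group, "\\1", so only the substring that starts at the first digit and ends at the end of the input string is kept.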
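To see the capture group at work, here is a small base R sketch using regexec() and regmatches(). It writes the non-digit class as \\D instead of [^\\d], which is a substitution made for this illustration and not the pattern from the answer. Because the pattern requires a space right before the digits, a string that already starts with its number does not match at all, and sub() returns it unchanged.

```r
x <- c("in front of 456 second st", "123 test st")

# Assumption: \\D stands in for the answer's [^\\d]
pattern <- "^\\D* (\\d+.*$)"

# For each string: the full match followed by capture group 1,
# or character(0) when the pattern does not match
m <- regmatches(x, regexec(pattern, x))
m

# Replace the whole match with group 1; non-matching input is left as is
result <- sub(pattern, "\\1", x)
result
```

Here m[[1]] holds the full match and the captured address, while m[[2]] is empty because "123 test st" has no leading text followed by a space.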