Asked by: John J. Asked: 6/13/2023 Last edited by: jay.sf, John J. Updated: 6/14/2023 Views: 50
How to write a function which accepts its own output as input and runs in a loop until a condition is met?
Q:
I am matching messy data across multiple fields. Here is a toy example.
library(tidyverse)
df <- structure(list(id = c("0049984000", "3502234000", "4029979100",
"4331301000", "4690309000", "4690487000", "4690686000", "4702065000",
"4980108200"), OWNER_NAME_1 = c("CLAUDEAN L RING REV TRUST",
"S2 REAL ESTATE GROUP 5 LLC", "SAM STAIR", "S2 REAL ESTATE GROUP",
"S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC",
"S2 REAL ESTATE", "S2 REAL EST GROUP"),
OWNER_MAIL_ADDR = c("2045 PARADISE DR",
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE", "2925 W LINCOLN AVE",
"2925 W LINCOLN AV", "2925 W LINCOLN AVE", "2925 W LINCOLN AV",
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE"),
OWNER_CITY_STATE = c("WEST BEND, WI",
"HALES CORNERS, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI",
"MILWAUKEE, WI", "MILWAUKEE, WI", "HALES CORNERS, WI", "MILWAUKEE, WI"
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
df
# A tibble: 9 × 4
  id         OWNER_NAME_1               OWNER_MAIL_ADDR     OWNER_CITY_STATE
  <chr>      <chr>                      <chr>               <chr>
1 0049984000 CLAUDEAN L RING REV TRUST  2045 PARADISE DR    WEST BEND, WI
2 3502234000 S2 REAL ESTATE GROUP 5 LLC 11512 W WOODSIDE DR HALES CORNERS, WI
3 4029979100 SAM STAIR                  2925 W LINCOLN AVE  MILWAUKEE, WI
4 4331301000 S2 REAL ESTATE GROUP       2925 W LINCOLN AVE  MILWAUKEE, WI
5 4690309000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AV   MILWAUKEE, WI
6 4690487000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AVE  MILWAUKEE, WI
7 4690686000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AV   MILWAUKEE, WI
8 4702065000 S2 REAL ESTATE             11512 W WOODSIDE DR HALES CORNERS, WI
9 4980108200 S2 REAL EST GROUP          2925 W LINCOLN AVE  MILWAUKEE, WI
This function accepts a vector of OWNER_NAME_1 values and returns all the OWNER_NAME_1 values which share an address with them.
# this function identifies all the OTHER names which share an address with the given name(s)
connect_owners_by_address <- function(landlord_names){
# parcels owned by given landlord name(s)
addresses1 <- df %>%
filter(OWNER_NAME_1 %in% landlord_names) %>%
group_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE) %>%
summarise() %>%
ungroup()
# all owner names at addresses associated with first name(s)
names.at.addresses <- df %>%
inner_join(addresses1, by = join_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE)) %>%
group_by(OWNER_NAME_1) %>%
summarise()
names.at.addresses$OWNER_NAME_1
}
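The output of the function (a character vector of names) has the same format as the input. By running the function again on its own output, I can identify more matches.
For example: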
# run once (4 matches)
connect_owners_by_address("SAM STAIR")
[1] "S2 REAL EST GROUP" "S2 REAL ESTATE GROUP" "S2 REAL ESTATE GROUP 5 LLC" "SAM STAIR"
# run twice (5 matches)
connect_owners_by_address("SAM STAIR") |>
connect_owners_by_address()
[1] "S2 REAL EST GROUP" "S2 REAL ESTATE" "S2 REAL ESTATE GROUP" "S2 REAL ESTATE GROUP 5 LLC"
[5] "SAM STAIR"
# run 3 times (still just 5 matches)
connect_owners_by_address("SAM STAIR") |>
connect_owners_by_address() |>
connect_owners_by_address()
[1] "S2 REAL EST GROUP" "S2 REAL ESTATE" "S2 REAL ESTATE GROUP" "S2 REAL ESTATE GROUP 5 LLC"
[5] "SAM STAIR"
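I would like to wrap my connect_owners_by_address function in another function which runs recursively on its own output and stops when the length of the output equals the length of the input (in this case, 5).
I think this will involve a while loop, but I can't figure out how to feed the function's output back into its input. Any suggestions are appreciated.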
A:
2 upvotes
jay.sf
6/14/2023
#1
You are looking for a recursive function. You could try this.
f <- function(x, d=df) {
m <- unique(merge(d, d[d$OWNER_NAME_1 %in% x, c("OWNER_MAIL_ADDR", "OWNER_CITY_STATE")])[, 'OWNER_NAME_1'])
if (identical(length(m), length(x))) {
return(x)
} else {
f(m, d)
}
}
f("SAM STAIR")
# [1] "S2 REAL ESTATE GROUP 5 LLC" "S2 REAL ESTATE" "SAM STAIR" "S2 REAL EST GROUP"
# [5] "S2 REAL ESTATE GROUP"
No additional packages are involved.
Data:
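The merge() one-liner inside f packs three steps together. As a rough illustration only, here they are unpacked on a hypothetical three-row data frame (toy, addr, at_addr and the owner names A, B, C are made up for this sketch and are not part of the original data):

```r
# Hypothetical toy data: A and B share an address, C does not.
toy <- data.frame(
  OWNER_NAME_1     = c("A", "B", "C"),
  OWNER_MAIL_ADDR  = c("1 MAIN ST", "1 MAIN ST", "9 OAK AVE"),
  OWNER_CITY_STATE = c("MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI"),
  stringsAsFactors = FALSE
)

x <- "A"

# Step 1: address columns of every row owned by a name in x
addr <- toy[toy$OWNER_NAME_1 %in% x, c("OWNER_MAIL_ADDR", "OWNER_CITY_STATE")]

# Step 2: merge() joins on the columns both data frames share
# (here the two address columns), keeping only rows at those addresses
at_addr <- merge(toy, addr)

# Step 3: de-duplicated owner names found at those addresses
m <- unique(at_addr[, "OWNER_NAME_1"])
```

Here m holds both A and B, so a second pass starting from m would find nothing new, which is the stopping condition f checks by comparing lengths.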
df <- structure(list(id = c("0049984000", "3502234000", "4029979100",
"4331301000", "4690309000", "4690487000", "4690686000", "4702065000",
"4980108200"), OWNER_NAME_1 = c("CLAUDEAN L RING REV TRUST",
"S2 REAL ESTATE GROUP 5 LLC", "SAM STAIR", "S2 REAL ESTATE GROUP",
"S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC",
"S2 REAL ESTATE", "S2 REAL EST GROUP"), OWNER_MAIL_ADDR = c("2045 PARADISE DR",
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE", "2925 W LINCOLN AVE",
"2925 W LINCOLN AV", "2925 W LINCOLN AVE", "2925 W LINCOLN AV",
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE"), OWNER_CITY_STATE = c("WEST BEND, WI",
"HALES CORNERS, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI",
"MILWAUKEE, WI", "MILWAUKEE, WI", "HALES CORNERS, WI", "MILWAUKEE, WI"
)), row.names = c(NA, -9L), class = "data.frame")
1 upvote
one
6/14/2023
#2
One approach is to run the function iteratively and explicitly set a maximum number of iterations, which avoids an infinite loop.
# note: the df argument is not used in the body; connect_owners_by_address() reads the global df
run_myfunction_iteratively <- function(df,input,niter=1000){
for(i in seq(niter)){
message(paste0("iteration: ",i))
if(i==1){
old_out <- new_out <- connect_owners_by_address(input)
}else{
new_out <- connect_owners_by_address(new_out)
if(length(old_out)==length(new_out)){
return(new_out)
}else{
old_out <- new_out
}
}
}
warning(paste0("The function does not converge in ",niter," iterations."))
}
run_myfunction_iteratively(df,"SAM STAIR",niter=10)
iteration: 1
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
iteration: 2
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
iteration: 3
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
[1] "S2 REAL EST GROUP" "S2 REAL ESTATE" "S2 REAL ESTATE GROUP"
[4] "S2 REAL ESTATE GROUP 5 LLC" "SAM STAIR"
1 upvote
Ricardo Semião e Castro
6/14/2023
#3
Recursive wrapper:
connect_owners_by_address_recursive <- function(landlord_names, length.out){
names.at.addresses <- connect_owners_by_address(landlord_names)
if(length(names.at.addresses) < length.out){
names.at.addresses <- connect_owners_by_address_recursive(names.at.addresses, length.out)
}
names.at.addresses
}
While-loop wrapper:
connect_owners_by_address_while <- function(landlord_names, length.out){
names.at.addresses <- connect_owners_by_address(landlord_names)
while(length(names.at.addresses) < length.out){
names.at.addresses <- connect_owners_by_address(names.at.addresses)
}
names.at.addresses
}
Converting the original function into a recursive one:
connect_owners_by_address <- function(landlord_names, length.out = NULL){
# parcels owned by given landlord name(s)
addresses1 <- df %>%
filter(OWNER_NAME_1 %in% landlord_names) %>%
group_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE) %>%
summarise() %>%
ungroup()
# all owner names at addresses associated with first name(s)
names.at.addresses <- df %>%
inner_join(addresses1, by = join_by(OWNER_CITY_STATE, OWNER_MAIL_ADDR)) %>%
group_by(OWNER_NAME_1) %>%
summarise() %>%
pull(OWNER_NAME_1)
if(!is.null(length.out) && length(names.at.addresses) < length.out){
# pass length.out along, otherwise the recursion stops after one extra pass
connect_owners_by_address(names.at.addresses, length.out)
} else{
names.at.addresses
}
}
Note: in my opinion, using pull() makes the function cleaner. Also, if you don't want to use it recursively, just don't pass length.out.
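The three variants in this answer need length.out to be known in advance, and if that length is never reached the while loop will not terminate (the recursive versions would recurse until R raises an error). A minimal sketch of a wrapper that instead stops once another pass adds nothing new is shown below. The names iterate_until_stable, step, max_iter and toy_step are hypothetical, and toy_step is only a base-R stand-in for connect_owners_by_address() so that the block runs on its own:

```r
# Generic fixed-point wrapper: step is any function whose output can be
# fed back in as its input (hypothetical helper, not from the thread).
iterate_until_stable <- function(step, x, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    new_x <- step(x)
    # Stop as soon as another pass returns the same set of values
    if (setequal(new_x, x)) {
      return(new_x)
    }
    x <- new_x
  }
  warning("No stable result after ", max_iter, " iterations.")
  x
}

# Toy step function standing in for connect_owners_by_address():
# each pass adds the next integer, capped at 5.
toy_step <- function(x) sort(unique(c(x, pmin(x + 1, 5))))

result <- iterate_until_stable(toy_step, 1)
```

With the question's function defined, the call would presumably be iterate_until_stable(connect_owners_by_address, "SAM STAIR"). Comparing with setequal() rather than by length is a design choice of this sketch, not something the answers above use.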