How to write a function which accepts its own output as input and runs in a loop until a condition is met?

Asked by: John J. Asked: 6/13/2023 Last edited by: jay.sf, John J. Updated: 6/14/2023 Views: 50

Q:

I am matching messy data across multiple fields. Here is a toy example.

library(tidyverse)
df <- structure(list(id = c("0049984000", "3502234000", "4029979100", 
                            "4331301000", "4690309000", "4690487000", "4690686000", "4702065000", 
                            "4980108200"), OWNER_NAME_1 = c("CLAUDEAN L RING REV TRUST", 
                                                            "S2 REAL ESTATE GROUP 5 LLC", "SAM STAIR", "S2 REAL ESTATE GROUP", 
                                                            "S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", 
                                                            "S2 REAL ESTATE", "S2 REAL EST GROUP"),
                     OWNER_MAIL_ADDR = c("2045 PARADISE DR", 
                                         "11512 W WOODSIDE DR", "2925 W LINCOLN AVE", "2925 W LINCOLN AVE", 
                                         "2925 W LINCOLN AV", "2925 W LINCOLN AVE", "2925 W LINCOLN AV", 
                                         "11512 W WOODSIDE DR", "2925 W LINCOLN AVE"), 
                     OWNER_CITY_STATE = c("WEST BEND, WI", 
                                          "HALES CORNERS, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", 
                                          "MILWAUKEE, WI", "MILWAUKEE, WI", "HALES CORNERS, WI", "MILWAUKEE, WI"
                     )), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
                     ))

df
# A tibble: 9 × 4
  id         OWNER_NAME_1               OWNER_MAIL_ADDR     OWNER_CITY_STATE 
  <chr>      <chr>                      <chr>               <chr>            
1 0049984000 CLAUDEAN L RING REV TRUST  2045 PARADISE DR    WEST BEND, WI    
2 3502234000 S2 REAL ESTATE GROUP 5 LLC 11512 W WOODSIDE DR HALES CORNERS, WI
3 4029979100 SAM STAIR                  2925 W LINCOLN AVE  MILWAUKEE, WI    
4 4331301000 S2 REAL ESTATE GROUP       2925 W LINCOLN AVE  MILWAUKEE, WI    
5 4690309000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AV   MILWAUKEE, WI    
6 4690487000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AVE  MILWAUKEE, WI    
7 4690686000 S2 REAL ESTATE GROUP 5 LLC 2925 W LINCOLN AV   MILWAUKEE, WI    
8 4702065000 S2 REAL ESTATE             11512 W WOODSIDE DR HALES CORNERS, WI
9 4980108200 S2 REAL EST GROUP          2925 W LINCOLN AVE  MILWAUKEE, WI    

This function accepts a vector of OWNER_NAME_1 values and returns all OWNER_NAME_1 values that share an address with them.

# this function identifies all the OTHER names which share an address with the given name(s)
connect_owners_by_address <- function(landlord_names){
  # parcels owned by given landlord name(s)
  addresses1 <- df %>%
    filter(OWNER_NAME_1 %in% landlord_names) %>%
    group_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE) %>%
    summarise() %>%
    ungroup()
  
  # all owner names at addresses associated with first name(s)
  names.at.addresses <- df %>%
    inner_join(addresses1, by = join_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE)) %>%
    group_by(OWNER_NAME_1) %>%
    summarise()
  
  names.at.addresses$OWNER_NAME_1
}

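The function's output (a character vector of names) has the same format as its input. By running the function again on its own output, I can identify more matches.

For example: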


# run once (4 matches)
connect_owners_by_address("SAM STAIR")
[1] "S2 REAL EST GROUP"          "S2 REAL ESTATE GROUP"       "S2 REAL ESTATE GROUP 5 LLC" "SAM STAIR"   

# run twice (5 matches)
connect_owners_by_address("SAM STAIR") |>
  connect_owners_by_address()
[1] "S2 REAL EST GROUP"          "S2 REAL ESTATE"             "S2 REAL ESTATE GROUP"       "S2 REAL ESTATE GROUP 5 LLC"
[5] "SAM STAIR" 

# run 3 times (still just 5 matches)
connect_owners_by_address("SAM STAIR") |>
  connect_owners_by_address() |>
  connect_owners_by_address()
[1] "S2 REAL EST GROUP"          "S2 REAL ESTATE"             "S2 REAL ESTATE GROUP"       "S2 REAL ESTATE GROUP 5 LLC"
[5] "SAM STAIR" 

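I would like to wrap my connect_owners_by_address function in another function that runs recursively on its own output and stops when the length of the output equals the length of the input (in this case, 5).

I think this will involve a while loop, but I cannot figure out how to feed the function's output back in as its input. Any suggestions are appreciated.

r recursion while-loop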

A:

2 upvotes jay.sf 6/14/2023 #1

You are looking for a recursive function. You could try this.

f <- function(x, d=df) {
  m <- unique(merge(d, d[d$OWNER_NAME_1 %in% x, c("OWNER_MAIL_ADDR", "OWNER_CITY_STATE")])[, 'OWNER_NAME_1'])
  if (identical(length(m), length(x))) {
    return(x)
  } else {
    f(m, d)
  }
}

f("SAM STAIR")
# [1] "S2 REAL ESTATE GROUP 5 LLC" "S2 REAL ESTATE"             "SAM STAIR"                  "S2 REAL EST GROUP"         
# [5] "S2 REAL ESTATE GROUP"  

No additional packages are required.
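
To make the dense merge() line easier to follow, here is a minimal sketch of a single expansion step, split into three parts. The data frame toy_df and the names A, B, C are made up for illustration and are not the data from the question.

```r
# Hypothetical toy data (NOT the data from the question)
toy_df <- data.frame(
  OWNER_NAME_1     = c("A", "B", "C"),
  OWNER_MAIL_ADDR  = c("1 MAIN ST", "1 MAIN ST", "9 OAK AVE"),
  OWNER_CITY_STATE = c("MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI"),
  stringsAsFactors = FALSE
)

# Step 1: addresses used by the given name(s)
addr <- toy_df[toy_df$OWNER_NAME_1 %in% "A",
               c("OWNER_MAIL_ADDR", "OWNER_CITY_STATE")]

# Step 2: with no `by` argument, merge() joins on the columns both
# data frames share (here, the two address columns), so it keeps
# every row of toy_df located at one of those addresses
matched <- merge(toy_df, addr)

# Step 3: the distinct owner names at those addresses
one_step <- unique(matched[, "OWNER_NAME_1"])
print(one_step)
```

In this toy case, A and B share an address, so one step starting from A should return both names, while C is left out.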


Data:

df <- structure(list(id = c("0049984000", "3502234000", "4029979100", 
"4331301000", "4690309000", "4690487000", "4690686000", "4702065000", 
"4980108200"), OWNER_NAME_1 = c("CLAUDEAN L RING REV TRUST", 
"S2 REAL ESTATE GROUP 5 LLC", "SAM STAIR", "S2 REAL ESTATE GROUP", 
"S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", "S2 REAL ESTATE GROUP 5 LLC", 
"S2 REAL ESTATE", "S2 REAL EST GROUP"), OWNER_MAIL_ADDR = c("2045 PARADISE DR", 
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE", "2925 W LINCOLN AVE", 
"2925 W LINCOLN AV", "2925 W LINCOLN AVE", "2925 W LINCOLN AV", 
"11512 W WOODSIDE DR", "2925 W LINCOLN AVE"), OWNER_CITY_STATE = c("WEST BEND, WI", 
"HALES CORNERS, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", "MILWAUKEE, WI", 
"MILWAUKEE, WI", "MILWAUKEE, WI", "HALES CORNERS, WI", "MILWAUKEE, WI"
)), row.names = c(NA, -9L), class = "data.frame")

1 upvote one 6/14/2023 #2

One approach is to run the function iteratively with an explicit maximum number of iterations, which avoids an infinite loop.

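# Note: the `df` argument is not used inside this function;
# connect_owners_by_address() reads `df` from the global environment.
run_myfunction_iteratively <- function(df,input,niter=1000){
  for(i in seq(niter)){
    message(paste0("iteration: ",i))
    if(i==1){
      # first pass: run on the original input
      old_out <- new_out <- connect_owners_by_address(input)
    }else{
      # later passes: feed the previous output back in
      new_out <- connect_owners_by_address(new_out)
      if(length(old_out)==length(new_out)){
        # no new names were added, so the result has converged
        return(new_out)
      }else{
        old_out <- new_out
      }
    }
  }
  warning(paste0("The function does not converge in ",niter," iterations."))
}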

run_myfunction_iteratively(df,"SAM STAIR",niter=10)
iteration: 1
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
iteration: 2
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
iteration: 3
`summarise()` has grouped output by 'OWNER_MAIL_ADDR'. You can override using the `.groups`
argument.
[1] "S2 REAL EST GROUP"          "S2 REAL ESTATE"             "S2 REAL ESTATE GROUP"      
[4] "S2 REAL ESTATE GROUP 5 LLC" "SAM STAIR"  
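
The same pattern can be written as a generic helper that takes any step function. The sketch below is only an illustration: iterate_until_stable, names_sharing_address, and toy_df are hypothetical names, and the data is made up. It uses base R only and stops when the set of values no longer changes.

```r
# Hypothetical helper: apply `step` to its own output until the set of
# values stops changing, or until `max_iter` passes have been made.
iterate_until_stable <- function(step, x, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    new_x <- step(x)
    if (setequal(new_x, x)) {
      return(new_x)  # another pass added nothing new
    }
    x <- new_x
  }
  warning("No convergence within ", max_iter, " iterations.")
  x
}

# Made-up data: A and B share one address, B and C share another
toy_df <- data.frame(
  name = c("A", "B", "B", "C", "D"),
  addr = c("1 MAIN ST", "1 MAIN ST", "2 ELM ST", "2 ELM ST", "9 OAK AVE"),
  stringsAsFactors = FALSE
)

# Toy step function: all names found at any address used by `nms`
names_sharing_address <- function(nms) {
  addrs <- unique(toy_df$addr[toy_df$name %in% nms])
  sort(unique(toy_df$name[toy_df$addr %in% addrs]))
}

result <- iterate_until_stable(names_sharing_address, "A")
print(result)
```

Comparing sets with setequal() instead of comparing lengths is a design choice. When every input name appears in the data, each pass returns a superset of its input, so both checks stop at the same point.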

1 upvote Ricardo Semião e Castro 6/14/2023 #3

Recursive wrapper:

connect_owners_by_address_recursive <- function(landlord_names, length.out){
  names.at.addresses <- connect_owners_by_address(landlord_names)
    
  if(length(names.at.addresses) < length.out){
    names.at.addresses <- connect_owners_by_address_recursive(names.at.addresses, length.out)
  }
  
  names.at.addresses
}

While-loop wrapper:

connect_owners_by_address_while <- function(landlord_names, length.out){
  names.at.addresses <- connect_owners_by_address(landlord_names)
  
  while(length(names.at.addresses) < length.out){
    names.at.addresses <- connect_owners_by_address(names.at.addresses)
  }
  
  names.at.addresses
}

Converting the original function into a recursive one:

connect_owners_by_address <- function(landlord_names, length.out = NULL){
  # parcels owned by given landlord name(s)
  addresses1 <- df %>%
    filter(OWNER_NAME_1 %in% landlord_names) %>%
    group_by(OWNER_MAIL_ADDR, OWNER_CITY_STATE) %>%
    summarise() %>%
    ungroup()
  
  # all owner names at addresses associated with first name(s)
  names.at.addresses <- df %>%
    inner_join(addresses1, by = join_by(OWNER_CITY_STATE, OWNER_MAIL_ADDR)) %>%
    group_by(OWNER_NAME_1) %>%
    summarise() %>%
    pull(OWNER_NAME_1)
  
  if(!is.null(length.out) && length(names.at.addresses) < length.out){
    # pass length.out along, otherwise the recursion stops after one extra call
    connect_owners_by_address(names.at.addresses, length.out)
  } else{
    names.at.addresses
  }
}

Note: in my opinion, using pull() makes the function cleaner. Also, if you do not want to use it recursively, do not pass length.out.
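
One caveat with a length.out target: if it is larger than the number of names that can ever be reached, the condition length(...) < length.out never becomes false and the loop or recursion does not terminate. A minimal sketch of a guarded variant follows. The names expand_until and grow are hypothetical, and grow is a made-up step function standing in for connect_owners_by_address.

```r
# Hypothetical variant of the while-loop wrapper: stop when length.out
# is reached OR when another pass adds nothing, so an unreachable
# length.out cannot cause an endless loop.
expand_until <- function(step, x, length.out) {
  current <- step(x)
  while (length(current) < length.out) {
    nxt <- step(current)
    if (length(nxt) == length(current)) break  # no growth: give up
    current <- nxt
  }
  current
}

# Made-up step function: adds the next integer, but never goes above 3
grow <- function(v) sort(unique(c(v, pmin(max(v) + 1, 3))))

reachable   <- expand_until(grow, 1, length.out = 2)
unreachable <- expand_until(grow, 1, length.out = 10)
print(reachable)
print(unreachable)
```

With length.out = 10 the toy step can never produce more than three values, so the guard ends the loop once a pass adds nothing new.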