如何检查 R 中包含混合字母数字索引的列中的重叠

How to check overlap in a column that contains mixed letter-number index in R

提问人:Hong 提问时间:10/5/2023 最后编辑:Hong 更新时间:10/5/2023 访问量:87

问:

我有一个数据帧,其中一列包含由字母后跟 3 位数字组成的代码字符串。我想检查各行之间的代码是否存在重叠。

例如(请注意,row1 和 row2 之间有一个重叠 (B258):

df <- data.frame(ICD_code = c("A581-A589, B200-B299, B354-B355", "B258, I230-I233, J201-J300, K353", "C001-C005, C020-C023, C781"),
                 category = c(1, 2, 3))

df

                              ICD_code category
    1  A581-A589, B200-B299, B354-B355        1
    2 B258, I230-I233, J201-J300, K353        2
    3       C001-C005, C020-C023, C781        3

为此,我尝试制作一个函数来对“A581-A589”进行序列,增量为 1,但这不起作用,因为 seq() 仅适用于数值。我想知道是否有人知道如何使用字母数字值制作序列?

我想要的结果:

                          ICD_code category overlap
1  A581-A589, B200-B299, B354-B355        1    T
2 B258, I230-I233, J201-J300, K353        2    T
3       C001-C005, C020-C023, C781        3    F
r 字符串 序列 data.table

评论


答:

4赞 Nir Graham 10/5/2023 #1
df <- data.frame(
  ICD_code = c("A581-A589, B200-B299, B354-B355",
               "B258, I230-I233, J201-J300, K353",
               "C001-C005, C020-C023, C781"),
  category = c(1, 2, 3)
)
library(tidyverse)


(df2 <- separate_longer_delim(df,
  cols = ICD_code,
  delim = ","
) |>
  separate_wider_delim(
    cols = ICD_code, delim = "-",
    names_sep = "x",
    too_few = "align_start"
  ) |> mutate(
    across(1:2,trimws),
    across(1:2,
      parse_number,
      .names = "num_{col}"
    ),
    first_letter = substr(ICD_codex1, 1, 1)
  ) |> rowwise() |>
  mutate(numseq = list(seq(
    from = num_ICD_codex1,
    to = pmax(num_ICD_codex1, num_ICD_codex2, na.rm = TRUE),
    by = 1))))

(df3 <- split(df2, ~first_letter))

(df4 <- map_lgl(
  df3,
  \(x){anyDuplicated(unlist(x$numseq)) > 0}
))

评论

0赞 Hong 10/5/2023
嗨,我刚刚在我自己的数据上运行了代码,我相信它只根据数字识别重复项,而不是字母,即它将 A200 识别为 D200 的副本,而我只想查找行之间是否有字母和数字重叠(即 A200-A100 的另一行, 和 D150 在 D125-D187 的另一行)等
1赞 DavoOZ 10/5/2023
拆分需要基于 ,并且需要识别重叠的重复项。这个编辑几乎可以工作,但我无法调试它:category(df3 <- split(df2, ~category)) library(questionr) (df4 <- map_lgl( df3, \(x){any(duplicated2(unlist(x$numseq)))} ))
0赞 Nir Graham 10/5/2023
@Hong第一个 split() 是任何一对的首字母代码,因此只有常见的字母代码被相互比较。看看 DF3 就知道了;因此 A200 不会与 D200 重叠
3赞 ismirsehregal 10/5/2023 #2

版本:

library(data.table)

DF <- data.frame(ICD_code = c("A581-A589, B200-B299, B354-B355",
                              "B258, I230-I233, J201-J300, K353",
                              "C001-C005, C020-C023, C781"),
                 category = c(1, 2, 3))

setDT(DF)
DT <- copy(DF)
DT <- DT[, .(ICD_code = unlist(strsplit(ICD_code, ", "))), by = category][, prefix := substring(ICD_code, 1, 1)][, suffix := lapply(strsplit(ICD_code, "-"), function(x){if(length(x) > 1){do.call(seq, as.list(as.integer(substring(x, 2))))} else {as.numeric(substring(x, 2))}})]
DT <- DT[, .(suffix = as.character(unlist(suffix))), by = .(category, ICD_code, prefix)][, ICD_code := paste0(prefix, suffix)][, overlap := (duplicated(ICD_code) | duplicated(ICD_code, fromLast=TRUE))]
DF <- merge.data.table(DF, DT[overlap == TRUE, .(category, overlap)], by = "category", all.x = TRUE)[is.na(overlap), overlap := FALSE]
DF[]

结果:

   category                         ICD_code overlap
1:        1  A581-A589, B200-B299, B354-B355    TRUE
2:        2 B258, I230-I233, J201-J300, K353    TRUE
3:        3       C001-C005, C020-C023, C781   FALSE
0赞 Juan P FZ 10/5/2023 #3

这是我的代码:

    df <- data.frame(ICD_code = c("A581-A589, B200-B299, B354-B355", "B258, I230-I233, J201-J300, K353", "C001-C005, C020-C023, C781"),
                 category = c(1, 2, 3))

    df

首先,确定要扩展的不同范围

    library(tidyverse)
    
    
    exp_df = df |> 
      rowid_to_column(var = "original_row") |> 
      separate_longer_delim(cols = ICD_code,delim = ",") |> 
      mutate(ICD_code = str_trim(ICD_code),
             expand_check = str_detect(ICD_code,"-"),
             ICD_code_exp_code = ifelse(expand_check,
                                        str_extract(ICD_code,"[A-Z]"),
                                        "A"), 
             ICD_code_exp_min = ifelse(expand_check,
                                       str_extract(ICD_code,"\\d*(?=-)"),
                                       0),
             ICD_code_exp_max = ifelse(expand_check,
                                       str_extract(ICD_code,"(?<=-[A-Z])\\d*"),
                                       0)) |> 
      mutate(across(c(ICD_code_exp_min,ICD_code_exp_max),as.integer)) 

这将扩展范围并将它们存储在列中

    # Producing a expanded list of codes
    exp_df$ICD_code_exp = sapply(1:nrow(exp_df),
                                 FUN = function(x){
                                   paste(paste0(exp_df$ICD_code_exp_code[x],
                                                seq(exp_df$ICD_code_exp_min[x],
                                                    exp_df$ICD_code_exp_max[x],
                                                    1)),
                                         collapse = ",")
                                   }
        )
    
    # Replacing the ones that do not need to be expanded
    exp_df$ICD_code_exp[!exp_df$expand_check] = exp_df$ICD_code[!exp_df$expand_check]

通过计数出现的不同行来检测重叠。ICD_code

    overlaps = exp_df |>
      separate_longer_delim(ICD_code_exp,delim = ",") |> 
      mutate(n_dist_rows = n_distinct(original_row),.by = ICD_code_exp) |> 
      summarise(overlap = max(n_dist_rows)>1,
                .by = c(original_row))

输出将是

    df |> 
      rowid_to_column(var = "original_row") |>
      left_join(overlaps,by = join_by(original_row)) |
      select(-original_row)

      #                           ICD_code category overlap
      # 1  A581-A589, B200-B299, B354-B355        1    TRUE
      # 2 B258, I230-I233, J201-J300, K353        2    TRUE
      # 3       C001-C005, C020-C023, C781        3   FALSE