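How to replace an entire row between two rows based on a column

Question:

I am working with a very large mRNA splicing dataset. Here is a toy dataset that illustrates the problem: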

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

> test_df
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  18       A          <NA>
4    19  24       A    Downstream
5    13  16       A         Event
6    20  24       B          <NA>
7    25  30       B      Upstream
8    35  38       B    Downstream
9    39  45       B          <NA>

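For each unique value in the gene_id column, I would like to replace the entire row that sits between the "Upstream" and "Downstream" values in the exon_identity column, i.e. overwrite row 3 with row 5. What is stumping me is that some genes in the gene_id column have no row that needs replacing, e.g. "B" in gene_id.

This question is along the lines of questions asked previously here and here.

Based on these and other resources, I tried: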

library(tidyverse)

test_replace <- test_df %>% 
  group_by(gene_id) %>% 
  mutate(start = replace(start, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), start[exon_identity == "Event"]),
         end = replace(end, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), end[exon_identity == "Event"]),
         exon_identity = replace(exon_identity, row_number() > which(exon_identity == "Upstream") & row_number() < which(exon_identity == "Downstream"), "Event")
         )


Warning message:
There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `start = replace(...)`.
ℹ In group 1: `gene_id = "A"`.
Caused by warning in `x[list] <- values`:
! number of items to replace is not a multiple of replacement length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 
> 
> test_replace
# A tibble: 9 × 4
# Groups:   gene_id [2]
  start   end gene_id exon_identity
  <dbl> <dbl> <chr>   <chr>        
1     2     8 A       NA           
2     9    12 A       Upstream     
3    NA    NA A       Event        
4    19    24 A       Downstream   
5    13    16 A       Event        
6    20    24 B       NA           
7    25    30 B       Upstream     
8    35    38 B       Downstream   
9    39    45 B       NA
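
The warning and the NA values in row 3 are consistent with how R subsets with a logical vector that contains NA: `exon_identity == "Event"` is NA wherever the label is missing, and each NA in the index yields an NA element instead of being dropped. A minimal base-R sketch using the gene A values from the toy data (the names `picked` and `picked_ok` are illustrative only):

```r
# Gene A rows from the toy data
start <- c(2, 9, 13, 19, 13)
exon_identity <- c(NA, "Upstream", NA, "Downstream", "Event")

# Comparing against NA gives NA, not FALSE
is_event <- exon_identity == "Event"

# Logical subsetting keeps one NA element for every NA in the index
picked <- start[is_event]

# which() keeps only the positions that are TRUE, so the NAs disappear
picked_ok <- start[which(is_event)]

print(picked)
print(picked_ok)
```

Here `picked` has three elements for a single slot to fill, which is what triggers "number of items to replace is not a multiple of replacement length". Wrapping the comparison in `which()` avoids the extra NA slots.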

Desired output:


> desired_outcome 
  start end gene_id exon_identity
1     2   8       A          <NA>
2     9  12       A      Upstream
3    13  16       A         Event
4    19  24       A    Downstream
5    20  24       B          <NA>
6    25  30       B      Upstream
7    35  38       B    Downstream
8    39  45       B          <NA>

A solution, preferably using tidyverse packages, would be greatly appreciated.

Thanks!

Tags: r, dplyr, data-manipulation, data-cleaning

Answer:

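library(dplyr)
library(tidyr)

# Called once per gene by group_modify(); `x` holds that gene's rows.
replace_rows <- function(x, ...) {
  # A row is "bad" when it sits directly between an Upstream and a Downstream row.
  # replace_na() turns the NA comparisons (missing labels, group edges) into FALSE.
  is_bad <- replace_na(
    lag(x$exon_identity) == "Upstream" & lead(x$exon_identity) == "Downstream",
    FALSE
  )
  if (any(is_bad)) {
    is_event <- replace_na(x$exon_identity == "Event", FALSE)
    # Drop the bad row and the Event row, then re-insert the Event row
    # at the position the bad row occupied.
    x <- x %>%
      filter(!is_bad, !is_event) %>%
      add_row(
        filter(x, is_event),
        .before = which(is_bad)
      )
  }
  # Genes without a bad row are returned unchanged.
  x
}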

test_df %>% 
  group_by(gene_id) %>% 
  group_modify(replace_rows) %>%
  ungroup()

# A tibble: 8 × 4
  gene_id start   end exon_identity
  <chr>   <dbl> <dbl> <chr>        
1 A           2     8 <NA>         
2 A           9    12 Upstream     
3 A          13    16 Event        
4 A          19    24 Downstream   
5 B          20    24 <NA>         
6 B          25    30 Upstream     
7 B          35    38 Downstream   
8 B          39    45 <NA>
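
For comparison, here is a possible alternative that avoids `group_modify()`. It is a sketch under one assumption not stated in the question: apart from the misplaced Event row, the rows within each gene are ordered by `start`, so sorting by `start` puts the Event row where the dropped row was. The helper columns `is_between` and `has_event` are names chosen for illustration.

```r
library(dplyr)

test_df <- data.frame(
  start = c(2, 9, 13, 19, 13, 20, 25, 35, 39),
  end = c(8, 12, 18, 24, 16, 24, 30, 38, 45),
  gene_id = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  exon_identity = c(NA, "Upstream", NA, "Downstream", "Event", NA, "Upstream", "Downstream", NA)
)

result <- test_df %>%
  group_by(gene_id) %>%
  mutate(
    # TRUE for rows strictly between the first Upstream and first Downstream row;
    # coalesce() maps NA (a gene missing either label) to FALSE
    is_between = coalesce(
      row_number() > match("Upstream", exon_identity) &
        row_number() < match("Downstream", exon_identity),
      FALSE
    ),
    # %in% never returns NA, so missing labels are harmless here
    has_event = "Event" %in% exon_identity
  ) %>%
  # Drop the in-between row only for genes that have an Event row to take its place
  filter(!(is_between & has_event)) %>%
  # Assumption: sorting by start moves the Event row into the vacated position
  arrange(start, .by_group = TRUE) %>%
  ungroup() %>%
  select(-is_between, -has_event)

print(result)
```

Unlike the `add_row()` version, this does not depend on there being exactly one row to replace per gene, but it does reorder rows, so it is only suitable when coordinate order is the order you want.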
