(R) Bin a numeric column to count occurrences after group by

Asked by: ZainNST  Asked: 8/11/2023  Updated: 8/11/2023  Views: 40

Q:

Apologies if the title of the post is a bit confusing. Suppose I have the following data frame:

set.seed(123)
test <- data.frame("chr" = rep("chr1", 30), "position" = sample(1:50, 30, replace = FALSE),
         "info" = sample(c("X", "Y"), 30, replace = TRUE),
         "condition" = sample(c("soft", "stiff"), 30, replace = TRUE))

## head(test)
   chr position info condition
1 chr1       31    Y      soft
2 chr1       15    Y      soft
3 chr1       14    X      soft
4 chr1        3    X      soft
5 chr1       42    X     stiff
6 chr1       43    X     stiff

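I want to bin the position column, say with a bin size of 10. Then, depending on the condition (soft or stiff), I want to count the occurrences in the info column. So the data would look like this (not the actual result for the data above):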

   chr start end condition count_Y count_X
1 chr1     1  10      soft       2       3
2 chr1     1  10     stiff       0       2
3 chr1    11  20      soft       2       5
4 chr1    11  20      soft       1       2
5 chr1    21  30      soft       2       0
6 chr1    21  30     stiff       0       4

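For convenience, it might be best to create two data frames based on condition and then apply the binning and counting, but I'm stuck on this part. Any help is appreciated. Thank you very much.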

r dataframe binning

Comments

0 votes MrFlick 8/11/2023
The desired output doesn't seem to match the sample input. It would be good to test and make sure they match.

A:

3 votes stefan 8/11/2023 #1

Using cut, or even more simply integer division %/% for the binning (thanks to @MrFlick for the hint), together with dplyr::count and tidyr::pivot_wider, you could do:

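library(dplyr, warn.conflicts = FALSE)
library(tidyr)

test |>
  mutate(
    # Subtract 1 first so exact multiples of 10 stay in the lower bin (10 -> 1-10)
    bin = (position - 1) %/% 10 + 1,
    start = (bin - 1) * 10 + 1,
    end = bin * 10
  ) |>
  # One row per chr/bin/condition/info combination, with its count in n
  count(chr, start, end, condition, info) |>
  # Spread the info levels into count_X / count_Y; missing combinations become 0
  tidyr::pivot_wider(
    names_from = info,
    values_from = n,
    names_prefix = "count_",
    values_fill = 0
  )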
#> # A tibble: 9 × 6
#>   chr   start   end condition count_X count_Y
#>   <chr> <dbl> <dbl> <chr>       <int>   <int>
#> 1 chr1      1    10 soft            4       0
#> 2 chr1      1    10 stiff           2       1
#> 3 chr1     11    20 soft            3       3
#> 4 chr1     21    30 soft            1       1
#> 5 chr1     21    30 stiff           3       1
#> 6 chr1     31    40 soft            0       2
#> 7 chr1     31    40 stiff           2       1
#> 8 chr1     41    50 soft            0       1
#> 9 chr1     41    50 stiff           4       1
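
One thing worth checking with integer-division binning is where exact multiples of the bin size end up. The following is only a minimal sketch on a made-up vector (pos and size are illustrative names, not part of the question's data) comparing plain integer division, the shifted variant, and cut with right-closed intervals:

```r
# Hypothetical positions chosen to sit on bin boundaries (not the question's data)
pos <- c(1, 10, 11, 20, 50)
size <- 10

# Plain integer division: exact multiples of the bin size move up one bin
bin_plain <- pos %/% size + 1

# Shifting by one first keeps 10 in 1-10, 20 in 11-20, and so on
bin_shift <- (pos - 1) %/% size + 1

# cut() with its default right-closed intervals groups the same way as bin_shift
bin_cut <- cut(pos, breaks = seq(0, 50, by = size), labels = FALSE)

# Side-by-side comparison, with start/end derived from the shifted bin index
data.frame(pos, bin_plain, bin_shift, bin_cut,
           start = (bin_shift - 1) * size + 1,
           end = bin_shift * size)
```

If the bins are meant to be labelled 1-10, 11-20, and so on, the shifted variant is the one whose grouping agrees with those labels.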
1 vote jkatam 8/11/2023 #2

Alternatively, please check the following base R approach:

# Bin the "position" column with a bin size of 10
test$position_bin <- cut(test$position, breaks = seq(0, 50, by = 10), include.lowest = TRUE)

# Count occurrences of each "info" value per bin and condition, then reshape to wide.
# The native pipe |> (R >= 4.1) keeps this free of package dependencies.
count_result <- table(test$position_bin, test$condition, test$info) |>
  as.data.frame() |>
  setNames(c('position_bin','condition','info','Freq')) |>
  reshape(idvar = c('position_bin','condition'), timevar = 'info', v.names = 'Freq', direction = 'wide')

count_result

Created on 2023-08-10 with reprex v2.0.2

   position_bin condition Freq.X Freq.Y
1        [0,10]      soft      4      0
2       (10,20]      soft      3      3
3       (20,30]      soft      1      1
4       (30,40]      soft      0      2
5       (40,50]      soft      0      1
6        [0,10]     stiff      2      1
7       (10,20]     stiff      0      0
8       (20,30]     stiff      3      1
9       (30,40]     stiff      2      1
10      (40,50]     stiff      4      1
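
The table above labels each bin as an interval, whereas the question asked for separate start and end columns. As a rough sketch of one way to get that layout in base R (the names binned, count_X and count_Y are chosen here for illustration, and this is a variation rather than the answer's own method), cut can return the bin index directly and aggregate can sum indicator columns:

```r
# Rebuild the question's data so the sketch runs on its own
set.seed(123)
test <- data.frame(chr = rep("chr1", 30),
                   position = sample(1:50, 30, replace = FALSE),
                   info = sample(c("X", "Y"), 30, replace = TRUE),
                   condition = sample(c("soft", "stiff"), 30, replace = TRUE))

# Work on a copy; labels = FALSE makes cut() return the bin index (1, 2, ...)
binned <- test
bin <- cut(binned$position, breaks = seq(0, 50, by = 10), labels = FALSE)
binned$start <- (bin - 1) * 10 + 1
binned$end <- bin * 10

# 0/1 indicator columns, so that summing them counts each info value
binned$count_X <- as.integer(binned$info == "X")
binned$count_Y <- as.integer(binned$info == "Y")

# Sum the indicators within each chr/start/end/condition group
out <- aggregate(cbind(count_X, count_Y) ~ chr + start + end + condition,
                 data = binned, FUN = sum)
out <- out[order(out$start, out$condition), ]
out
```

Unlike the table-based version, aggregate only returns groups that actually occur in the data, so combinations with no rows are dropped rather than reported as zeros.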