How to bin a continuous variable while keeping its numeric feature in R?

Asked by: Marco  Asked: 2/22/2023  Updated: 2/22/2023  Views: 43

Q:

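I would like to bin a continuous variable while keeping it numeric. There are several options for discretizing or categorizing a continuous numeric variable into a factor, for example: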

data(mtcars)

library(tidyverse)

mtcars <- mtcars %>% mutate(mpg_binned = cut_width(mpg, 2, closed = "right", boundary = 10))
as_tibble(mtcars %>% select(mpg, mpg_binned))

# A tibble: 32 × 2
     mpg mpg_binned
   <dbl> <fct>     
 1  21   (20,22]   
 2  21   (20,22]   
 3  22.8 (22,24]   
 4  21.4 (20,22]   
 5  18.7 (18,20]   
 6  18.1 (18,20]   
 7  14.3 (14,16]   
 8  24.4 (24,26]   
 9  22.8 (22,24]   
10  19.2 (18,20]   
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

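But I would like to keep using the numbers for various plots and calculations. So I want to convert each original value to the center of its interval. The first observation stays 21, because that is the midpoint of (20,22]. Rounding does not work, because the value 14.3 in row 7 should become 15 (the midpoint of (14,16]).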
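To illustrate why plain rounding falls short here, a small base R sketch. The vector simply repeats two of the mpg values from the output above.

```r
# Two of the mpg values from the output above
mpg <- c(21, 14.3)

# Plain rounding goes to the nearest integer, not to the bin center
rounded <- round(mpg)
rounded
# Wanted instead: 21 and 15, the centers of (20,22] and (14,16]
```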

Tags: r tidyverse data-manipulation


A:

2 votes  Miff 2/22/2023 #1

You can split the mpg_binned column into its numeric parts and take their mean, like this:

mtcars$mid <- sapply(stringr::str_extract_all(mtcars$mpg_binned,"[0-9]+"), 
                     function(x){mean(as.numeric(x))})
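One caveat: the pattern "[0-9]+" only picks up runs of digits, so it would split a bound such as 0.5 into two numbers and drop a minus sign. Below is a minimal sketch of a variant that tolerates decimal and negative bounds. It uses base R's regmatches() and gregexpr() rather than stringr, and the labels vector is made up for illustration. Labels printed in scientific notation are not handled.

```r
# Hypothetical interval labels, including decimal and negative bounds
labels <- c("(20,22]", "(14,16]", "(-0.5,0.5]", "[1.5,2.5]")

# Optional minus sign, digits, optional decimal part
num_pattern <- "-?[0-9]+\\.?[0-9]*"

# Pull both bounds out of every label, then average them
bounds <- regmatches(labels, gregexpr(num_pattern, labels))
mid <- vapply(bounds, function(x) mean(as.numeric(x)), numeric(1))
mid
```

The same pattern should also work as a drop-in replacement inside stringr::str_extract_all().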

Comments

0 votes  Marco 2/22/2023
I was looking for a more direct tidy approach, but this workaround does the job.
1 vote  Darren Tsai 2/22/2023
@Marco This approach can be rewritten in {tidyverse} style: mtcars %>% mutate(mid = map_dbl(str_extract_all(mpg_binned,"[0-9]+"), ~ mean(as.numeric(.x))))
2 votes  Darren Tsai 2/22/2023 #2

You can extract the lower and upper bounds from mpg_binned with tidyr::extract() and average them.

library(tidyverse)

mtcars %>%
  extract(mpg_binned, c("low", "up"), "(\\d+),(\\d+)", remove = FALSE, convert = TRUE) %>%
  mutate(mid = (low + up) / 2)

# # A tibble: 32 × 4
#    mpg_binned   low    up   mid
#    <fct>      <int> <int> <dbl>
#  1 (20,22]       20    22    21
#  2 (20,22]       20    22    21
#  3 (22,24]       22    24    23
#  4 (20,22]       20    22    21
#  5 (18,20]       18    20    19
#  6 (18,20]       18    20    19
# # … with 26 more rows
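If the bin width and boundary are known up front, the midpoint can also be computed arithmetically, without parsing the factor labels at all. This is only a sketch and not part of either answer: bin_mid() is a hypothetical helper, and width = 2 and boundary = 10 are assumed to match the cut_width() call in the question. Values that sit exactly on a break go to the bin on their left (right-closed intervals), and edge handling at the lowest break may differ from cut_width().

```r
# Assumed binning parameters, matching
# cut_width(mpg, 2, closed = "right", boundary = 10)
width <- 2
boundary <- 10

# Hypothetical helper: midpoint of the right-closed bin that contains x
bin_mid <- function(x, width, boundary) {
  # Lower edge of the bin (lower, lower + width] that x falls into
  lower <- boundary + width * (ceiling((x - boundary) / width) - 1)
  lower + width / 2
}

# A few mpg values from the question
mpg <- c(21, 22.8, 21.4, 18.7, 14.3)
mids <- bin_mid(mpg, width, boundary)
mids
```

Inside a pipeline this could be used as mutate(mid = bin_mid(mpg, 2, 10)), which keeps the result numeric from the start.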