Neat visualization of 3 categorical variables (up to even 5!)

Asked by: rmc · Asked: 11/15/2023 · Updated: 11/15/2023 · Views: 33

Q:

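A neat way to visualize 3 categorical variables (each with more than 10 levels) is to draw stacked bar charts showing the (weighted) proportions of the levels of var1 for each combination of var2 and var3. You get a grid with length(levels(var2)) x length(levels(var3)) cells and as many colors as length(levels(var1)).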
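As a minimal sketch of what each cell of that grid contains, the weighted proportions can be computed directly with dplyr. The toy data and the names toy, props, n_cells and n_colors are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data: three categorical variables plus a weight column
toy <- tibble(
  var1 = c("a", "b", "a", "b", "b"),
  var2 = c("x1", "x1", "x2", "x2", "x2"),
  var3 = c("y1", "y1", "y1", "y2", "y2"),
  weight = c(2, 1, 1, 3, 1)
)

# Weighted share of each var1 level within every observed (var2, var3) cell
props <- toy %>%
  count(var2, var3, var1, wt = weight, name = "n") %>%
  group_by(var2, var3) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Grid size and number of fill colors
n_cells  <- n_distinct(toy$var2) * n_distinct(toy$var3)
n_colors <- n_distinct(toy$var1)
```

Here the grid has 2 x 2 cells, one of which (x1, y2) has no observations and therefore no rows in props.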

Let's call these variables fct1, fct2 and fct3. A simple solution looks like this:

library(tidyverse)  # tibble, tidyr (crossing, uncount), dplyr, ggplot2

data <- tibble(
  a = c(5, 6, 7, 12, 5, 6, 7),
  fct1 = paste0('type',c("a","b","c","d", "a","b","c")),
  fct2 = paste0('lvl',c(1,1,1,1,2,2,2)),
  fct3 = paste0('system', c(1,2,2,2,1,2,2)),
) %>% 
  crossing(fct2_suffix = 0:4, fct3_suffix = 0:9) %>% 
  mutate(
    fct2 = paste0(fct2, fct2_suffix),
    fct3 = paste0(fct3, fct3_suffix)
  ) %>% 
  select(-c(fct2_suffix, fct3_suffix)) %>% 
  uncount(a)

data %>% 
  ggplot() +
  geom_bar(aes(y = 0, fill=fct1), position = "fill") +
  facet_grid(fct3~fct2)

[Image: grid of bar charts using facets]

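However, faceting is slow when there are many levels. I would like to produce this kind of chart while avoiding facets entirely (which would also leave faceting available for a potential 4th and 5th categorical variable).

I imagine a geom_*() function would make this more flexible, but I don't know where to start.

Ideally, it would look like this: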

data %>% 
  ggplot() +
  geom_col_grid(aes(x=fct1, y=fct2, fill=fct3), position = "fill")# +
  #facet_grid(fct4~fct5) #potentially 4th and 5th var 

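I wrote a function that manually computes where each bar starts and ends and then passes the result to geom_rect(). This works, but it is not as flexible as a geom. Here is the code (note the padding argument, which sets the distance between bars). var4 and var5 are used for faceting and can be left out.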

plot_crosstab <- function(data, var1, var2, var3, var4, var5, padding = 0.1){
  
  if(!("weight" %in% names(data))){
    data <- data %>% mutate(weight = 1) 
    cli::cli_alert_info("No variable 'weight' in data: assumed equal weights")
  }
  if(missing(var4)) {
    var4 <- quo(var4)
    data <- data %>% mutate(var4 = "total") 
  }
  if(missing(var5)){
    var5 <- quo(var5)
    data <- data %>% mutate(var5 = "total") 
  } 

  build_data = 
    data %>%
    mutate(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}), as.factor)) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}))) %>% 
    summarise(
      n = sum(weight, na.rm = TRUE),
      .groups = "drop_last" # stay grouped by var5..var2 for the fractions below
    ) %>% 
    mutate(
      frac = n/sum(n, na.rm = T)*(1-padding) #so that it spans the right amount
    ) %>% 
    arrange(desc(frac)) %>% 
    ungroup() %>% 
    complete({{var5}}, {{var4}}, {{var3}}, {{var2}}, fill = list(n=0, frac=1-padding)) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}))) %>%
    mutate(
      v_padding = if_else(row_number()==1, padding, 0)
    ) %>% 
    group_by(across(c({{var5}}, {{var4}}, {{var3}}))) %>% 
    mutate(
      pos_left = -0.5 -(padding/2) + lag(cumsum(frac), default = 0) + cumsum(v_padding),
      pos_right = -0.5 -(padding/2) + cumsum(frac) + cumsum(v_padding)
    ) %>% 
    ungroup() %>% 
    mutate(
      pos_low = as.numeric(factor({{var3}})) + (padding/2),
      pos_high = pos_low + (1-padding)
    )

  build_data %>% 
    ggplot() +
    geom_rect(aes(xmin = pos_left, xmax = pos_right, ymin = pos_low, ymax = pos_high, fill = {{var1}})) +
    facet_grid(rows = vars({{var4}}), cols = vars({{var5}})) +
    scale_x_continuous(breaks = 1:length(levels(build_data %>% pull({{var2}})))-1, labels = levels(build_data %>% pull({{var2}}))) +
    scale_y_continuous(breaks = 1:(length(levels(build_data %>% pull({{var3}}))))+0.5, labels = levels(build_data %>% pull({{var3}})))
}
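To make the position arithmetic in plot_crosstab() easier to follow, here is a small worked sketch of the pos_left / pos_right formulas for a single row of the grid. The three segments (two in the first cell, one in the second) and the first_in_cell vector are invented for illustration; the formulas are copied from the function above, with lag() written out in base R.

```r
padding <- 0.1

# Shares within each cell, scaled so that a full cell spans 1 - padding
frac <- c(0.6, 0.4, 1.0) * (1 - padding)

# TRUE for the first segment of each cell; that is where the gap is added
first_in_cell <- c(TRUE, FALSE, TRUE)
v_padding <- ifelse(first_in_cell, padding, 0)

# Same arithmetic as in plot_crosstab(); c(0, head(...)) plays the role of lag()
pos_left  <- -0.5 - padding / 2 + c(0, head(cumsum(frac), -1)) + cumsum(v_padding)
pos_right <- -0.5 - padding / 2 + cumsum(frac) + cumsum(v_padding)

pos_left   # -0.45  0.09  0.55
pos_right  #  0.09  0.45  1.45
```

The first cell spans -0.45 to 0.45 and the second 0.55 to 1.45, so cell k is centered on k - 1 with width 1 - padding. That is why the x-axis breaks in the function start at 0.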

data %>% 
  plot_crosstab(fct2, fct1, fct3)

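This looks similar to the original chart but is much faster. However, it is not integrated into the ggplot2 workflow.

[Image: grid of bar charts without facets]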
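One possible starting point for a geom-style interface is a custom stat built with ggproto() that does the same rectangle arithmetic and hands the result to the existing rect geom. The following is only a rough sketch under assumptions: StatColGrid and geom_col_grid() are hypothetical names that do not exist in ggplot2, x and y are assumed to be discrete, and the normalization to 100% per cell happens inside the stat rather than through position = "fill".

```r
library(ggplot2)
library(dplyr)

# Hypothetical stat (not part of ggplot2): within every (x, y) cell, lay out
# the weighted shares of `fill` side by side as one horizontal 100% bar.
StatColGrid <- ggproto(
  "StatColGrid", Stat,
  required_aes = c("x", "y", "fill"),
  compute_panel = function(data, scales, padding = 0.1) {
    # `weight` is optional; assume equal weights when it is not mapped
    if (is.null(data$weight)) data$weight <- 1
    side <- 1 - padding  # width and height of one cell

    data %>%
      # discrete positions arrive already mapped to 1, 2, 3, ...
      mutate(x = as.numeric(x), y = as.numeric(y)) %>%
      count(x, y, fill, wt = weight, name = "n") %>%
      group_by(x, y) %>%
      mutate(
        frac = n / sum(n) * side,
        xmax = x - side / 2 + cumsum(frac),
        xmin = xmax - frac,
        ymin = y - side / 2,
        ymax = y + side / 2
      ) %>%
      ungroup() %>%
      mutate(group = 1L) %>%
      as.data.frame()
  }
)

# Hypothetical layer function wrapping the stat with the rect geom
geom_col_grid <- function(mapping = NULL, data = NULL, position = "identity",
                          ..., padding = 0.1, na.rm = FALSE,
                          show.legend = NA, inherit.aes = TRUE) {
  layer(
    stat = StatColGrid, geom = "rect", data = data, mapping = mapping,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(padding = padding, na.rm = na.rm, ...)
  )
}

# Toy data, made up for illustration
toy <- data.frame(
  fct1 = c("a", "a", "b", "b", "a", "b"),
  fct2 = c("l1", "l1", "l1", "l2", "l2", "l2"),
  fct3 = c("s1", "s2", "s2", "s1", "s1", "s1")
)

p <- ggplot(toy) +
  geom_col_grid(aes(x = fct2, y = fct3, fill = fct1))
```

Because the layout is computed per panel, adding facet_grid() for a 4th and 5th variable should still work with this approach, although the sketch has not been tuned for speed or for empty cells.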

Tags: r ggplot2 categorical-data geom-bar facet-grid


A: No answers yet