Asked by: rmc  Asked: 11/15/2023  Updated: 11/15/2023  Views: 33
Neat visualization of 3 categorical variables (up to even 5!)
Q:
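A neat way to visualize 3 categorical variables (with more than 10 levels each) is to draw stacked bars showing the (weighted) proportion of the levels of var1 for every combination of var2 and var3. You end up with a grid whose number of cells equals length(levels(var2)) x length(levels(var3)), and with as many colors as length(levels(var1)).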
Let's call these variables fct1, fct2 and fct3. A simple solution looks like this:
library(tidyverse)

data <- tibble(
  a = c(5, 6, 7, 12, 5, 6, 7),
  fct1 = paste0('type', c("a", "b", "c", "d", "a", "b", "c")),
  fct2 = paste0('lvl', c(1, 1, 1, 1, 2, 2, 2)),
  fct3 = paste0('system', c(1, 2, 2, 2, 1, 2, 2))
) %>%
  # Multiply the levels of fct2 and fct3 to get a large grid
  crossing(fct2_suffix = 0:4, fct3_suffix = 0:9) %>%
  mutate(
    fct2 = paste0(fct2, fct2_suffix),
    fct3 = paste0(fct3, fct3_suffix)
  ) %>%
  select(-c(fct2_suffix, fct3_suffix)) %>%
  # Expand to one row per observation
  uncount(a)

data %>%
  ggplot() +
  geom_bar(aes(y = 0, fill = fct1), position = "fill") +
  facet_grid(fct3 ~ fct2)
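For reference, the quantities this plot displays can be computed directly. The following is a minimal sketch on a small, made-up table (`toy`, `props`, `n_cells` and `n_colors` are illustrative names, not part of the question's code); it shows the weighted proportion of fct1 within each fct2 x fct3 cell, plus the grid size and number of colors described above.

```r
library(dplyr)

# Hypothetical toy data: one row per fct1/fct2/fct3 combination with a weight
toy <- tibble::tibble(
  fct1 = c("typea", "typeb", "typea", "typec"),
  fct2 = c("lvl1", "lvl1", "lvl2", "lvl2"),
  fct3 = c("system1", "system1", "system1", "system1"),
  weight = c(5, 6, 5, 7)
)

props <- toy %>%
  count(fct3, fct2, fct1, wt = weight) %>%  # weighted count per combination
  group_by(fct3, fct2) %>%                  # one group per grid cell
  mutate(prop = n / sum(n)) %>%             # share of each fct1 level in the cell
  ungroup()

n_cells  <- n_distinct(toy$fct2) * n_distinct(toy$fct3)  # cells in the grid
n_colors <- n_distinct(toy$fct1)                         # fill colors needed
```

Within each cell the `prop` values sum to 1, which is what position = "fill" draws.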
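However, with many levels, faceting is slow. I would like to produce a chart like this while avoiding facets entirely (which would also leave them free for a potential 4th and 5th categorical variable).
I would like to have a geom_*() function so that it is more flexible, but I don't know where to start.
Ideally, it would look like this: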
# geom_col_grid() does not exist yet; this is the interface I am after
data %>%
  ggplot() +
  geom_col_grid(aes(x = fct1, y = fct2, fill = fct3), position = "fill") # +
  # facet_grid(fct4 ~ fct5) # potentially 4th and 5th var
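One possible starting point, sketched here under assumptions rather than offered as a finished solution, is a custom ggproto Stat paired with the existing GeomRect. The names `StatColGrid` and `geom_col_grid`, and the `padding` parameter, are hypothetical; the sketch counts rows only (no weights) and relies on ggplot2 handing the stat discrete x and y as integer positions while fill still holds the original values.

```r
library(ggplot2)

# Hypothetical stat: within each (x, y) cell, stack the shares of `fill`
# from left to right and return rectangle corners for GeomRect.
StatColGrid <- ggproto("StatColGrid", Stat,
  required_aes = c("x", "y", "fill"),
  compute_panel = function(data, scales, padding = 0.1) {
    # Discrete x/y arrive as integer positions; fill is still unmapped
    d <- data.frame(x = as.numeric(data$x), y = as.numeric(data$y),
                    fill = data$fill, n = 1)
    counts <- aggregate(n ~ x + y + fill, data = d, FUN = sum)
    counts <- counts[order(counts$x, counts$y, counts$fill), ]
    cell <- interaction(counts$x, counts$y, drop = TRUE)
    share <- counts$n / ave(counts$n, cell, FUN = sum)  # proportion in cell
    right <- ave(share, cell, FUN = cumsum)             # cumulative proportion
    width <- 1 - padding                                # drawn size of a cell
    data.frame(
      xmin = counts$x - width / 2 + (right - share) * width,
      xmax = counts$x - width / 2 + right * width,
      ymin = counts$y - width / 2,
      ymax = counts$y + width / 2,
      fill = counts$fill,
      group = seq_along(share)
    )
  }
)

# Hypothetical layer constructor wrapping the stat above
geom_col_grid <- function(mapping = NULL, data = NULL, ..., padding = 0.1,
                          na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
  layer(
    stat = StatColGrid, geom = GeomRect, data = data, mapping = mapping,
    position = "identity", show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(padding = padding, na.rm = na.rm, ...)
  )
}

# Small made-up example: cell (a, r) is 2/3 "p" and 1/3 "q"
toy <- data.frame(x = c("a", "a", "a", "b"), y = "r", fill = c("p", "p", "q", "p"))
p <- ggplot(toy) + geom_col_grid(aes(x = x, y = y, fill = fill))
```

Because the shares are computed per panel, facet_grid() stays available for a 4th and 5th variable. Supporting weights would need an extra aesthetic summed in place of the constant n = 1.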
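I wrote a function that manually computes where each bar starts and ends and then passes that to geom_rect(). This works, it is just not as flexible as a geom. Here is the code (note that there is also a padding argument that defines the distance between the bars). var4 and var5 are used for faceting (they can be left empty).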
plot_crosstab <- function(data, var1, var2, var3, var4, var5, padding = 0.1) {
  # Fall back to equal weights when the data has no 'weight' column
  if (!("weight" %in% names(data))) {
    data <- data %>% mutate(weight = 1)
    cli::cli_alert_info("No variable 'weight' in data: assumed equal weights")
  }
  # Without facet variables, use a constant placeholder column
  if (missing(var4)) {
    var4 <- quo(var4)
    data <- data %>% mutate(var4 = "total")
  }
  if (missing(var5)) {
    var5 <- quo(var5)
    data <- data %>% mutate(var5 = "total")
  }

  build_data <- data %>%
    mutate(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}), as.factor)) %>%
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}, {{var1}}))) %>%
    # Weighted count per combination; the result stays grouped by cell
    summarise(n = sum(weight, na.rm = TRUE), .groups = "drop_last") %>%
    # Share of var1 within each cell, scaled so that it spans the right amount
    mutate(frac = n / sum(n, na.rm = TRUE) * (1 - padding)) %>%
    arrange(desc(frac)) %>%
    ungroup() %>%
    complete({{var5}}, {{var4}}, {{var3}}, {{var2}}, fill = list(n = 0, frac = 1 - padding)) %>%
    group_by(across(c({{var5}}, {{var4}}, {{var3}}, {{var2}}))) %>%
    # Add the horizontal gap once per cell
    mutate(v_padding = if_else(row_number() == 1, padding, 0)) %>%
    group_by(across(c({{var5}}, {{var4}}, {{var3}}))) %>%
    mutate(
      pos_left = -0.5 - (padding / 2) + lag(cumsum(frac), default = 0) + cumsum(v_padding),
      pos_right = -0.5 - (padding / 2) + cumsum(frac) + cumsum(v_padding)
    ) %>%
    ungroup() %>%
    mutate(
      pos_low = as.numeric(factor({{var3}})) + (padding / 2),
      pos_high = pos_low + (1 - padding)
    )

  var2_levels <- levels(build_data %>% pull({{var2}}))
  var3_levels <- levels(build_data %>% pull({{var3}}))

  build_data %>%
    ggplot() +
    geom_rect(aes(xmin = pos_left, xmax = pos_right, ymin = pos_low, ymax = pos_high, fill = {{var1}})) +
    facet_grid(rows = vars({{var4}}), cols = vars({{var5}})) +
    # Cells are centered at 0, 1, ... on x and at 1.5, 2.5, ... on y
    scale_x_continuous(breaks = seq_along(var2_levels) - 1, labels = var2_levels) +
    scale_y_continuous(breaks = seq_along(var3_levels) + 0.5, labels = var3_levels)
}
# Arguments map as: var1 = fill, var2 = columns (x), var3 = rows (y)
data %>%
  plot_crosstab(fct2, fct1, fct3)
This looks similar to the original chart but is much faster. However, it is not integrated into the ggplot2 workflow.
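To make the position arithmetic in plot_crosstab() easier to follow, here is a small worked sketch in base R for a single row of the grid. The shares are made up for illustration: the first cell holds two bars (60% and 40%) and the second cell holds one bar.

```r
padding <- 0.1

# Bar widths: shares within each cell, shrunk to leave room for the gap
frac <- c(0.6, 0.4, 1.0) * (1 - padding)
# The gap is added once, at the first bar of each cell
v_padding <- c(padding, 0, padding)

# Same formulas as pos_left / pos_right in plot_crosstab();
# c(0, head(cumsum(frac), -1)) plays the role of lag(cumsum(frac), default = 0)
pos_left  <- -0.5 - padding / 2 + c(0, head(cumsum(frac), -1)) + cumsum(v_padding)
pos_right <- -0.5 - padding / 2 + cumsum(frac) + cumsum(v_padding)

print(rbind(pos_left, pos_right))
```

The first cell runs from -0.45 to 0.45 and the second from 0.55 to 1.45, so the cells are centered at 0 and 1 with a gap of padding between them. This is why the x-axis breaks in the function start at 0.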
A: No answers yet