How do I create a line graph with 2 lines where x is year and y is proportion of total rather than raw count?

Asked by: twistedgiraff3 Asked: 11/17/2023 Last edited by: twistedgiraff3 Updated: 11/17/2023 Views: 56

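Q:

I am working with data on 2 prescription drugs: a generic and a name brand. They are essentially the same drug, except that one is cheaper. My data is broken down by several variables, including quarter, year, number of prescriptions in that time period, and drug name (as a character variable). My current code is below.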

alldrugs %>%
  filter(product_name == "DIMETHYL F" | product_name == "TECFIDERA ") %>%
  mutate(yr_q = yq(paste(year, quarter)), number_of_prescriptions = as.numeric(x = number_of_prescriptions)) %>%
  group_by(product_name, yr_q)%>%
  summarize(Prescription.count = sum(number_of_prescriptions)) %>%
  ggplot(aes(x = yr_q, y = Prescription.count)) + geom_line(aes(colour=product_name), size = 1.4) +
  xlab("2020-2022") + 
  ylab("Number of Prescriptions") + 
  labs(colour = "Generic vs Name Brand") +
  theme_bw()

What graph looks like

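The problem is that I need the y-axis to show not the number of prescriptions but the proportion of the total for that time period. That is, the Tecfidera line should represent (# of Tecfidera prescriptions in that time period) / (total of Tecfidera + dimethyl f).

Bonus points if you can help me do something similar but create a table of the proportion (i.e. Tecfidera / total of Tecfidera + dimethyl f) broken down by state and year. (I have a state variable.)

I was thinking of maybe something like the following, but I don't think it solves the problem:
mutate(dummy.product_name = case_when(product_name == 'tecfidera' ~ 1, product_name == "dimethyl f" ~ 0))

Or maybe a line graph is the wrong choice. I have been playing around with: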

ggplot(alldrugs, aes(x = year, y = number_of_prescriptions, fill = product_name)) +
  geom_bar(stat = "identity", width = .5, position = "dodge") +
  facet_grid(~year)

But this is really ugly, although I think it has potential.

Thank you very much for any help!

Here is the dput of a subset of what I believe is 25 observations:

structure(list(state = c("AK", "CA", "OR", "MN", "NY", "UT", "AK", "AK", "CA", "NJ", "AK", "NY", "AK", "AK", "AK", "AK", "AK", "AK", "SC", "NC", "NM", "AK", "AK", "AK", "CA"), year = c("2020", "2020", "2020", "2020", "2021", "2021", "2021", "2021", "2022", "2022", "2022", "2022", "2021", "2020", "2021", "2021", "2020", "2020", "2021", "2021", "2021", "2021", "2022", "2022", "2022" ), quarter = c("1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3"), product_name = c("GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "GILENYA ", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F", "DIMETHYL F"), number_of_prescriptions = c("10", "1", "4", "6", "1", "7", "2", "9", "3", "7", "6", "4", "3", "2", "4", "9", "8", "2", "1", "10", "6", "9", "10", "3", "8")), row.names = c(NA, 25L), class = "data.frame")

https://docs.google.com/spreadsheets/d/1vNfqdm5gPXL21PXjkPe512ZFcBzU-qUI7baQACRO7ag/edit#gid=0

Above is also a link to a spreadsheet of the data.

r ggplot2 dplyr tidyverse

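Comments

1 upvote neilfws 11/17/2023
Welcome to Stack Overflow. It is easier to help if you make this question reproducible by including a small representative dataset in plain-text format, for example the output of dput(alldrugs), if that is not too large.
0 upvotes twistedgiraff3 11/17/2023
Hi @neilfws, thanks for the quick reply. It is a large dataset with about 5,500 observations. Is there a way to limit it to just 20 observations to make it shorter?
0 upvotes neilfws 11/17/2023
Try dput(head(alldrugs, 20)). Alternatively, you can always link to an online file (e.g. a Google spreadsheet) for a larger dataset.
1 upvote twistedgiraff3 11/17/2023
@neilfws OK, I just posted a dput of the data along with a link to the spreadsheet! The full dataset has more variables, but I only selected the relevant ones in the dput.
0 upvotes Jon Spring 11/17/2023
Is yq from the dint package, or where is it from?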

A:

1 upvote Jon Spring 11/17/2023 #1
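library(dplyr)    # count(), mutate(); the .by argument needs dplyr >= 1.1.0
library(ggplot2)  # linewidth needs ggplot2 >= 3.4.0 (older versions use size)

alldrugs %>%
  # weighted count: total prescriptions per product and quarter
  count(product_name, yr_q = lubridate::yq(paste(year, quarter)),
        wt = as.numeric(number_of_prescriptions)) %>%
  # each product's share of the total for that quarter
  mutate(share = n / sum(n), .by = yr_q) %>%
  ggplot(aes(x = yr_q, y = share)) + 
  geom_line(aes(colour = product_name), linewidth = 1.4) +
  xlab("2020-2022") + 
  ylab("Share of Prescriptions") + 
  labs(colour = "Generic vs Name Brand") +
  theme_bw()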
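
To make the count() and mutate() steps concrete, here is a minimal sketch on a made-up toy data frame (the numbers are hypothetical, not taken from the question's data). It also filters to the two products first, which is an assumption based on the question's own filter, so that each share is one drug over the Tecfidera + dimethyl f total. The `.by` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data (made-up numbers); column names follow the question's dput
toy <- data.frame(
  product_name = c("DIMETHYL F", "TECFIDERA ", "DIMETHYL F", "TECFIDERA ", "GILENYA "),
  year    = c("2020", "2020", "2020", "2020", "2020"),
  quarter = c("1", "1", "2", "2", "1"),
  number_of_prescriptions = c("30", "10", "20", "20", "5")
)

shares <- toy %>%
  # keep only the generic and the name brand (names carry a trailing space in the data)
  filter(product_name %in% c("DIMETHYL F", "TECFIDERA ")) %>%
  # weighted count: sum of prescriptions per product and quarter, stored in n
  count(product_name, yr_q = lubridate::yq(paste(year, quarter)),
        wt = as.numeric(number_of_prescriptions)) %>%
  # divide by the quarter total, so the shares within each quarter sum to 1
  mutate(share = n / sum(n), .by = yr_q)

shares
```

Without the filter() step, any other product in the data (GILENYA in the toy rows) would be included in each quarter's denominator.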


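Comments

0 upvotes twistedgiraff3 11/17/2023
Sir, you are a genius, thank you. This looks like exactly what I needed, and thanks to neilfws for helping me make my data reproducible. Edit: oh, you changed the code slightly after I replied; I assume this version is a bit cleaner.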
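
The bonus part of the question (a table of the Tecfidera share by state and year) is not addressed in the thread. One possible sketch, again on made-up toy data and assuming the column names from the question's dput, is to sum the Tecfidera prescriptions and the combined total within each state and year, then divide. The names `rx`, `product`, `tecfidera`, `total`, and `tecfidera_share` are introduced here for illustration only.

```r
library(dplyr)

# Hypothetical toy data (made-up numbers); column names follow the question's dput
toy <- data.frame(
  state = c("AK", "AK", "AK", "CA", "CA"),
  year  = c("2020", "2020", "2021", "2020", "2020"),
  product_name = c("TECFIDERA ", "DIMETHYL F", "DIMETHYL F", "TECFIDERA ", "DIMETHYL F"),
  number_of_prescriptions = c("6", "2", "4", "1", "3")
)

state_year <- toy %>%
  mutate(product = trimws(product_name),            # drop the trailing space in the names
         rx = as.numeric(number_of_prescriptions)) %>%
  filter(product %in% c("TECFIDERA", "DIMETHYL F")) %>%
  summarize(tecfidera = sum(rx[product == "TECFIDERA"]),  # name-brand prescriptions
            total = sum(rx),                               # name brand + generic
            tecfidera_share = tecfidera / total,
            .by = c(state, year))                          # needs dplyr >= 1.1.0

state_year
```

A state and year with no Tecfidera rows gets a share of 0 rather than being dropped, because the sum over an empty selection is 0.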