Asked by: stats_noob Asked: 10/27/2022 Last edited by: stats_noob Updated: 10/27/2022 Views: 81
Cumulative Differences Per Group in R/SQL
Q:
I have this dataset in R/SQL Server:
name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
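I am trying to add two columns:
- Column 1: for each person, the number of "missing years" between consecutive rows.
- Column 2: for each person, the cumulative sum of those "missing years".
For example, the first row for each person would be 0, and then: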
# note: in this specific example that I have created, "missing_years" is the same as "cumulative_missing_years"
name year missing_years cumulative_missing_years
1 john 2010 0 0
2 john 2011 0 0
3 john 2013 1 1
4 jack 2015 0 0
5 jack 2018 3 3
6 henry 2010 0 0
7 henry 2011 0 0
8 henry 2012 0 0
I think this can be done with a "grouped cumulative difference" and a "grouped cumulative sum":
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final = my_data %>%
group_by(name) %>%
arrange(year) %>%
mutate(missing_year) = year- lag(year, default = first(year)) %>%
mutate(cumulative_missing_years) = mutate( cumulative_missing_years = cumsum(cs))
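But I am not sure whether I am doing this correctly.
Ideally, I am looking for either an SQL approach or an R approach that can be used to interact with the dataset (e.g. via dbplyr).
Can someone please suggest a way to do this?
Thanks!
A:
2 votes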
G. Grothendieck
10/27/2022
#1
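Using the data in the Note at the end, perform a left self join that gives the next year for the same name, then subtract and take the cumulative sum.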
library(sqldf)
sqldf("select a.*,
coalesce(min(b.year) - a.year - 1, 0) as missing,
sum(coalesce(min(b.year) - a.year - 1, 0)) over
(partition by a.name order by a.year) as sum
from DF a
left join DF b on a.name = b.name and a.year < b.year
group by a.name, a.year
order by a.name, a.year")
giving:
name year missing sum
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 2 2
5 jack 2018 0 2
6 john 2010 0 0
7 john 2011 1 1
8 john 2013 0 1
Note
Lines <- "name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
"
DF <- read.table(text = Lines)
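A variant that is not part of the answer above: if the gap should be reported on the later row of each pair, as in the question's expected layout, a LAG window function can stand in for the self join. This is only a minimal sketch. It assumes the DBI and RSQLite packages are installed and that the bundled SQLite supports window functions (3.25 or later). LAG and SUM ... OVER also exist in SQL Server 2012 and later, but the statement below has only been worked through for SQLite. The names gaps and res_lag are made up for the illustration.

```r
library(DBI)

# Same data as in the Note, built directly as a data frame
DF <- data.frame(
  name = c("john", "john", "john", "jack", "jack", "henry", "henry", "henry"),
  year = c(2010L, 2011L, 2013L, 2015L, 2018L, 2010L, 2011L, 2012L),
  stringsAsFactors = FALSE
)

# Assumption: an in-memory SQLite database stands in for the real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "DF", DF)

res_lag <- dbGetQuery(con, "
  with gaps as (
    -- years skipped since the previous row of the same name; 0 on the first row
    select name, year,
           coalesce(year - lag(year) over (partition by name order by year) - 1, 0)
             as missing_years
    from DF
  )
  select name, year, missing_years,
         sum(missing_years) over (partition by name order by year)
           as cumulative_missing_years
  from gaps
  order by name, year")

dbDisconnect(con)
res_lag
```

With this layout the gap for jack lands on the 2018 row and equals 2, since only 2016 and 2017 are missing.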
Comments
0 votes
stats_noob
10/27/2022
@G. Grothendieck: Thank you so much! Would this code also work outside of the sqldf library? For example, with the DBI library, or in SQL Server itself?
0 votes
G. Grothendieck
10/27/2022
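There may be differences between databases, but I tried to keep it as generic as possible so that it is likely to work on other databases; you would have to try it to know for sure. There is nothing specific to sqldf in the SQL statement.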
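As a rough illustration of the point in this comment, the statement from the answer can be sent through DBI instead of sqldf. The sketch assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database, which is the same engine sqldf uses by default. The alias cum_missing and the object name res_dbi are chosen here for illustration, and nothing below has been checked against SQL Server.

```r
library(DBI)

# Same data as in the answer's Note
DF <- data.frame(
  name = c("john", "john", "john", "jack", "jack", "henry", "henry", "henry"),
  year = c(2010L, 2011L, 2013L, 2015L, 2018L, 2010L, 2011L, 2012L),
  stringsAsFactors = FALSE
)

# Assumption: in-memory SQLite; swap in another DBI backend to test portability
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "DF", DF)

# The answer's self-join statement, with the last column renamed to cum_missing
res_dbi <- dbGetQuery(con, "
  select a.*,
         coalesce(min(b.year) - a.year - 1, 0) as missing,
         sum(coalesce(min(b.year) - a.year - 1, 0)) over
           (partition by a.name order by a.year) as cum_missing
  from DF a
  left join DF b on a.name = b.name and a.year < b.year
  group by a.name, a.year
  order by a.name, a.year")

dbDisconnect(con)
res_dbi
```

As in the answer's output, the gap is attached to the earlier row of each pair, so jack's 2 appears on the 2015 row.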
1 vote
Ahad Zaman
10/27/2022
#2
I hope this helps:
library(dplyr)
name <- c(rep("John", 3), rep("jack", 2), rep("henry", 3))
year <- c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012)
dt <- data.frame(name = name, year = year)
# group by name, sort by year within each group, then compute the gaps
dt <- dt %>%
  group_by(name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    # years skipped since the previous row; 0 on each person's first row
    mis_yr = if_else(is.na(year - lag(year, n = 1L) - 1), 0,
                     year - lag(year, n = 1L) - 1),
    cum_yr = cumsum(mis_yr)
  ) %>%
  ungroup()
Here is the result:
name year mis_yr cum_yr
<chr> <dbl> <dbl> <dbl>
1 henry 2010 0 0
2 henry 2011 0 0
3 henry 2012 0 0
4 jack 2015 0 0
5 jack 2018 2 2
6 John 2010 0 0
7 John 2011 0 0
8 John 2013 1 1
Comments
mutate(missing_year) = year- lag(year, default = first(year)) is a clear mistake and/or parse error. FYI, I think year - lag(year, ...) can be replaced with c(year[1], diff(year)).
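To make the comment concrete, here is one way the pipeline from the question might be repaired. It is a sketch rather than the asker's or the commenter's exact intent, and it assumes dplyr is installed and that my_data has the name and year columns shown in the question. Note that c(year[1], diff(year)) would put the first year itself in the first position, so the sketch uses c(0, diff(year) - 1) to get 0 on each person's first row and the number of skipped years afterwards.

```r
library(dplyr)

# Data from the question
my_data <- data.frame(
  name = c("john", "john", "john", "jack", "jack", "henry", "henry", "henry"),
  year = c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012),
  stringsAsFactors = FALSE
)

final <- my_data %>%
  group_by(name) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    # both new columns are created inside a single mutate() call;
    # diff(year) - 1 counts the years skipped between consecutive rows
    missing_years = c(0, diff(year) - 1),
    cumulative_missing_years = cumsum(missing_years)
  ) %>%
  ungroup()

final
```

Because the columns are defined as name = value arguments inside mutate(), the parse error from the original attempt goes away.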