Asked by: stats_noob Asked: 3/8/2023 Updated: 3/8/2023 Views: 66
Adding Missing Rows in SQL using JOINS
Q:
I have this dataset in R:
name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"
my_data = data.frame(name, year, age, gender, country_of_birth, source)
As we can see, some people in this dataset have rows with missing years (e.g. John jumps from 2011 to 2014):
name year age gender country_of_birth source
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2014 25 male australia ORIGINAL
4 sarah 2010 55 female uk ORIGINAL
5 sarah 2015 60 female uk ORIGINAL
6 peter 2011 61 male mexico ORIGINAL
7 peter 2012 62 male mexico ORIGINAL
8 peter 2013 63 male mexico ORIGINAL
9 peter 2015 65 male mexico ORIGINAL
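I have this code, which adds the missing rows by "interpolating" logical values for them (e.g. age increases by 1, country_of_birth stays the same, etc.) and records whether each row is original or was added afterwards: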
library(tidyverse)
library(dplyr)
# R Code to Convert into SQL
final = my_data %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# A tibble: 16 x 6
# Groups: name [3]
name year age gender country_of_birth source
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 john 2010 21 male australia ORIGINAL
2 john 2011 22 male australia ORIGINAL
3 john 2012 23 male australia NOT ORIGINAL
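My question: I am trying to learn how to convert the above code into (Netezza) SQL code.
To get an idea of how to start, I thought I could use the "dbplyr" library in R to translate my "dplyr" code into SQL: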
library(dbplyr)
# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
# attempt 3
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
remote_df <- copy_to(con, my_data)
remote_df %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
# attempt 4
memdb_frame(my_data) %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "MISSING", source)) %>% show_query()
However, all of these attempts give me the same error:
Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.
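Can someone please tell me what I am doing wrong, and what I can do to convert this R code into SQL code? I had thought that maybe I could work out which person is missing which rows, create those rows, and then somehow use JOINS to bring them back into the original dataset in SQL.
Thanks!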
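A:
In SQL, you can do this with a cross join. I believe the equivalent in R is merge(). For the missing years, in SQL you will need a table or result set, but in R you should be able to use a sequence.
Combining a sequence of years with the merge() function: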
library(tidyverse)
library(dplyr)
# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))
# Perform a cross join with the original data
final <- merge(my_data, years_df, all = TRUE) %>%
group_by(name) %>%
complete(year = first(year): last(year)) %>%
mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>%
fill(c(gender, country_of_birth), .direction = "down") %>%
mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))
The above is untested!
SQL code:
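For reference, the cross join mentioned at the start of this answer can be sketched as follows. This is a minimal illustration only: it runs the SQL in SQLite through Python's `sqlite3` module rather than in Netezza, and the `people` and `years` tables are hypothetical helpers that do not appear in the question.

```python
import sqlite3

# In-memory SQLite database; a stand-in for the real database, not Netezza.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical helper tables: one row per person, one row per year.
CREATE TABLE people (name TEXT);
INSERT INTO people VALUES ('john'), ('sarah'), ('peter');
CREATE TABLE years (year INTEGER);
INSERT INTO years VALUES (2010), (2011), (2012), (2013), (2014), (2015);
""")

# A cross join pairs every person with every year (3 x 6 combinations).
grid = con.execute("""
SELECT p.name, y.year
FROM people AS p
CROSS JOIN years AS y
ORDER BY p.name, y.year
""").fetchall()
con.close()

for row in grid:
    print(row)
```

The grid contains every name and year combination, including years outside a person's observed range, so it would still need to be filtered and joined back to the data.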
CREATE TABLE mytable (
name VARCHAR(255),
year INTEGER,
age INTEGER,
gender VARCHAR(255),
country_of_birth VARCHAR(255),
source VARCHAR(255)
);
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
etc.
Example query with a join, which lists all years and joins to the data where there is a match:
WITH RECURSIVE years (year) AS (
SELECT 2010
UNION ALL
SELECT year + 1
FROM years
WHERE year < 2023
)
SELECT
t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM years
LEFT JOIN mytable AS t ON years.year = t.year
;
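On reconsidering this, you don't need a cross join. Instead, you need a left join, which can also be written as left outer join (in SQL).
In R, you can perform a left join with the merge function. Here is an example: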
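The example query above joins on year only, so it does not yet fill gaps per person. One way the left join idea could be extended to reproduce the R result is sketched below. It is a sketch under stated assumptions, not a tested Netezza solution: the SQL runs in SQLite through Python's `sqlite3` module, the `spans` CTE and the `fill_missing_years` helper are names introduced here for illustration, and it assumes that each person's gender and country_of_birth never change and that age rises by exactly one per year. Whether Netezza accepts `WITH RECURSIVE` is not confirmed here; a permanent table of years would serve the same purpose.

```python
import sqlite3

# Sample rows copied from the question's my_data.
ROWS = [
    ("john", 2010, 21, "male", "australia", "ORIGINAL"),
    ("john", 2011, 22, "male", "australia", "ORIGINAL"),
    ("john", 2014, 25, "male", "australia", "ORIGINAL"),
    ("sarah", 2010, 55, "female", "uk", "ORIGINAL"),
    ("sarah", 2015, 60, "female", "uk", "ORIGINAL"),
    ("peter", 2011, 61, "male", "mexico", "ORIGINAL"),
    ("peter", 2012, 62, "male", "mexico", "ORIGINAL"),
    ("peter", 2013, 63, "male", "mexico", "ORIGINAL"),
    ("peter", 2015, 65, "male", "mexico", "ORIGINAL"),
]

GAP_FILL_SQL = """
WITH RECURSIVE years (year) AS (
    SELECT 2010
    UNION ALL
    SELECT year + 1 FROM years WHERE year < 2023
),
spans AS (
    -- Hypothetical helper CTE: one row per person with the first and last
    -- observed year. MIN(age) is the age in the first year only because
    -- age is assumed to increase with year.
    SELECT name,
           MIN(year) AS first_year,
           MAX(year) AS last_year,
           MIN(age) AS first_age,
           MIN(gender) AS gender,
           MIN(country_of_birth) AS country_of_birth
    FROM mytable
    GROUP BY name
)
SELECT s.name,
       y.year,
       -- Keep the recorded age; otherwise count up from the first age.
       COALESCE(t.age, s.first_age + (y.year - s.first_year)) AS age,
       s.gender,
       s.country_of_birth,
       COALESCE(t.source, 'NOT ORIGINAL') AS source
FROM spans AS s
JOIN years AS y
  ON y.year BETWEEN s.first_year AND s.last_year
LEFT JOIN mytable AS t
  ON t.name = s.name AND t.year = y.year
ORDER BY s.name, y.year
"""


def fill_missing_years(rows):
    """Load rows into an in-memory table and return the gap-filled result."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable (name TEXT, year INTEGER, age INTEGER, "
        "gender TEXT, country_of_birth TEXT, source TEXT)"
    )
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(GAP_FILL_SQL).fetchall()
    con.close()
    return result


filled = fill_missing_years(ROWS)
for row in filled:
    print(row)
```

The range join restricts each person to the years between their first and last observation, and the left join back to `mytable` leaves NULLs on the added rows, which `COALESCE` then replaces.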
# Create two example data frames
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))
# Perform a left join
left_join <- merge(df1, df2, by = "id", all.x = TRUE)
# View the result
left_join
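This creates two data frames, df1 and df2, and then uses the merge function to perform a left join on the id column. The all.x = TRUE argument specifies that all rows from df1 should be included in the result even when df2 has no matching row. The resulting data frame contains all rows from df1 plus any matching rows from df2.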
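The same left join can be written in SQL. The sketch below mirrors the df1 and df2 example, again using SQLite through Python's `sqlite3` module as a stand-in database; treating the two data frames as tables is an assumption made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Tables mirroring the df1 and df2 data frames from the R example.
CREATE TABLE df1 (id INTEGER, x TEXT);
CREATE TABLE df2 (id INTEGER, y TEXT);
INSERT INTO df1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO df2 VALUES (2, 'd'), (3, 'e'), (4, 'f');
""")

# LEFT JOIN keeps every row of df1; y is NULL where df2 has no match.
left_joined = con.execute("""
SELECT df1.id, df1.x, df2.y
FROM df1
LEFT JOIN df2 ON df1.id = df2.id
ORDER BY df1.id
""").fetchall()
con.close()

print(left_joined)
```

The row with id 1 is kept with a NULL (Python `None`) in y, and id 4, which exists only in df2, is dropped.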