Adding Missing Rows in SQL using JOINS

Asked by: stats_noob  Asked: 3/8/2023  Updated: 3/8/2023  Views: 66

Q:

I have this dataset in R:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

As we can see, some people in this dataset are missing rows for certain years (e.g., John jumps from 2011 to 2014):

   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

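I have this code, which adds the missing rows by "interpolating" logical values for them (e.g., age increases by 1, country_of_birth stays the same) and records whether each row is original or was added afterwards: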

library(tidyverse)
library(dplyr)

# R Code to Convert into SQL
final = my_data %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

# A tibble: 16 x 6
# Groups:   name [3]
   name   year   age gender country_of_birth source      
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>       
 1 john   2010    21 male   australia        ORIGINAL    
 2 john   2011    22 male   australia        ORIGINAL    
 3 john   2012    23 male   australia        NOT ORIGINAL

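My question: I am trying to learn how to convert the above code into (Netezza) SQL.

To get an idea of how to start, I thought I could use the "dbplyr" library in R to translate my "dplyr" code into SQL: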

library(dbplyr)

# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 3

 con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

 remote_df <- copy_to(con, my_data)

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))



# attempt 4

memdb_frame(my_data) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

However, all of these attempts gave me the same error:

Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.

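Can someone tell me what I am doing wrong, and what I can do to convert this R code into SQL? I had thought that maybe I could work out which person is missing which rows, create those rows, and then somehow use JOINS to bring them back into the original dataset in SQL.

Thanks!

Tags: sql, r, join, data-manipulation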


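A:

2 upvotes  Adrian Maxwell  3/8/2023  #1

In SQL you can do this via a cross join. I believe the equivalent in R is merge(). For the missing years, in SQL you will need a table or a result set that supplies them, but in R you should be able to use a sequence.

Combining a sequence of years with the merge() function: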

library(tidyverse)
library(dplyr)

# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))

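# Intended as a cross join with the original data. Note: both data frames
# share a `year` column, so merge() joins on it, and all = TRUE makes this
# a full outer join on year rather than a cross join.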
final <- merge(my_data, years_df, all = TRUE) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

The above is untested!

SQL code:

CREATE TABLE mytable (
    name VARCHAR(255),
    year INTEGER,
    age INTEGER,
    gender VARCHAR(255),
    country_of_birth VARCHAR(255),
    source VARCHAR(255)
);

INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
etc.

An example query with a join, which lists all the years and joins to the data where there is a match:

WITH RECURSIVE years (year) AS (
    SELECT 2010
    UNION ALL
    SELECT year + 1
    FROM years
    WHERE year < 2023
)
SELECT 
    t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM years
LEFT JOIN mytable AS t ON years.year = t.year
;
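
The query above joins on year alone, so it does not yet fill the gaps per person. A minimal sketch of one way to do that follows, run through Python's built-in sqlite3 module so it can be executed anywhere. SQLite stands in for Netezza here, which is an assumption: whether Netezza accepts WITH RECURSIVE is not confirmed in this thread, and a pre-built table of years may be needed instead. The CTE names bounds, limits and years and the helper fill_missing_years are hypothetical. The age rule (first age plus years elapsed) mirrors the R code in the question.

```python
import sqlite3

# The nine rows from the question; source is set to 'ORIGINAL' on insert.
ROWS = [
    ("john", 2010, 21, "male", "australia"),
    ("john", 2011, 22, "male", "australia"),
    ("john", 2014, 25, "male", "australia"),
    ("sarah", 2010, 55, "female", "uk"),
    ("sarah", 2015, 60, "female", "uk"),
    ("peter", 2011, 61, "male", "mexico"),
    ("peter", 2012, 62, "male", "mexico"),
    ("peter", 2013, 63, "male", "mexico"),
    ("peter", 2015, 65, "male", "mexico"),
]

QUERY = """
WITH RECURSIVE
bounds AS (
    -- One row per person: first/last year plus the values to carry forward.
    -- Assumes age rises with year (so MIN(age) is the age in the first year)
    -- and that gender and country_of_birth are constant per person.
    SELECT name,
           MIN(year) AS first_year,
           MAX(year) AS last_year,
           MIN(age)  AS first_age,
           MAX(gender) AS gender,
           MAX(country_of_birth) AS country_of_birth
    FROM mytable
    GROUP BY name
),
limits AS (
    SELECT MIN(first_year) AS lo, MAX(last_year) AS hi FROM bounds
),
years (year, hi) AS (
    -- Every year between the overall first and last year.
    SELECT lo, hi FROM limits
    UNION ALL
    SELECT year + 1, hi FROM years WHERE year < hi
)
SELECT b.name,
       y.year,
       COALESCE(t.age, b.first_age + (y.year - b.first_year)) AS age,
       b.gender,
       b.country_of_birth,
       COALESCE(t.source, 'NOT ORIGINAL') AS source
FROM bounds AS b
JOIN years AS y
  ON y.year BETWEEN b.first_year AND b.last_year   -- each person's own range
LEFT JOIN mytable AS t
  ON t.name = b.name AND t.year = y.year           -- NULLs mark added rows
ORDER BY b.name, y.year
"""


def fill_missing_years(rows=ROWS):
    """Load the rows into an in-memory SQLite table and run the gap-fill query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable (name TEXT, year INTEGER, age INTEGER, "
        "gender TEXT, country_of_birth TEXT, source TEXT)"
    )
    con.executemany(
        "INSERT INTO mytable VALUES (?, ?, ?, ?, ?, 'ORIGINAL')", rows
    )
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in fill_missing_years():
        print(row)
```

For the nine rows in the question this returns 16 rows, the same size as the tibble produced by the R code. Joining the year list to each person's own range first, and only then left joining to the data, is what keeps a year outside a person's range from being added.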

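On reconsidering this, you don't need a cross join. Instead, you need a left join, which can also be written as left outer join (in SQL).

In R, you can perform a left join with the merge function. Here is an example: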

# Create two example data frames
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))

# Perform a left join
left_join <- merge(df1, df2, by = "id", all.x = TRUE)

# View the result
left_join

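This creates two data frames, df1 and df2, and then uses the merge function to perform a left join on the id column. The all.x = TRUE argument specifies that every row of df1 should appear in the result even when df2 has no matching row. The resulting data frame contains all rows from df1 plus any matching rows from df2.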
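
For comparison, the same left join can be sketched in SQL. The example below runs it through Python's built-in sqlite3 module as a stand-in engine (an assumption, since the question targets Netezza), with df1 and df2 recreated as tables of the same names. The helper left_join_demo is hypothetical.

```python
import sqlite3


def left_join_demo():
    """Recreate df1 and df2 as tables and left join them on id."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE df1 (id INTEGER, x TEXT)")
    con.execute("CREATE TABLE df2 (id INTEGER, y TEXT)")
    con.executemany("INSERT INTO df1 VALUES (?, ?)",
                    [(1, "a"), (2, "b"), (3, "c")])
    con.executemany("INSERT INTO df2 VALUES (?, ?)",
                    [(2, "d"), (3, "e"), (4, "f")])
    # Every row of df1 is kept; y is NULL where df2 has no matching id.
    rows = con.execute(
        """
        SELECT df1.id, df1.x, df2.y
        FROM df1
        LEFT JOIN df2 ON df1.id = df2.id
        ORDER BY df1.id
        """
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    print(left_join_demo())
```

The row with id 1 comes back with y set to None (SQL NULL), and id 4 is dropped because it exists only in df2.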

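Comments

0 upvotes  stats_noob  3/8/2023
@Paul Maxwell: Thank you so much for your answer! Could you show me how to convert your code into SQL? Thanks a lot!
0 upvotes  Adrian Maxwell  3/8/2023
I'm not familiar with Netezza, but the above is fairly "generic" and will hopefully work as is.
0 upvotes  Adrian Maxwell  3/8/2023
Hmm, please ignore the CROSS JOIN because it isn't correct, sorry. You need a slightly different approach. Give me a moment.
0 upvotes  stats_noob  3/8/2023
@Paul Maxwell: Thank you so much for the update! If you have time, do you think you could continue with the SQL code?
0 upvotes  Adrian Maxwell  3/8/2023
The SQL is already there (I removed the cross join; it is now a left join).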