Adding Missing Rows in SQL using JOINS

Question:

I have this dataset in R:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

As we can see, some people in this dataset have rows with missing years (for example, John jumps from 2011 to 2014):

   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

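I have this code, which adds the missing rows by "interpolating" logical values for them (for example, age increases by 1, country_of_birth stays the same, etc.) and records whether each row is original or was added later: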

library(tidyverse)
library(dplyr)

# R Code to Convert into SQL
final = my_data %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

# A tibble: 16 x 6
# Groups:   name [3]
   name   year   age gender country_of_birth source      
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>       
 1 john   2010    21 male   australia        ORIGINAL    
 2 john   2011    22 male   australia        ORIGINAL    
 3 john   2012    23 male   australia        NOT ORIGINAL

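My question: I am trying to learn how to convert the above code into (Netezza) SQL code.

To get an idea of where to start, I thought I could use the "dbplyr" library in R to convert my "dplyr" code into SQL: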

library(dbplyr)

# attempt 1
remote_df = tbl_lazy(my_data, con = simulate_mysql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 2
remote_df = tbl_lazy(my_data, con = simulate_mssql())

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

# attempt 3

 con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

 remote_df <- copy_to(con, my_data)

remote_df %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))



# attempt 4

memdb_frame(my_data) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "MISSING", source))  %>% show_query()

However, all of these attempts give me the same error:

Error in `fill()`:
x `.data` does not have explicit order.
i Please use `arrange()` or `window_order()` to make determinstic.
Run `rlang::last_error()` to see where the error occurred.

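Can someone tell me what I am doing wrong, and what I can do to convert this R code into SQL code? I had thought that perhaps I could work out which person is missing which rows, create those rows, and then somehow use JOINs to bring them back into the original dataset in SQL.

Thanks!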

Tags: sql, r, join, data-manipulation

Answer:

library(tidyverse)
library(dplyr)

# Create a data frame with the sequence of years
years_df <- data.frame(year = seq(2010, 2023))

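# Merge with the original data. Note: because both data frames share a `year`
# column, merge(..., all = TRUE) is a full outer join on year, not a true cross join.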
final <- merge(my_data, years_df, all = TRUE) %>% 
    group_by(name) %>% 
    complete(year = first(year): last(year)) %>% 
    mutate(age = ifelse(is.na(age), first(age)+row_number()-1, age)) %>% 
    fill(c(gender, country_of_birth), .direction = "down") %>% 
    mutate(source = ifelse(is.na(source), "NOT ORIGINAL", source))

The above is untested!

SQL code:

CREATE TABLE mytable (
    name VARCHAR(255),
    year INTEGER,
    age INTEGER,
    gender VARCHAR(255),
    country_of_birth VARCHAR(255),
    source VARCHAR(255)
);

INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2010, 21, 'male', 'australia', 'ORIGINAL');
INSERT INTO mytable (name, year, age, gender, country_of_birth, source) VALUES ('john', 2011, 22, 'male', 'australia', 'ORIGINAL');
etc.

An example query with a join, which lists all the years and joins to the data where there is a match:

WITH RECURSIVE years (year) AS (
    SELECT 2010
    UNION ALL
    SELECT year + 1
    FROM years
    WHERE year < 2023
)
SELECT 
    t.name, years.year, t.age, t.gender, t.country_of_birth, t.source
FROM years
LEFT JOIN mytable AS t ON years.year = t.year
;

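On reflection, you don't need a cross join. Instead, you need a left join, which can also be written as left outer join (in SQL).

In R, you can perform a left join with the merge function. Here is an example: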

# Create two example data frames
df1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), y = c("d", "e", "f"))

# Perform a left join
left_join <- merge(df1, df2, by = "id", all.x = TRUE)

# View the result
left_join

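This creates two data frames, df1 and df2, and then uses the merge function to perform a left join on the id column. The all.x = TRUE argument specifies that all rows from df1 should be included in the result even when df2 has no matching row. The resulting data frame contains all rows from df1 and any matching rows from df2.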
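Applied to the data from the question, the same left join idea might look like the sketch below. It uses base R only. The names year_grid, filled and locf are made up for this illustration and do not come from the original post. It assumes each person's first year is an original row, and that age grows by exactly one per year, as in the question's dplyr code. The three steps (build a per-person list of years, left join the original rows onto it, fill the gaps) are roughly what a SQL version would also need.

```r
# Question's data, rebuilt so the sketch is self-contained
my_data <- data.frame(
  name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter"),
  year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015),
  age = c(21, 22, 25, 55, 60, 61, 62, 63, 65),
  gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male"),
  country_of_birth = c("australia", "australia", "australia", "uk", "uk",
                       "mexico", "mexico", "mexico", "mexico"),
  source = "ORIGINAL",
  stringsAsFactors = FALSE
)

# 1. One row per (name, year) between each person's first and last year
year_grid <- do.call(rbind, lapply(split(my_data, my_data$name), function(d) {
  data.frame(name = d$name[1], year = seq(min(d$year), max(d$year)),
             stringsAsFactors = FALSE)
}))

# 2. Left join: keep every grid row, attach original values where they exist
filled <- merge(year_grid, my_data, by = c("name", "year"), all.x = TRUE)
filled <- filled[order(filled$name, filled$year), ]
rownames(filled) <- NULL

# Rows that came only from the grid have NA in every non-key column
missing <- is.na(filled$source)

# 3a. Age: the person's first age plus the number of years elapsed since then
first_row <- match(filled$name, filled$name)  # index of each person's first row
filled$age[missing] <-
  (filled$age[first_row] + filled$year - filled$year[first_row])[missing]

# 3b. Fill gender and country downwards (last observation carried forward).
# This does not leak across people because each person's first row is original.
locf <- function(x) x[cummax(ifelse(is.na(x), 0L, seq_along(x)))]
filled$gender <- locf(filled$gender)
filled$country_of_birth <- locf(filled$country_of_birth)

# 3c. Flag the rows that were added
filled$source[missing] <- "NOT ORIGINAL"

filled
```

In SQL, step 2 would be the left join shown earlier, but joined on both name and year instead of year alone.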
