将占位符行放在 data.frame 中的正确位置,填充非唯一标签列

Put placeholder rows in correct positions in a data.frame, filling in non-unique label columns

提问人:Hack-R 提问时间:8/2/2016 更新时间:8/3/2016 访问量:82

问:

我正在编写一个 ETL 脚本,有一次有几个提取的表,它们都应该具有相同的特定维度(16 行,9 列)。

有时,由于上游数据问题,一行会丢失,即它将是 15x9。我需要在正确的位置添加缺失的行,用静态的、非唯一的标签值填充前 2 列。

我尝试创建一个模板 data.frame 并用于将非缺失值连接到模板上,在缺少数据的地方留下带有适当标签和值的缺失行。但是,联接条件不明显这一事实导致数据形状发生变化。sqldfNAon

# ill-fated attempt, doubles the number of rows
require(sqldf)
sqldf("select a.label1, a.label2, b.data1,
        b.data2, b.data3, b.data4, b.data5, 
        b.data6, b.data7 from t a join d1 b
      on a.label1 = b.label1 and a.label2 = b.label2")

可重复的数据如下。 在上面的示例中,对应于下面的第 2 个 Data.Frame,对应于第 3 个。d1template

无问题数据示例(此处不缺少任何数据):

structure(list(label1 = c("BS", "Certificates", "MS", "PHD", 
"BS", "Certificates", "MS", "PHD", "BS", "Certificates", "MS", 
"PHD", "BS", "Certificates", "MS", "PHD"), label2 = c("INTL", 
"INTL", "INTL", "INTL", "US", "US", "US", "US", "INTL", "INTL", 
"INTL", "INTL", "US", "US", "US", "US"), data1 = c("fake data", 
"fake data", "fake data", "fake data", "fake data", "fake data", 
"fake data", "fake data", "fake data", "fake data", "fake data", 
"fake data", "fake data", "fake data", "fake data", "fake data"
), data2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16), data3 = c("09B", "09B", "09B", "09B", "09B", "09B", "09B", 
"09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B"
), data4 = c(887L, 220L, 3633L, 1491L, 3331L, 379L, 9521L, 3785L, 
624L, 188L, 1842L, 863L, 3119L, 428L, 7868L, 3138L), data5 = c(531L, 
62L, 1114L, 625L, 2454L, 286L, 6298L, 2512L, 521L, 161L, 1344L, 
649L, 2795L, 397L, 6984L, 2651L), data6 = c(35L, 12L, 389L, 253L, 
468L, 140L, 3143L, 1244L, 36L, 7L, 337L, 167L, 621L, 135L, 3116L, 
1222L), data7 = c(27L, 10L, 313L, 214L, 404L, 128L, 2853L, 1111L, 
22L, 5L, 224L, 123L, 459L, 103L, 2650L, 984L)), .Names = c("label1", 
"label2", "data1", "data2", "data3", "data4", "data5", "data6", 
"data7"), row.names = c(NA, -16L), class = "data.frame")

有问题的数据示例(缺少 2 行):

structure(list(label1 = c("BS", "Certificates", "MS", "PHD", 
"BS", "MS", "PHD", "BS", "Certificates", "MS", "PHD", "Certificates", 
"MS", "PHD"), label2 = c("INTL", "INTL", "INTL", "INTL", "US", 
"US", "US", "INTL", "INTL", "INTL", "INTL", "US", "US", "US"), 
    data1 = c("fake data", "fake data", "fake data", "fake data", 
    "fake data", "fake data", "fake data", "fake data", "fake data", 
    "fake data", "fake data", "fake data", "fake data", "fake data"
    ), data2 = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15, 
    16), data3 = c("09B", "09B", "09B", "09B", "09B", "09B", 
    "09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B"), 
    data4 = c(887L, 220L, 3633L, 1491L, 3331L, 9521L, 3785L, 
    624L, 188L, 1842L, 863L, 428L, 7868L, 3138L), data5 = c(531L, 
    62L, 1114L, 625L, 2454L, 6298L, 2512L, 521L, 161L, 1344L, 
    649L, 397L, 6984L, 2651L), data6 = c(35L, 12L, 389L, 253L, 
    468L, 3143L, 1244L, 36L, 7L, 337L, 167L, 135L, 3116L, 1222L
    ), data7 = c(27L, 10L, 313L, 214L, 404L, 2853L, 1111L, 22L, 
    5L, 224L, 123L, 103L, 2650L, 984L)), .Names = c("label1", 
"label2", "data1", "data2", "data3", "data4", "data5", "data6", 
"data7"), row.names = c(NA, -14L), class = "data.frame")

模板示例:

structure(list(label1 = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("BS", "Certificates", 
"MS", "PHD"), class = "factor"), label2 = structure(c(1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("INTL", 
"US"), class = "factor"), data1 = c(NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA), data2 = c(NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), data3 = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
    data4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA), data5 = c(NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), data6 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), data7 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
    )), .Names = c("label1", "label2", "data1", "data2", "data3", 
"data4", "data5", "data6", "data7"), row.names = c(NA, -16L), class = "data.frame")
r

评论


答:

1赞 G. Grothendieck 8/3/2016 #1

假设 和 是模板和缺少行的第二个示例。然后添加计数列并执行左联接。td2

tc <- transform(t, count = ave(1:nrow(t), label1, label2, FUN = seq_along))
d2c <- transform(d2, count = ave(1:nrow(d2), label1, label2, FUN = seq_along))

library(sqldf)
sqldf("select label1, label2, 
              b.data1, b.data2, b.data3, b.data4, b.data5, b.data6, b.data7
       from tc left join d2c b using (label1, label2, count)")

给:

         label1 label2     data1 data2 data3 data4 data5 data6 data7
1            BS   INTL fake data     1   09B   887   531    35    27
2  Certificates   INTL fake data     2   09B   220    62    12    10
3            MS   INTL fake data     3   09B  3633  1114   389   313
4           PHD   INTL fake data     4   09B  1491   625   253   214
5            BS     US fake data     5   09B  3331  2454   468   404
6  Certificates     US fake data    14   09B   428   397   135   103
7            MS     US fake data     7   09B  9521  6298  3143  2853
8           PHD     US fake data     8   09B  3785  2512  1244  1111
9            BS   INTL fake data     9   09B   624   521    36    22
10 Certificates   INTL fake data    10   09B   188   161     7     5
11           MS   INTL fake data    11   09B  1842  1344   337   224
12          PHD   INTL fake data    12   09B   863   649   167   123
13           BS     US      <NA>    NA  <NA>    NA    NA    NA    NA
14 Certificates     US      <NA>    NA  <NA>    NA    NA    NA    NA
15           MS     US fake data    15   09B  7868  6984  3116  2650
16          PHD     US fake data    16   09B  3138  2651  1222   984

如果数据字段太多,无法方便地列出,则可以将 sqldf 语句替换为这些语句。首先,它创建一个字符串,其中包含所有数据字段名称(即除前两个名称外的所有名称),用逗号分隔,然后用于将其替换为 SQL 语句。有关详细信息,请参阅并搜索 sqldf 主页 https://github.com/ggrothendieck/sqldf 以查找示例。dataNamesfn$?fnfn$

dataNames <- toString(paste("b", tail(names(t), -2), sep = "."))
fn$sqldf("select label1, label2, $dataNames
       from tc left join d2c b using (label1, label2, count)")

评论

0赞 Hack-R 8/3/2016
我直到明天才能测试这个,但考虑到你的输出和代表,我将假设这是正确的,并非常感谢你!