提问人:Hack-R 提问时间:8/2/2016 更新时间:8/3/2016 访问量:82
将占位符行放在 data.frame 中的正确位置,填充非唯一标签列
Put placeholder rows in correct positions in a data.frame, filling in non-unique label columns
问:
我正在编写一个 ETL 脚本,有一次有几个提取的表,它们都应该具有相同的特定维度(16 行,9 列)。
有时,由于上游数据问题,一行会丢失,即它将是 15x9。我需要在正确的位置添加缺失的行,用静态的、非唯一的标签值填充前 2 列。
我尝试创建一个模板 data.frame 并用于将非缺失值连接到模板上,在缺少数据的地方留下带有适当标签和值的缺失行。但是,联接条件不明显这一事实导致数据形状发生变化。sqldf
NA
on
# ill-fated attempt, doubles the number of rows
require(sqldf)
sqldf("select a.label1, a.label2, b.data1,
b.data2, b.data3, b.data4, b.data5,
b.data6, b.data7 from t a join d1 b
on a.label1 = b.label1 and a.label2 = b.label2")
可重复的数据如下。 在上面的示例中,对应于下面的第 2 个 Data.Frame,对应于第 3 个。d1
template
无问题数据示例(此处不缺少任何数据):
structure(list(label1 = c("BS", "Certificates", "MS", "PHD",
"BS", "Certificates", "MS", "PHD", "BS", "Certificates", "MS",
"PHD", "BS", "Certificates", "MS", "PHD"), label2 = c("INTL",
"INTL", "INTL", "INTL", "US", "US", "US", "US", "INTL", "INTL",
"INTL", "INTL", "US", "US", "US", "US"), data1 = c("fake data",
"fake data", "fake data", "fake data", "fake data", "fake data",
"fake data", "fake data", "fake data", "fake data", "fake data",
"fake data", "fake data", "fake data", "fake data", "fake data"
), data2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16), data3 = c("09B", "09B", "09B", "09B", "09B", "09B", "09B",
"09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B"
), data4 = c(887L, 220L, 3633L, 1491L, 3331L, 379L, 9521L, 3785L,
624L, 188L, 1842L, 863L, 3119L, 428L, 7868L, 3138L), data5 = c(531L,
62L, 1114L, 625L, 2454L, 286L, 6298L, 2512L, 521L, 161L, 1344L,
649L, 2795L, 397L, 6984L, 2651L), data6 = c(35L, 12L, 389L, 253L,
468L, 140L, 3143L, 1244L, 36L, 7L, 337L, 167L, 621L, 135L, 3116L,
1222L), data7 = c(27L, 10L, 313L, 214L, 404L, 128L, 2853L, 1111L,
22L, 5L, 224L, 123L, 459L, 103L, 2650L, 984L)), .Names = c("label1",
"label2", "data1", "data2", "data3", "data4", "data5", "data6",
"data7"), row.names = c(NA, -16L), class = "data.frame")
有问题的数据示例(缺少 2 行):
structure(list(label1 = c("BS", "Certificates", "MS", "PHD",
"BS", "MS", "PHD", "BS", "Certificates", "MS", "PHD", "Certificates",
"MS", "PHD"), label2 = c("INTL", "INTL", "INTL", "INTL", "US",
"US", "US", "INTL", "INTL", "INTL", "INTL", "US", "US", "US"),
data1 = c("fake data", "fake data", "fake data", "fake data",
"fake data", "fake data", "fake data", "fake data", "fake data",
"fake data", "fake data", "fake data", "fake data", "fake data"
), data2 = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15,
16), data3 = c("09B", "09B", "09B", "09B", "09B", "09B",
"09B", "09B", "09B", "09B", "09B", "09B", "09B", "09B"),
data4 = c(887L, 220L, 3633L, 1491L, 3331L, 9521L, 3785L,
624L, 188L, 1842L, 863L, 428L, 7868L, 3138L), data5 = c(531L,
62L, 1114L, 625L, 2454L, 6298L, 2512L, 521L, 161L, 1344L,
649L, 397L, 6984L, 2651L), data6 = c(35L, 12L, 389L, 253L,
468L, 3143L, 1244L, 36L, 7L, 337L, 167L, 135L, 3116L, 1222L
), data7 = c(27L, 10L, 313L, 214L, 404L, 2853L, 1111L, 22L,
5L, 224L, 123L, 103L, 2650L, 984L)), .Names = c("label1",
"label2", "data1", "data2", "data3", "data4", "data5", "data6",
"data7"), row.names = c(NA, -14L), class = "data.frame")
模板示例:
structure(list(label1 = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("BS", "Certificates",
"MS", "PHD"), class = "factor"), label2 = structure(c(1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("INTL",
"US"), class = "factor"), data1 = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA), data2 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), data3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
data4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), data5 = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), data6 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), data7 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), .Names = c("label1", "label2", "data1", "data2", "data3",
"data4", "data5", "data6", "data7"), row.names = c(NA, -16L), class = "data.frame")
答:
1赞
G. Grothendieck
8/3/2016
#1
假设 和 是模板和缺少行的第二个示例。然后添加计数列并执行左联接。t
d2
tc <- transform(t, count = ave(1:nrow(t), label1, label2, FUN = seq_along))
d2c <- transform(d2, count = ave(1:nrow(d2), label1, label2, FUN = seq_along))
library(sqldf)
sqldf("select label1, label2,
b.data1, b.data2, b.data3, b.data4, b.data5, b.data6, b.data7
from tc left join d2c b using (label1, label2, count)")
给:
label1 label2 data1 data2 data3 data4 data5 data6 data7
1 BS INTL fake data 1 09B 887 531 35 27
2 Certificates INTL fake data 2 09B 220 62 12 10
3 MS INTL fake data 3 09B 3633 1114 389 313
4 PHD INTL fake data 4 09B 1491 625 253 214
5 BS US fake data 5 09B 3331 2454 468 404
6 Certificates US fake data 14 09B 428 397 135 103
7 MS US fake data 7 09B 9521 6298 3143 2853
8 PHD US fake data 8 09B 3785 2512 1244 1111
9 BS INTL fake data 9 09B 624 521 36 22
10 Certificates INTL fake data 10 09B 188 161 7 5
11 MS INTL fake data 11 09B 1842 1344 337 224
12 PHD INTL fake data 12 09B 863 649 167 123
13 BS US <NA> NA <NA> NA NA NA NA
14 Certificates US <NA> NA <NA> NA NA NA NA
15 MS US fake data 15 09B 7868 6984 3116 2650
16 PHD US fake data 16 09B 3138 2651 1222 984
如果数据字段太多,无法方便地列出,则可以将 sqldf 语句替换为这些语句。首先,它创建一个字符串,其中包含所有数据字段名称(即除前两个名称外的所有名称),用逗号分隔,然后用于将其替换为 SQL 语句。有关详细信息,请参阅并搜索 sqldf 主页 https://github.com/ggrothendieck/sqldf 以查找示例。dataNames
fn$
?fn
fn$
dataNames <- toString(paste("b", tail(names(t), -2), sep = "."))
fn$sqldf("select label1, label2, $dataNames
from tc left join d2c b using (label1, label2, count)")
评论
0赞
Hack-R
8/3/2016
我直到明天才能测试这个,但考虑到你的输出和代表,我将假设这是正确的,并非常感谢你!
评论