Asked by: sidpat · Asked: 4/25/2014 · Last edited by: Henrik · Updated: 12/31/2021 · Views: 4552
Shifting non-NA cells to the left
Q:
My dataset has many NAs, and I need to shift all the non-NA cells to the left (row-wise).
Example - my data frame:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
df
x y z
1 l <NA> u
2 m b <NA>
3 <NA> c w
4 <NA> <NA> x
5 p <NA> y
I want the data frame above to be converted to the following:
x y z
1 l u NA
2 m b NA
3 c w NA
4 x <NA> NA
5 p y NA
Please help.
Thanks.
A:
You can use the standard apply function:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
df2 = as.data.frame(t(apply(df,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
colnames(df2) = colnames(df)
> df
x y z
1 l <NA> u
2 m b <NA>
3 <NA> c w
4 <NA> <NA> x
5 p <NA> y
> df2
x y z
1 l u <NA>
2 m b <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
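To see what the anonymous function does, and why the column names are set again afterwards, it can help to run the same expression on a single row. This is a minimal sketch; the vector row and the variable names keep and shifted are illustrative only and are not part of the answer above.

```r
# One row of the example data, as apply() would pass it to the function
row <- c(x = NA, y = "c", z = "w")

keep <- !is.na(row)                  # TRUE for the cells to move left
shifted <- c(row[keep], row[!keep])  # non-NA cells first, then the NA cells

print(unname(shifted))  # the values in their new order
print(names(shifted))   # the names travel with the values
```

Because the element names move together with the values, the names of the shifted rows no longer follow the original column order, which is presumably why the answer assigns colnames(df2) explicitly.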
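If you don't get a shorter answer, this should help:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
# Convert every column to character; the result has to be assigned back,
# otherwise the conversion has no effect on df
df[] <- lapply(df, as.character)
for(i in 1:nrow(df)){
  # column indices of row i: non-NA cells first, then the NA cells
  sub <- df[i,c(which(!is.na(df[i,])),which(is.na(df[i,])))]
  colnames(sub) <- colnames(df)
  df[i,] <- sub
}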
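Thanks to @Richard Scriven for the good observation.
A) With is.na and order, using lapply and rbind for aggregation:
nosort.df <- do.call(rbind, lapply(1:nrow(df), function(x) {
  z <- df[x, ][order(is.na(df[x, ]))]  # non-NA cells first, original order kept
  colnames(z) <- c("x", "y", "z")
  return(z)
}))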
> nosort.df
x y z
1 l u <NA>
2 m b <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
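B) If the values within each row also need to be sorted:
With sort, using lapply and rbind:
sort.df <- do.call(rbind, lapply(1:nrow(df), function(x) {
  z <- sort(df[x, ], na.last = TRUE)  # sorts the row's values and puts NA last
  colnames(z) <- c("x", "y", "z")
  return(z)
}))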
> sort.df
x y z
1 l u <NA>
2 b m <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
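If you do not want to use VBA, you can try the following steps in Excel.
1. Select your dataset.
2. Replace NA with empty cells.
3. Press F5 (Go To), choose Special, select Blanks, and click OK.
4. Right-click any of the selected cells and choose Delete (shift cells left).
I hope this helps.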
Another answer with shorter syntax:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
x y z
[1,] "l" NA "u"
[2,] "m" "b" NA
[3,] NA "c" "w"
[4,] NA NA "x"
[5,] "p" NA "y"
sorted.df <- as.data.frame(t(apply(df, 1, function(x) x[order(is.na(x))])))
[,1] [,2] [,3]
[1,] "l" "u" NA
[2,] "m" "b" NA
[3,] "c" "w" NA
[4,] "x" NA NA
[5,] "p" "y" NA
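The core of this answer is the expression x[order(is.na(x))]. A small sketch on a single vector shows why it works: is.na(x) is FALSE for the cells to keep and TRUE for the NA cells, FALSE sorts before TRUE, and order() leaves tied elements in their original order, so the non-NA values keep their relative positions. The vector x below is only an illustration.

```r
x <- c(NA, "c", "w")

na_flag <- is.na(x)     # TRUE FALSE FALSE
idx <- order(na_flag)   # positions of the FALSE entries first: 2 3 1
shifted <- x[idx]       # "c" "w" NA

print(idx)
print(shifted)
```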
Here we can also use the pmap function from the purrr package to great advantage:
library(dplyr)
library(purrr)
df %>%
  pmap(., ~ c(c(...)[!is.na(c(...))], c(...)[is.na(c(...))])) %>%
  exec(rbind, !!!.) %>%
  as_tibble()
# A tibble: 5 x 3
x z y
<chr> <chr> <chr>
1 l u NA
2 m b NA
3 c w NA
4 x NA NA
5 p y NA
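I have included a function for this task in my package dedupewider (available on CRAN). It allows moving NA to the right, to the left, or even to the top and bottom: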
library(dedupewider)
df <- data.frame(x = c("l", "m", NA, NA, "p"),
y = c(NA, "b", "c", NA, NA),
z = c("u", NA, "w", "x", "y"))
na_move(df) # 'right' direction is by default
#> x y z
#> 1 l u NA
#> 2 m b NA
#> 3 c w NA
#> 4 x <NA> NA
#> 5 p y NA
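It implements the solution of reshaping the data (from wide format to long, and then back to wide) and uses data.table functions internally. It is therefore considerably faster than the standard solution using apply: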
library(dedupewider)
library(microbenchmark)
df <- data.frame(x = c("l", "m", NA, NA, "p"),
y = c(NA, "b", "c", NA, NA),
z = c("u", NA, "w", "x", "y"))
df <- do.call(rbind, replicate(10000, df, simplify = FALSE))
apply_function <- function(df) {
as.data.frame(t(apply(df, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
}
microbenchmark(apply_function(df), na_move(df))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> apply_function(df) 289.2032 361.0178 475.65281 425.79355 545.6405 999.4086
#> na_move(df) 51.0419 58.1426 75.32407 65.01445 92.8706 216.6384
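Since this question was asked, there have been many duplicate questions. I collected (and improved) some of the more idiomatic answers and benchmarked them against my own Rcpp implementation.
For simplicity, I compared functions that take a character matrix as input and return one as output, rather than a data frame containing only character variables. You can always coerce from one to the other with as.matrix and as.data.frame (see the example at the bottom).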
Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void shift_na_in_place(CharacterMatrix x)
{
  int m = x.nrow();
  int n = x.ncol();
  for (int i = 0, k = 0, k0 = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      if (x[k] != NA_STRING) {
        x[k0] = x[k];
        k0 += m;
      }
      k += m;
    }
    while (k0 < k) {
      x[k0] = NA_STRING;
      k0 += m;
    }
    k = (k % m) + 1;
    k0 = k;
  }
  if (x.attr("dimnames") != R_NilValue) {
    List dn = x.attr("dimnames");
    dn[1] = R_NilValue;
    if (dn.attr("names") != R_NilValue) {
      CharacterVector ndn = dn.attr("names");
      ndn[1] = "";
    }
  }
}

// [[Rcpp::export]]
CharacterMatrix shift_na(CharacterMatrix x)
{
  CharacterMatrix y = clone(x);
  shift_na_in_place(y);
  return y;
}
')
f1 <- function(x) {
  t(apply(x, 1L, function(y) {r <- is.na(y); c(y[!r], y[r])}))
}
f2 <- function(x) {
  t(apply(x, 1L, function(y) y[order(is.na(y), method = "radix")]))
}
f3 <- function(x) {
  d <- dim(x)
  dn <- dimnames(x)
  matrix(x[order(row(x), is.na(x), method = "radix")],
         nrow = d[1L], ncol = d[2L], byrow = TRUE,
         dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
f4 <- function(x) {
  d <- dim(x)
  dn <- dimnames(x)
  matrix(x[order(is.na(x) + (row(x) - 1L) * 2L + 1L, method = "radix")],
         nrow = d[1L], ncol = d[2L], byrow = TRUE,
         dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
set.seed(1L)
m <- 1e+05L
n <- 10L
x <- sample(c(letters, NA), size = m * n, replace = TRUE, prob = c(rep(1, 26), 13))
dim(x) <- c(m, n)
microbenchmark::microbenchmark(shift_na(x), f1(x), f2(x), f3(x), f4(x), check = "identical")
Unit: milliseconds
expr min lq mean median uq max neval
shift_na(x) 10.04959 10.32019 10.82935 10.41968 10.60104 22.69412 100
f1(x) 141.95959 150.83875 180.49025 167.01266 211.52478 248.07587 100
f2(x) 722.27211 759.75710 780.69368 773.26920 797.01253 857.07905 100
f3(x) 18.45201 19.15436 22.47760 21.59577 22.40543 66.47121 100
f4(x) 30.03168 31.62765 35.22960 33.92801 35.06384 85.92661 100
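As expected, the dedicated Rcpp implementation shift_na is the fastest, but f3 and f4 are not much slower. A few details:
f1 and f2 call apply, which is built on an R for loop, so it is not surprising that they are slow.
f3 and f4 have to allocate memory for is.na(x) and row(x), which could be a hindrance for a large enough x.
f3 is faster than f4 because the "radix" method of order uses a faster algorithm when the range (maximum minus minimum) of the integer vector being sorted is less than 100000 (see ?sort). Here, the ranges are:
is.na(x): 1
row(x): 99999
is.na(x) + (row(x) - 1L) * 2L + 1L: 199999
shift_na(x) creates a copy of x and modifies the copy in place. If you cannot or do not want to allocate memory for a copy because x is very large, you can call shift_na_in_place(x) to modify x in place.
shift_na_in_place should be preferred over shift_na if you have a data frame data containing character variables, rather than a character matrix. In that situation there is no need to preserve the intermediate as.matrix(data); it can be modified in place:
x <- as.matrix(data)
shift_na_in_place(x)
newdata <- as.data.frame(x)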
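If no C++ toolchain is available, the same data frame round trip can be sketched in base R using the ordering idea from f3. The helper name shift_na_df is hypothetical and not part of the answer above. The sketch assumes a data frame whose columns are all character, restores the original column names (which f3 drops), and does not keep row names.

```r
# Hypothetical helper: shift non-NA cells to the left in a character data frame
shift_na_df <- function(data) {
  x <- as.matrix(data)
  # Order the cells by row; inside each row, non-NA cells come before NA cells.
  # The radix method is stable, so ties keep their original column order.
  idx <- order(row(x), is.na(x), method = "radix")
  y <- matrix(x[idx], nrow = nrow(x), ncol = ncol(x), byrow = TRUE)
  out <- as.data.frame(y, stringsAsFactors = FALSE)
  names(out) <- names(data)
  out
}

df <- data.frame(x = c("l", "m", NA, NA, "p"),
                 y = c(NA, "b", "c", NA, NA),
                 z = c("u", NA, "w", "x", "y"),
                 stringsAsFactors = FALSE)
newdf <- shift_na_df(df)
print(newdf)
```

Unlike shift_na_in_place, this version allocates the intermediate vectors row(x) and is.na(x), so it trades memory for not needing compiled code.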