Create a group number for each consecutive sequence

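Question:

I have the data.frame below. I want to add a column `g` that classifies my data according to the consecutive sequences in the column `h_no`. That is, the first sequence of `h_no` (1, 2, 3, 4) is group 1, the second sequence of `h_no` (1 to 7) is group 2, and so on, as shown in the last column `g`.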

h_no   h_freq    h_freqsq g
1     0.09091 0.008264628 1
2     0.00000 0.000000000 1
3     0.04545 0.002065702 1
4     0.00000 0.000000000 1  
1     0.13636 0.018594050 2
2     0.00000 0.000000000 2
3     0.00000 0.000000000 2
4     0.04545 0.002065702 2
5     0.31818 0.101238512 2
6     0.00000 0.000000000 2
7     0.50000 0.250000000 2 
1     0.13636 0.018594050 3 
2     0.09091 0.008264628 3
3     0.40909 0.167354628 3
4     0.04545 0.002065702 3

Tags: r, dataframe, sequence

# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))

# find where each sequence starts (wherever a 1 appears)
from <- which(your.df$no == 1)
to <- c((from - 1)[-1], nrow(your.df)) # row at which each sequence ends

# for each sequence, repeat its group number once per row in that sequence
# SIMPLIFY = FALSE keeps the result a list even if all sequences have equal length
get.seq <- mapply(from, to, seq_along(from), FUN = function(x, y, z) {
            len <- length(seq(from = x[1], to = y[1]))
            return(rep(z, times = len))
         }, SIMPLIFY = FALSE)

# unlisting gives a single vector, which we append to the original data.frame
your.df$group <- unlist(get.seq)
# since this designates a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)


   no     h_freq   h_freqsq group
1   1 0.40998238 0.06463876     1
2   2 0.98086928 0.33093795     1
3   3 0.28908651 0.74077119     1
4   4 0.10476768 0.56784786     1
5   1 0.75478995 0.60479945     2
6   2 0.26974011 0.95231761     2
7   3 0.53676266 0.74370154     2
8   4 0.99784066 0.37499294     2
9   5 0.89771767 0.83467805     2
10  6 0.05363139 0.32066178     2
11  7 0.71741529 0.84572717     2
12  1 0.10654430 0.32917711     3
13  2 0.41971959 0.87155514     3
14  3 0.32432646 0.65789294     3
15  4 0.77896780 0.27599187     3
16  5 0.06100008 0.55399326     3

Process

At the moment we only care about the `h_no` column, so we can extract it from the data frame:

> h_no <- data$h_no

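We want to detect when `h_no` does not go up, which we can do by finding where the difference between successive elements is negative or zero. R provides the `diff` function, which gives us the vector of differences: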

> d.h_no <- diff(h_no)
> d.h_no
 [1]  1  1  1 -3  1  1  1  1  1  1 -6  1  1  1

Once we have these, finding the non-positive ones is simple:

> nonpos <- d.h_no <= 0
> nonpos
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[13] FALSE FALSE

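In R, `TRUE` and `FALSE` are basically the same as `1` and `0`, so if we take the cumulative sum of `nonpos`, it will increase by 1 at (almost) the appropriate points. The `cumsum` function (which is basically the opposite of `diff`) can do this.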

> cumsum(nonpos)
 [1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
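The two facts used above (logicals behave as 0/1 in arithmetic, and `cumsum` undoes `diff` apart from the starting value) can be checked on a small toy vector. This is only an illustrative sketch; `flags`, `running`, `x` and `rebuilt` are made-up names, not part of the original answer.

```r
# Logical values are coerced to numbers in arithmetic: TRUE -> 1, FALSE -> 0
flags <- c(FALSE, TRUE, FALSE, FALSE, TRUE)
sum(flags)                   # counts the TRUEs: 2
running <- cumsum(flags)     # running count of TRUEs: 0 1 1 1 2

# cumsum() undoes diff(), up to the starting value that diff() discards
x <- c(3, 5, 6, 10)
rebuilt <- x[1] + c(0, cumsum(diff(x)))
identical(rebuilt, x)        # TRUE
```

The discarded starting value is why the cumulative sum of `nonpos` is one element shorter than `h_no`.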

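However, there are two problems: the numbers are one too small, and we are missing the first element (there should be four in the first class).

The first problem is easily solved: `1+cumsum(nonpos)`. The second just requires adding a `1` to the front of the vector, since the first element is always in class `1`: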

 > classes <- c(1, 1 + cumsum(nonpos))
 > classes
  [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3

Now we can attach it back onto the data frame with `cbind` (by using the `class=` syntax, we can give the column the heading `class`):

 > data_w_classes <- cbind(data, class=classes)

`data_w_classes` now contains the result.
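For anyone who wants to run the steps end to end, here is a self-contained sketch. It assumes `data` holds only the `h_no` column from the question (the frequency columns play no part in the grouping), so the data frame built here is a stand-in rather than the asker's actual object.

```r
# Rebuild the h_no column from the question (stand-in for the real data)
data <- data.frame(h_no = c(1:4, 1:7, 1:4))

h_no    <- data$h_no
d.h_no  <- diff(h_no)                  # one element shorter than h_no
nonpos  <- d.h_no <= 0                 # TRUE where a new sequence starts
classes <- c(1, 1 + cumsum(nonpos))    # shift to start at 1, restore first row

data_w_classes <- cbind(data, class = classes)
table(data_w_classes$class)            # group sizes 4, 7 and 4
```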

Final result

We can compress these lines together and wrap them all up into a function to make it easier to use:

classify <- function(data) {
   cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}

Or, since it makes sense for `class` to be a factor:

classify <- function(data) {
   cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}

You can use either function like this:

> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
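As a quick sanity check, the factor version of `classify` can be compared against the `g` column the question asks for. The data frame below is rebuilt from the question's `h_no` and `g` values only; the one-row call at the end is an extra edge case added for illustration, not something the original answer discusses.

```r
classify <- function(data) {
   cbind(data, class = factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}

# The question's h_no column together with the desired g column
data <- data.frame(h_no = c(1:4, 1:7, 1:4),
                   g    = rep(1:3, times = c(4, 7, 4)))

classified <- classify(data)
# compare the factor labels with the expected group numbers
all(as.integer(as.character(classified$class)) == classified$g)

# a one-row data frame still works: diff() returns an empty vector
single <- classify(data.frame(h_no = 1))
```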

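(This way of solving the problem is good because it avoids explicit iteration, which is what is generally recommended in R, and it avoids generating lots of intermediate vectors, lists and so on. It is also kind of neat that it can be written on one line :) )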

13 upvotes user1333396 4/14/2012 #4

Simple: say your data frame is A

b <- A[,1]
b <- b==1
b <- cumsum(b)

Then you get the column b.
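One caveat worth spelling out: counting the 1s relies on every sequence starting at 1, which holds for the data in the question. The sketch below uses a made-up vector `x` where that assumption fails, and compares the result with the `diff`-based test from the other answer.

```r
# hypothetical h_no values: three runs, only the last one starts at 1
x <- c(3:5, 2:4, 1:2)

by_ones <- cumsum(x == 1)                  # 0 0 0 0 0 0 1 1
by_diff <- c(1, 1 + cumsum(diff(x) <= 0))  # 1 1 1 2 2 2 3 3
```

When every run does start at 1, as in the question, the two approaches give the same groups.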

mytb<-read.table(text="h_no  h_freq  h_freqsq group
1     0.09091 0.008264628 1
2     0.00000 0.000000000 1
3     0.04545 0.002065702 1
4     0.00000 0.000000000 1  
1     0.13636 0.018594050 2
2     0.00000 0.000000000 2
3     0.00000 0.000000000 2
4     0.04545 0.002065702 2
5     0.31818 0.101238512 2
6     0.00000 0.000000000 2
7     0.50000 0.250000000 2 
1     0.13636 0.018594050 3 
2     0.09091 0.008264628 3
3     0.40909 0.167354628 3
4     0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL

positionsof1s<-which(mytb$h_no==1)   # rows where a new sequence starts; grep(1, ...) would also match 10, 11, ...

mytb$newgroup<-unlist(mapply(function(x,y) 
  rep(x,y),                      # repeat group number x, y times
  x= seq_along(positionsof1s),   # x is the group number, 1 to number of groups = g1:g3
  y= c( diff(positionsof1s),     # y is the number of rows in each group up to the penultimate one (g1, g2) = 4, 7
        nrow(mytb)-              # this line and the next give the number of rows in the last group (g3)
          (positionsof1s[length(positionsof1s )]-1 )  # number of rows - (start position of the last group - 1)
      ),
  SIMPLIFY=FALSE ) )             # keep a list even when all groups have the same size
mytb
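Because `rep` accepts a vector for `times`, the same idea can probably be written without `mapply`. This is a sketch of that variant, not the answer's own code; it assumes the first row starts a sequence (that is, `h_no[1] == 1`), and `starts`, `run_lengths` and `newgroup` are illustrative names.

```r
h_no <- c(1:4, 1:7, 1:4)                           # h_no column from the question

starts      <- which(h_no == 1)                    # rows where a new sequence begins
run_lengths <- diff(c(starts, length(h_no) + 1))   # number of rows in each sequence
newgroup    <- rep(seq_along(starts), times = run_lengths)
```

Appending `length(h_no) + 1` as a sentinel lets a single `diff` call produce the length of the last sequence as well.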

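0 upvotes Gregor Thomas 3/28/2021 #8

The `data.table` function `rleid` is very handy for things like this. We subtract the sequence `1:nrow(data)` to turn each consecutive sequence into a constant, and then use `rleid` to create the group IDs: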

data$g = data.table::rleid(data$h_no - 1:nrow(data))
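To see why the subtraction works, the intermediate values can be inspected with base R alone. In this sketch `run_id` is a hypothetical stand-in built on `rle`, written only to mimic run numbering on a plain vector; it is not part of `data.table`.

```r
# hypothetical base-R helper: number the runs of equal values in a vector
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), times = r$lengths)
}

h_no   <- c(1:4, 1:7, 1:4)
offset <- h_no - seq_along(h_no)   # constant within each consecutive run
g      <- run_id(offset)           # 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3

# A gap counts as a break here, unlike the diff(h_no) <= 0 test
gappy     <- c(1, 2, 4, 5)
gap_runs  <- run_id(gappy - seq_along(gappy))     # 1 1 2 2
gap_diffs <- c(1, 1 + cumsum(diff(gappy) <= 0))   # 1 1 1 1
```

So this approach groups strictly consecutive values (steps of exactly 1), while the `diff`-based answer only starts a new group when the value fails to increase.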

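1 upvote Maël 5/5/2023 #9

One nice option is `collapse::seqid`, which creates unique IDs from consecutive sequences of numbers. The function is highly optimized and flexible: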

collapse::seqid(df$h_no)
#[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3

collapse::seqid(c(1:5, 7:10))
#[1] 1 1 1 1 1 2 2 2 2

collapse::seqid(c(1:5, 7:10), del = 2) #With a delimitation of 2
#[1] 1 2 3 4 5 5 6 7 8

collapse::seqid(c(1, NA, 2), na.skip = TRUE)
#[1]  1 NA  1
