Calculating the probability that a student fails two consecutive exams?

Asked by: stats_noob Asked: 12/10/2022 Last edited by: stats_noob Updated: 12/22/2022 Views: 384

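Q:

I am working with the R programming language. I have the following dataset: students take exams multiple times, and each time they either pass ("1") or fail ("0"). The data look like this: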

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
7992   1       1      2004-04-23           1
24837  1       0      2004-12-10           2
12331  1       1      2007-01-19           3
34396  1       0      2007-02-21           4
85250  1       0      2007-09-26           5
11254  1       1      2009-12-20           6

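I am interested in knowing: given that a student failed an exam, what is the probability that this student fails the next exam? (For example, the student failed the first exam: what is the probability of failing the second? The student failed the 5th exam: what is the probability of failing the 6th?) I wrote the following loop to answer this question: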

my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        start_i = my_data[my_data$id == i,]

        pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i = as.data.frame(table(pairs_i))
        frame_i$id = i
        print(frame_i)
        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)

library(dplyr)
total_1 = final %>% group_by(first, second) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = "first", all = TRUE)
join$probs = join$totals/join$sum

The final answer looks like this:

  first second totals   sum     probs
1     0      0   9817 19557 0.5019686
2     0      1   9740 19557 0.4980314
3     1      0   9727 19498 0.4988717
4     1      1   9771 19498 0.5011283

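Now I am trying to modify the code above to carry out the analysis at the "second level". That is, I want to find the probability that a student fails the next exam given that the student failed the previous two exams. For example, suppose a student failed the 3rd and 4th exams: what is the probability that the student fails the 5th exam?

I think the final answer should look something like this: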

# note: "first" and "second" refer to any consecutive exams (not the literal first and second exam), and "third" refers to the exam occurring immediately after the "second" exam

  first second third totals sums probs
1     1      1     1    ...  ...   ...
2     0      0     0    ...  ...   ...
3     1      0     1    ...  ...   ...
4     0      1     0    ...  ...   ...
5     0      0     1    ...  ...   ...
6     1      1     0    ...  ...   ...
7     0      1     1    ...  ...   ...
8     1      0     0    ...  ...   ...

I tried to manually modify my code to do this:

library(stringr)
my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        start_i = my_data[my_data$id == i,]

        vals_i = as.numeric(paste(start_i$results, collapse = ""))
        L_1_i = lengths(gregexpr("111", vals_i))
        L_2_i = lengths(gregexpr("000", vals_i))
        L_3_i = lengths(gregexpr("101", vals_i))
        L_4_i = lengths(gregexpr("010", vals_i))
        L_5_i = lengths(gregexpr("001", vals_i))
        L_6_i = lengths(gregexpr("110", vals_i))
        L_7_i = lengths(gregexpr("011", vals_i))
        L_8_i = lengths(gregexpr("100", vals_i))

        frame_i = data.frame(class = c("111","000","101","010","001","110","011","100"), values = c(L_1_i, L_2_i, L_3_i, L_4_i, L_5_i, L_6_i, L_7_i, L_8_i))
        frame_i$id = i
        print(frame_i)

        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)

final$first = substr(final$class, 1,1)
final$second = substr(final$class, 2,2)
final$third = substr(final$class, 3,3)

total_1 = final %>% group_by(first, second, third) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first, second) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = c("first", "second"), all = TRUE)
join$probs = join$totals/join$sum

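As an example, to answer the original question, we can now query the results (assuming I did this correctly). The table below includes the probability that a student passes an exam given that the student failed the previous two exams:

  first second third totals   sum probs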
1     0      0     0  10000 20000   0.5
2     0      0     1  10000 20000   0.5
3     0      1     0  10000 20000   0.5
4     0      1     1  10000 20000   0.5
5     1      0     0  10000 20000   0.5
6     1      0     1  10000 20000   0.5
7     1      1     0  10000 20000   0.5
8     1      1     1  10000 20000   0.5

join[join$first == 0 & join$second == 0 & join$third == 1, ]

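I am not sure whether what I have done is correct. Can someone help me correct my code (and possibly show me a "better" way to do this)?

Note: these probabilities look too "uniform", which makes me think I did something wrong...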
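
One possible reason the totals above come out perfectly uniform: each student contributes exactly one row per class to final, so summarise(totals = n()) counts students rather than summing the values column. In addition, lengths(gregexpr(...)) returns 1 even when there is no match, gregexpr only finds non-overlapping matches, and as.numeric() drops leading zeros from the pasted string. Below is a minimal sketch of a sliding-window alternative. The small toy dataset and the helper name count_triples are assumptions made for illustration, not part of the original post.

```r
# Hypothetical helper (not from the original post): list every run of three
# consecutive results in one student's date-ordered exam history.
count_triples <- function(x) {
  n <- length(x)
  if (n < 3) return(NULL)  # fewer than 3 exams: no triples to report
  data.frame(first = x[1:(n - 2)], second = x[2:(n - 1)], third = x[3:n])
}

# Toy data built the same way as my_data, only smaller (assumed sizes).
set.seed(1)
id <- sample.int(100, 2000, replace = TRUE)
results <- sample(c(1, 0), 2000, replace = TRUE)
date_exam_taken <- sample(seq(as.Date("1999/01/01"), as.Date("2020/01/01"), by = "day"),
                          2000, replace = TRUE)
toy <- data.frame(id, results, date_exam_taken)
toy <- toy[order(toy$id, toy$date_exam_taken), ]

# Stack the triples of all students, then count each (first, second, third) combination.
triples <- do.call(rbind, lapply(split(toy$results, toy$id), count_triples))
counts <- as.data.frame(table(triples), responseName = "totals")

# Conditional probability of "third" given ("first", "second").
counts$sum <- ave(counts$totals, counts$first, counts$second, FUN = sum)
counts$probs <- counts$totals / counts$sum
print(counts)
```

Because the toy results are drawn 50/50, the probabilities should still land near 0.5, but the totals now vary from row to row instead of all being equal.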

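r string loops dplyr data-manipulation

Comments

1 upvote Cole 12/19/2022
Since we defined each student as having a 50/50 chance of failing each event, you will find that every answer gives a 50% chance of success on the next event, even if the student failed the previous two. The question seems to lean toward the assumption that failure may be predictive of future performance, but the randomly generated data we defined does not really let us figure that out. Looking at the binomial distribution may help you get what you need.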

A:

4 upvotes geom_na 12/10/2022 #1
P(A|B) = P(B|A)*P(A)/P(B)

where A is the event of failing exam n, and B is the event of failing exams n-1 and n-2.

We can write a function to compute P(A|B):

library(dplyr)   # for %>%

d <- my_data
# pivot_wider() lives in tidyr, not dplyr; this gives one column per exam_number
d %>%
  tidyr::pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results") -> dd

p_fail_given_failfail <- function(n){ #n is integer > 2
  # Column 1 of dd is id, so exam n sits in column n + 1.
  # Check if the student took exam n. Remove them if they didn't.
  dd <- dd[!is.na(dd[[n + 1]]), ]

  fail_n    <- dd[[n + 1]] == 0                 # event A: failed exam n
  fail_prev <- dd[[n]] == 0 & dd[[n - 1]] == 0  # event B: failed exams n-1 and n-2

  #Calculate probabilities
  p_BA <- sum(fail_n & fail_prev) / sum(fail_n)
  p_A <- mean(fail_n)
  p_B <- mean(fail_prev)

  p_AB <- p_BA*p_A/p_B
  return(p_AB)
}

p_fail_given_failfail(3) #prob of failing exam3, given failed exam2 and exam1
#0.5084479
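
Since P(B|A)*P(A) is just P(A and B), the same estimate can be obtained directly as a ratio of counts, and the idea extends to any number of previous failures and to either outcome on exam n. The sketch below assumes a students x exams matrix named wide with NA for exams not taken. That matrix, the helper name p_next_given_fails, and its parameters k and outcome are hypothetical and not from the answer above.

```r
# Hypothetical helper: estimate P(result on exam n == outcome | failed the k exams before it).
# `wide` is a students x exams matrix of 0/1 results (NA = exam not taken).
p_next_given_fails <- function(wide, n, k = 2, outcome = 0) {
  stopifnot(n > k, n <= ncol(wide))
  took_n <- !is.na(wide[, n])                            # students who sat exam n
  prior <- wide[took_n, (n - k):(n - 1), drop = FALSE]   # their k previous exams
  failed_prior <- rowSums(prior == 0, na.rm = TRUE) == k # failed all k of them
  mean(wide[took_n, n][failed_prior] == outcome)
}

# Toy 50/50 data (assumed): 5000 students, 6 exams each.
set.seed(42)
wide <- matrix(sample(c(0, 1), 5000 * 6, replace = TRUE), ncol = 6)

p_next_given_fails(wide, n = 3)                      # fail exam 3 given failed exams 1 and 2
p_next_given_fails(wide, n = 4, k = 3, outcome = 1)  # pass exam 4 given failed exams 1 to 3
```

With outcome = 1 and k = 3, this covers the case of passing the next exam after three straight failures.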

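Comments

0 upvotes stats_noob 12/11/2022
Thank you so much for your answer! I was wondering whether this code could be adapted to calculate the probability of passing the next exam given that the previous 3 exams were all failed?

3 upvotes Cole 12/19/2022 #2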

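This approach generalizes the answer to the question: "If I have a coin-flip distribution, what is the probability of failing test n + 1 after failing the first n tests?"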

## Define parameters
students = 10000L
tests = 3L

## simulate tests. 
##  Generates students x tests matrix of test results.
test_results = replicate(tests,
                         sample(c(TRUE,FALSE),
                                students,
                                replace = TRUE))

## determine student set that were failures up to test - 1
complete_failures = rowSums(test_results[, seq_len(tests-1), drop = FALSE]) == 0L
final_test = test_results[complete_failures, tests]

## summarize results
prop.table(table(final_test))
#> final_test
#>     FALSE      TRUE 
#> 0.5150162 0.4849838
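
For contrast, here is a minimal sketch of what the same summary might show if failures were not independent. The "sticky" failure probabilities below are assumptions chosen purely for illustration. Nothing in the question's data suggests these values.

```r
set.seed(123)
students <- 10000L
tests <- 3L
p_fail_after_fail <- 0.7  # assumed: chance of failing right after a failure
p_fail_otherwise <- 0.5   # assumed: chance of failing otherwise (and on the first test)

# passed[i, j] is TRUE when student i passed test j.
passed <- matrix(NA, nrow = students, ncol = tests)
passed[, 1] <- runif(students) >= p_fail_otherwise
for (j in 2:tests) {
  p_fail <- ifelse(passed[, j - 1], p_fail_otherwise, p_fail_after_fail)
  passed[, j] <- runif(students) >= p_fail
}

# Same summary as above: students who failed every test before the last one.
complete_failures <- rowSums(passed[, seq_len(tests - 1), drop = FALSE]) == 0L
prop.table(table(passed[complete_failures, tests]))
```

Under these assumed parameters the FALSE share should sit near 0.7 rather than 0.5, which is the kind of signal the purely random data cannot produce.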

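The OP may be more interested in the binomial distribution, that is, the probability of a given number of successes in n trials. In that case we can use something like pbinom; with q = 0.5, the calls below give the probability of zero successes in 1, 2, and 3 trials: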

pbinom(0.5, 1, 0.5)
#> [1] 0.5
pbinom(0.5, 2, 0.5)
#> [1] 0.25
pbinom(0.5, 3, 0.5)
#> [1] 0.125
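
To make explicit what these calls return: pbinom(q, size, prob) is the cumulative probability P(X <= q), so with q = 0.5 it equals P(X = 0), the chance of no passes at all. A short sketch comparing it with dbinom, which gives the probability of an exact count (the variable names are illustrative only):

```r
p_pass <- 0.5
n <- 1:3

# P(X <= 0.5) is the same as P(X = 0): every one of the n exams is failed.
zero_passes <- dbinom(0, size = n, prob = p_pass)
cbind(n,
      pbinom_half = pbinom(0.5, size = n, prob = p_pass),
      dbinom_zero = zero_passes,
      closed_form = (1 - p_pass)^n)

# Probability of exactly one pass in n exams, for comparison.
exactly_one <- dbinom(1, size = n, prob = p_pass)
exactly_one
#> [1] 0.500 0.500 0.375
```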
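
0 upvotes Sutanu Majumdar 12/20/2022 #3

Agree with @cole, but it could instead be the negative binomial (R's rnbinom (p,n,r)), since n = the number of exams before the first pass (r=1), p=0.5 (the probability of success on each exam), and each exam is an i.i.d. Bernoulli trial.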
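
A minimal sketch of this idea, assuming p = 0.5 and r = 1 as stated above. Note that in base R the arguments are named differently from the (p,n,r) shorthand: the signature is rnbinom(n, size, prob), where n is the number of draws, size plays the role of r, and the random variable counts failures before the size-th success.

```r
p_pass <- 0.5
k <- 0:4  # number of failed exams before the first pass

# With size = 1 the negative binomial reduces to the geometric distribution.
nb <- dnbinom(k, size = 1, prob = p_pass)
geo <- dgeom(k, prob = p_pass)
cbind(k, nb, geo, closed_form = p_pass * (1 - p_pass)^k)

# Simulated check: share of students who fail at least 2 exams before passing.
set.seed(1)
draws <- rnbinom(1e5, size = 1, prob = p_pass)
mean(draws >= 2)  # should be close to (1 - p_pass)^2
```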

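Comments

0 upvotes Community 12/22/2022
As it is currently written, your answer is unclear. Please edit it to add further details that will help others understand how it addresses the question asked. You can find more information on how to write good answers in the help center.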