Calculating the probability that a student fails two consecutive exams?

Q:

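I am working with the R programming language. I have the following dataset: students take an exam multiple times, and each time they either pass ("1") or fail ("0"). The data looks like this: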

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
7992   1       1      2004-04-23           1
24837  1       0      2004-12-10           2
12331  1       1      2007-01-19           3
34396  1       0      2007-02-21           4
85250  1       0      2007-09-26           5
11254  1       1      2009-12-20           6

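I would like to know: given that a student failed an exam, what is the probability that the student fails the next exam? (For example, having failed the first exam, what is the probability of failing the second? Having failed the 5th exam, what is the probability of failing the 6th?) I wrote the following loop to answer this question: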

my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        # all exams of student i (my_data is already ordered by id and date)
        start_i = my_data[my_data$id == i, ]

        # pair each result with the result of the following exam
        pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i = as.data.frame(table(pairs_i))
        frame_i$id = i
        print(frame_i)
        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)


library(dplyr)
total_1 = final %>% group_by(first, second) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = "first", all = TRUE)
join$probs = join$totals/join$sum

The final answer looks like this:

  first second totals   sum     probs
1     0      0   9817 19557 0.5019686
2     0      1   9740 19557 0.4980314
3     1      0   9727 19498 0.4988717
4     1      1   9771 19498 0.5011283
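
The same kind of first-order table can also be built without a loop. The following is only a minimal sketch, assuming dplyr is available and using a small hand-made data frame `toy` (hypothetical, not the simulated `my_data`); the column name `sums` is likewise just an illustrative choice. It pairs each result with the next result of the same student and sums the pair counts directly:

```r
library(dplyr)

# Hypothetical toy data: student 1 took 4 exams, student 2 took 3
toy <- data.frame(
  id          = c(1, 1, 1, 1, 2, 2, 2),
  exam_number = c(1, 2, 3, 4, 1, 2, 3),
  results     = c(0, 0, 1, 0, 1, 0, 0)
)

transitions <- toy %>%
  arrange(id, exam_number) %>%
  group_by(id) %>%
  mutate(second = dplyr::lead(results)) %>%  # result of the same student's next exam
  ungroup() %>%
  filter(!is.na(second)) %>%                 # a student's last exam has no successor
  count(first = results, second, name = "totals") %>%
  group_by(first) %>%
  mutate(sums = sum(totals), probs = totals / sums) %>%
  ungroup()

print(transitions)
```

In this toy data the pairs are (0,0), (0,1), (1,0) for student 1 and (1,0), (0,0) for student 2, so the estimated probability of failing again after a failure is 2/3.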

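Now I am trying to modify the code above to do the analysis at the "second level". That is, I want to find the probability that a student fails the next exam given that the student failed the two previous exams. For example, suppose a student failed the 3rd and 4th exams: what is the probability that the student fails the 5th exam?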

I think the final answer should look something like this:

# note: "first" and "second" refer to any consecutive exams (not the literal first and second exam), and "third" refers to the exam occurring immediately after the "second" exam

  first second third totals sums probs
1     1      1     1    ...  ...   ...
2     0      0     0    ...  ...   ...
3     1      0     1    ...  ...   ...
4     0      1     0    ...  ...   ...
5     0      0     1    ...  ...   ...
6     1      1     0    ...  ...   ...
7     0      1     1    ...  ...   ...
8     1      0     0    ...  ...   ...

I tried to manually modify my code to do this:

library(stringr)
my_list = list()

for (i in 1:length(unique(my_data$id))) {
    tryCatch({
        start_i = my_data[my_data$id == i, ]

        # collapse the result history into one value, then count each 3-exam pattern
        vals_i = as.numeric(paste(start_i$results, collapse = ""))
        L_1_i = lengths(gregexpr("111", vals_i))
        L_2_i = lengths(gregexpr("000", vals_i))
        L_3_i = lengths(gregexpr("101", vals_i))
        L_4_i = lengths(gregexpr("010", vals_i))
        L_5_i = lengths(gregexpr("001", vals_i))
        L_6_i = lengths(gregexpr("110", vals_i))
        L_7_i = lengths(gregexpr("011", vals_i))
        L_8_i = lengths(gregexpr("100", vals_i))

        frame_i = data.frame(class = c("111","000","101","010","001","110","011","100"),
                             values = c(L_1_i, L_2_i, L_3_i, L_4_i, L_5_i, L_6_i, L_7_i, L_8_i))
        frame_i$id = i
        print(frame_i)

        my_list[[i]] = frame_i
    }, error = function(e){})
}

final = do.call(rbind.data.frame, my_list)

final$first = substr(final$class, 1,1)
final$second = substr(final$class, 2,2)
final$third = substr(final$class, 3,3)

total_1 = final %>% group_by(first, second, third) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first, second) %>% summarise(sum = sum(totals))

join = merge(x = total_1, y = total_2, by = c("first", "second"), all = TRUE)
join$probs = join$totals/join$sum

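As an example, to answer the original question, we can now query the results (assuming I did this correctly). The full join table looks like this:

  first second third totals   sum probs
1     0      0     0  10000 20000   0.5
2     0      0     1  10000 20000   0.5
3     0      1     0  10000 20000   0.5
4     0      1     1  10000 20000   0.5
5     1      0     0  10000 20000   0.5
6     1      0     1  10000 20000   0.5
7     1      1     0  10000 20000   0.5
8     1      1     1  10000 20000   0.5

Here is the probability of a student passing an exam, given that the student failed the two previous exams:

join[join$first == 0 & join$second == 0 & join$third == 1, ]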
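
A few base R behaviours may explain why every row of the table above is exactly 10000 / 20000. The snippet below is a small illustration on made-up inputs (the data frame `final_toy` and all variable names are hypothetical): converting the pasted history with as.numeric() drops leading zeros, gregexpr() reports a "no match" as a length-1 result and does not count overlapping matches, and counting rows is not the same as summing the `values` column.

```r
# 1) as.numeric() drops leading zeros, so the exam history is altered
no_leading_zero <- as.numeric(paste(c(0, 0, 1), collapse = ""))   # 1, not "001"

# 2) gregexpr() returns -1 when nothing matches; that result still has length 1
no_match_len <- lengths(gregexpr("000", "1111"))                  # 1, although nothing matched

# 3) gregexpr() does not count overlapping matches
overlap_len <- lengths(gregexpr("000", "00000"))                  # 1, yet "00000" holds 3 overlapping triples

# 4) counting rows per class vs. summing the per-student counts
final_toy <- data.frame(class  = rep(c("000", "001"), times = 2),
                        values = c(3, 0, 1, 2))
row_counts <- table(final_toy$class)                         # 2 rows per class, whatever `values` holds
value_sums <- tapply(final_toy$values, final_toy$class, sum) # 4 for "000", 2 for "001"
```

Since every student contributes one row for each of the eight classes, a row count per class is simply the number of students, which would make all conditional probabilities come out as 0.5 regardless of the data.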

I am not sure whether what I did is correct. Can someone help me correct my code (and perhaps show me a "better" way to do this)?

Note: these probabilities look too "uniform", which makes me think I did something wrong...
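
One possible way to get the second-level table without string matching is to look back two exams within each student. This is a sketch only, assuming dplyr and a small hypothetical data frame `toy` rather than the simulated `my_data`; the column name `sums` follows the desired output shown earlier in the question.

```r
library(dplyr)

# Hypothetical toy data: student 1 took 5 exams, student 2 took 4
toy <- data.frame(
  id          = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  exam_number = c(1, 2, 3, 4, 5, 1, 2, 3, 4),
  results     = c(0, 0, 0, 1, 0, 0, 0, 1, 1)
)

triples <- toy %>%
  arrange(id, exam_number) %>%
  group_by(id) %>%
  mutate(first  = dplyr::lag(results, 2),   # result two exams earlier
         second = dplyr::lag(results, 1),   # result one exam earlier
         third  = results) %>%              # current exam
  ungroup() %>%
  filter(!is.na(first)) %>%                 # the first two exams have no full history
  count(first, second, third, name = "totals") %>%
  group_by(first, second) %>%
  mutate(sums = sum(totals), probs = totals / sums) %>%
  ungroup()

# P(fail | failed the two previous exams) in the toy data
triples[triples$first == 0 & triples$second == 0 & triples$third == 0, ]
```

Here the overlapping triples are 000, 001, 010 for student 1 and 001, 011 for student 2, so the estimate for failing after two failures is 1/3. Patterns that never occur (for example 111) are simply absent from the table.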

r string loops dplyr data-manipulation

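Since we defined each student as having a 50/50 chance of failing each exam, you will find that every answer gives a 50% chance of passing the next exam, even for students who failed the previous two. The question seems to assume that failing may predict future performance, but the randomly generated data we defined does not really let us find that out. Looking at the binomial distribution may help you get what you need.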
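
To illustrate the point, here is a rough sketch of data in which a failure does make the next failure more likely. The two probabilities `p_fail_after_fail` and `p_fail_after_pass` are invented for the illustration, and a single long exam history is simulated instead of many students:

```r
set.seed(1)

n_exams <- 20000
p_fail_after_fail <- 0.7   # hypothetical: failing makes another failure more likely
p_fail_after_pass <- 0.3   # hypothetical

results <- integer(n_exams)   # 0 = fail, 1 = pass
results[1] <- 1L
for (t in 2:n_exams) {
  p_fail <- if (results[t - 1] == 0L) p_fail_after_fail else p_fail_after_pass
  results[t] <- as.integer(runif(1) >= p_fail)   # draw below p_fail means a fail (0)
}

previous <- head(results, -1)
following <- tail(results, -1)

# estimated P(fail | previous exam failed); should be close to p_fail_after_fail
est_fail_after_fail <- mean(following[previous == 0L] == 0L)
est_fail_after_fail
```

With independent 50/50 draws, as in the question, the same estimate would hover around 0.5 instead.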

A:

4 upvotes geom_na 12/10/2022 #1

P(A|B) = P(B|A)*P(A)/P(B)

where A is the event of failing exam n, and B is the event of failing exams n-1 and n-2.

We can write a function to compute P(A|B):

library(dplyr)
library(tidyr)  # pivot_wider() lives in tidyr, not dplyr

d <- my_data
# one row per student, one column per exam number (column n+1 holds exam n)
d %>% 
  tidyr::pivot_wider(id_cols = id, names_from = exam_number, values_from = results, names_sort = TRUE) -> dd

p_fail_given_failfail <- function(n){ #n is integer > 2
  #check if the student took the exam n. Remove them if they didn't.
  indx.na <- is.na(dd[[n+1]])
  dd <- dd[!indx.na, ]
  
  #Calculate probabilities
  fail_n <- dd[[n+1]] == 0                     # A: failed exam n
  fail_prev2 <- dd[[n]] == 0 & dd[[n-1]] == 0  # B: failed exams n-1 and n-2
  p_BA <- sum(fail_n & fail_prev2) / sum(fail_n)
  p_A <- mean(fail_n)
  p_B <- mean(fail_prev2)

  p_AB <- p_BA*p_A/p_B
  return(p_AB)
}

p_fail_given_failfail(3) #prob of failing exam3, given failed exam2 and exam1
#0.5084479 (from the original run; the data are random and unseeded, so the value varies)
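
Note that when all three probabilities are estimated from the same rows, the Bayes expression collapses to a plain ratio of counts, count(A and B) / count(B). A small check on a made-up result matrix `m` (hypothetical values; rows are students, columns are exams 1 to 3):

```r
# Hypothetical 0/1 results: 0 = fail, 1 = pass
m <- matrix(c(0, 0, 0,
              0, 0, 1,
              0, 0, 0,
              1, 0, 0,
              0, 1, 0,
              1, 1, 1), ncol = 3, byrow = TRUE)

A <- m[, 3] == 0                  # failed exam 3
B <- m[, 1] == 0 & m[, 2] == 0    # failed exams 1 and 2

p_bayes  <- (sum(A & B) / sum(A)) * mean(A) / mean(B)   # P(B|A) * P(A) / P(B)
p_direct <- sum(A & B) / sum(B)                         # P(A|B) counted directly

c(bayes = p_bayes, direct = p_direct)
```

Both values are 2/3 here: three students failed exams 1 and 2, and two of them also failed exam 3.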

## Define parameters
students = 10000L
tests = 3L

## simulate tests. 
##  Generates students x tests matrix of test results.
test_results = replicate(tests,
                         sample(c(TRUE,FALSE),
                                students,
                                replace = TRUE))

## determine student set that were failures up to test - 1
complete_failures = rowSums(test_results[, seq_len(tests-1), drop = FALSE]) == 0L
final_test = test_results[complete_failures, tests]

## summarize results
prop.table(table(final_test))
#> final_test
#>     FALSE      TRUE 
#> 0.5150162 0.4849838

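The OP may be more interested in the binomial distribution, that is, the probability of a given number of successes in n trials. In that case we can use something like pbinom. The calls below evaluate P(X <= 0.5), which is the probability of zero successes in n trials (failing every exam):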

pbinom(0.5, 1, 0.5)
#> [1] 0.5
pbinom(0.5, 2, 0.5)
#> [1] 0.25
pbinom(0.5, 3, 0.5)
#> [1] 0.125
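
As a cross-check of what these pbinom calls compute, the sketch below compares them with the point probability dbinom(0, n, 0.5) and with 0.5^n, and also shows the probability of exactly one success for comparison. The variable names are illustrative only.

```r
n <- 1:3

p_all_fail_cdf <- pbinom(0.5, n, 0.5)   # P(X <= 0.5), the same as P(X = 0)
p_all_fail_pmf <- dbinom(0, n, 0.5)     # P(X = 0) directly
p_all_fail_pow <- 0.5^n                 # closed form for n independent 50/50 failures

p_exactly_one <- dbinom(1, n, 0.5)      # P(exactly one pass in n exams)

rbind(p_all_fail_cdf, p_all_fail_pmf, p_all_fail_pow, p_exactly_one)
```

The first three rows agree (0.5, 0.25, 0.125), while the probability of exactly one pass is 0.5, 0.5 and 0.375 for n = 1, 2, 3.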

0 upvotes Sutanu Majumdar 12/20/2022 #3

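Agree with @cole, but it could instead be the negative binomial (see the R function rnbinom, here in terms of p, n, r), since n = the number of exams before the first pass (r = 1), p = 0.5 (the probability of success on each exam), and each exam is an i.i.d. Bernoulli trial.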
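
A brief sketch of that idea using the density functions from base R's stats package. In R's parameterisation, dnbinom(k, size, prob) gives the probability of k failures before the size-th success, so size = 1 corresponds to r = 1 above; the variable names are illustrative.

```r
k <- 0:3   # number of failed exams before the first pass

p_nb   <- dnbinom(k, size = 1, prob = 0.5)   # negative binomial with one required success
p_geom <- dgeom(k, prob = 0.5)               # geometric distribution, the size = 1 special case
p_hand <- 0.5^k * 0.5                        # k failures followed by one pass

rbind(p_nb, p_geom, p_hand)
```

All three rows give 0.5, 0.25, 0.125, 0.0625, so for example the chance of failing exactly twice before the first pass is 0.125.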
