Asked by: stats_noob Asked: 12/10/2022 Last edited by: stats_noob Updated: 12/22/2022 Views: 384
Calculating the probability that a student fails two consecutive exams?
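Q:
I am working with the R programming language. I have the following dataset: students take an exam multiple times, and each attempt is either a pass ("1") or a fail ("0"). The data looks like this: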
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
      id results date_exam_taken exam_number
7992   1       1      2004-04-23           1
24837  1       0      2004-12-10           2
12331  1       1      2007-01-19           3
34396  1       0      2007-02-21           4
85250  1       0      2007-09-26           5
11254  1       1      2009-12-20           6
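I would like to know: given that a student failed an exam, what is the probability that the student also fails the next exam? (For example, having failed the first exam, what is the probability of failing the second? Having failed the 5th exam, what is the probability of failing the 6th?) I wrote the following loop to answer this question: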
my_list = list()
for (i in 1:length(unique(my_data$id)))
{
{tryCatch({
start_i = my_data[my_data$id == i,]
pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
frame_i = as.data.frame(table(pairs_i))
frame_i$id = i
print(frame_i)
my_list[[i]] = frame_i
}, error = function(e){})
}}
final = do.call(rbind.data.frame, my_list)
#################################################
library(dplyr)
total_1 = final %>% group_by(first, second) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first) %>% summarise(sum = sum(totals))
join = merge(x = total_1, y = total_2, by = "first", all = TRUE)
join$probs = join$totals/join$sum
The final answer looks like this:
first second totals sum probs
1 0 0 9817 19557 0.5019686
2 0 1 9740 19557 0.4980314
3 1 0 9727 19498 0.4988717
4 1 1 9771 19498 0.5011283
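Now I am trying to modify the code above so that the analysis is done at the "second level". That is, I want to find the probability that a student fails the next exam given that the student failed the previous two exams. For example, suppose a student failed the 3rd and 4th exams: what is the probability that the student fails the 5th exam?
I think the final answer should look something like this: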
# note: "first" and "second" refer to any consecutive exams (not the literal first and second exam), and "third" refers to the exam occurring immediately after the "second" exam
first second third totals sums probs
1 1 1 1 ... ... ...
2 0 0 0 ... ... ...
3 1 0 1 ... ... ...
4 0 1 0 ... ... ...
5 0 0 1 ... ... ...
6 1 1 0 ... ... ...
7 0 1 1 ... ... ...
8 1 0 0 ... ... ...
I tried to modify my code manually to accommodate this requirement:
library(stringr)
my_list = list()
for (i in 1:length(unique(my_data$id)))
{
{tryCatch({
start_i = my_data[my_data$id == i,]
vals_i = as.numeric(paste(start_i$results, collapse = ""))
L_1_i = lengths(gregexpr("111", vals_i))
L_2_i = lengths(gregexpr("000", vals_i))
L_3_i = lengths(gregexpr("101", vals_i))
L_4_i = lengths(gregexpr("010", vals_i))
L_5_i = lengths(gregexpr("001", vals_i))
L_6_i = lengths(gregexpr("110", vals_i))
L_7_i = lengths(gregexpr("011", vals_i))
L_8_i = lengths(gregexpr("100", vals_i))
frame_i = data.frame(class = c("111","000","101","010","001","110","011","100"), values = c(L_1_i, L_2_i, L_3_i, L_4_i, L_5_i, L_6_i, L_7_i, L_8_i))
frame_i$id = i
print(frame_i)
my_list[[i]] = frame_i
}, error = function(e){})
}}
final = do.call(rbind.data.frame, my_list)
final$first = substr(final$class, 1,1)
final$second = substr(final$class, 2,2)
final$third = substr(final$class, 3,3)
total_1 = final %>% group_by(first, second, third) %>% summarise(totals = n())
total_2 = total_1 %>% group_by(first, second) %>% summarise(sum = sum(totals))
join = merge(x = total_1, y = total_2, by = c("first", "second"), all = TRUE)
join$probs = join$totals/join$sum
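As an example, to answer the original question we can now query the results (assuming I did this correctly). Below is the probability that a student passes an exam given that the student failed the previous two exams: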
first second third totals sum probs
1 0 0 0 10000 20000 0.5
2 0 0 1 10000 20000 0.5
3 0 1 0 10000 20000 0.5
4 0 1 1 10000 20000 0.5
5 1 0 0 10000 20000 0.5
6 1 0 1 10000 20000 0.5
7 1 1 0 10000 20000 0.5
8 1 1 1 10000 20000 0.5
join[join$first == 0 & join$second == 0 & join$third == 1,]
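I am not sure whether what I have done is correct. Can someone help me correct my code (and perhaps show me a "better" way to do this)?
Note: these probabilities look too "uniform", which makes me think I have done something wrong...
A: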
P(A|B) = P(B|A)*P(A)/P(B)
where A is the event of failing exam n, and B is the event of failing exams n-1 and n-2.
We can write a function to calculate P(A|B):
library(dplyr)
library(tidyr)
# One row per student, one column per exam number (column 1 is id, column n+1 is exam n)
dd <- my_data %>%
  tidyr::pivot_wider(id_cols = id, names_from = "exam_number", values_from = "results")
p_fail_given_failfail <- function(n){ #n is integer > 2
  #check if the student took the exam n. Remove them if they didn't.
  dd <- dd[!is.na(dd[[n+1]]), ]
  fail_n <- dd[[n+1]] == 0                    # event A: failed exam n
  fail_prev <- dd[[n]] == 0 & dd[[n-1]] == 0  # event B: failed exams n-1 and n-2
  #Calculate probabilities
  p_BA <- sum(fail_n & fail_prev) / sum(fail_n)  # P(B|A)
  p_A <- mean(fail_n)                            # P(A)
  p_B <- mean(fail_prev)                         # P(B)
  p_AB <- p_BA*p_A/p_B                           # P(A|B) by Bayes' rule
  return(p_AB)
}
p_fail_given_failfail(3) #prob of failing exam3, given failed exam2 and exam1
#0.5084479
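The function above conditions on one specific exam number. To get the full first/second/third table the question asks for, one possible sketch is to slide a window of three over each student's results with `dplyr::lag()`. The `toy` data frame below is hypothetical and stands in for `my_data` already sorted by `id` and `date_exam_taken`; the column names `totals`, `sums` and `probs` follow the question's desired layout. Unlike matching patterns on a pasted string, this counts overlapping windows and does not lose leading zeros.

```r
library(dplyr)

# Hypothetical toy data: two students, results already ordered by exam date
toy <- data.frame(
  id      = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  results = c(0, 0, 0, 1, 0, 1, 0, 0, 1)
)

triples <- toy %>%
  group_by(id) %>%
  # for every exam, look back at the two previous results of the same student
  mutate(first = lag(results, 2), second = lag(results, 1), third = results) %>%
  ungroup() %>%
  filter(!is.na(first)) %>%                       # a student's first two exams have no full window
  count(first, second, third, name = "totals") %>%
  group_by(first, second) %>%
  mutate(sums = sum(totals), probs = totals / sums) %>%  # P(third | first, second)
  ungroup()

print(triples)
```

On the real data, the row with `first == 0`, `second == 0` and `third == 0` would then hold the estimated probability of failing right after two consecutive failures.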
This approach generalizes the question to: "If I have a coin-toss distribution, what is the probability of failing test n + 1 after failing the previous n tests?"
## Define parameters
students = 10000L
tests = 3L
## simulate tests.
## Generates students x tests matrix of test results.
test_results = replicate(tests,
sample(c(TRUE,FALSE),
students,
replace = TRUE))
## determine student set that were failures up to test - 1
complete_failures = rowSums(test_results[, seq_len(tests-1), drop = FALSE]) == 0L
final_test = test_results[complete_failures, tests]
## summarize results
prop.table(table(final_test))
#> final_test
#> FALSE TRUE
#> 0.5150162 0.4849838
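A minimal sketch of the same simulation wrapped in a function, so the number of tests and the pass probability can be varied. The helper name `sim_fail_after_failures` and its `p_pass` argument are assumptions made for illustration, not part of the code above. With independent tests the estimate should stay close to `1 - p_pass` regardless of how many earlier tests were failed.

```r
# Hypothetical helper: estimate P(fail the last test | failed every earlier test)
sim_fail_after_failures <- function(students, tests, p_pass = 0.5) {
  # students x tests logical matrix; TRUE = pass, each cell independent
  res <- matrix(runif(students * tests) < p_pass, nrow = students, ncol = tests)
  # students with zero passes on tests 1 .. tests - 1
  all_failed_before <- rowSums(res[, seq_len(tests - 1), drop = FALSE]) == 0L
  # share of those students who also fail the last test
  mean(!res[all_failed_before, tests])
}

set.seed(42)
est <- sim_fail_after_failures(100000L, 3L, p_pass = 0.5)
est
```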
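The OP may be more interested in the binomial distribution, that is, the probability of a given number of successes in n trials (the calls below evaluate P(X <= 0.5), which is the probability of no successes at all). In that case we can use something like pbinom: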
pbinom(0.5, 1, 0.5)
#> [1] 0.5
pbinom(0.5, 2, 0.5)
#> [1] 0.25
pbinom(0.5, 3, 0.5)
#> [1] 0.125
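As a small check on what these calls compute: `pbinom()` is the cumulative distribution function, so with a quantile of 0.5 it returns P(X = 0), the probability of failing all n exams. The sketch below compares it with `dbinom()` and the closed form, and shows `dbinom(1, ...)` for the case of exactly one success. The variable names are illustrative only.

```r
n <- 1:3          # number of exams
p_pass <- 0.5     # assumed pass probability per exam

all_fail_cdf <- pbinom(0.5, n, p_pass)   # P(X <= 0.5) = P(X = 0)
all_fail_pmf <- dbinom(0, n, p_pass)     # same quantity via the probability mass function
all_fail_closed <- (1 - p_pass)^n        # closed form: 0.5, 0.25, 0.125

exactly_one <- dbinom(1, n, p_pass)      # exactly one pass in n exams: 0.5, 0.5, 0.375

print(rbind(all_fail_cdf, all_fail_pmf, all_fail_closed, exactly_one))
```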
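Agree with @cole, but it may instead be the negative binomial (R function rnbinom (p,n,r)), since n = the number of exams before the first pass (r = 1), p = 0.5 (the probability of success on each exam), and each exam is an i.i.d. Bernoulli trial.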
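A short sketch of that idea, assuming independent exams with pass probability 0.5: with a single required success, the negative binomial reduces to the geometric distribution, so `dnbinom(k, size = 1, prob)` and `dgeom(k, prob)` both give the probability of exactly k failures before the first pass. The variable names are illustrative only.

```r
k <- 0:3          # number of failed exams before the first pass
p_pass <- 0.5     # assumed pass probability per exam

nb  <- dnbinom(k, size = 1, prob = p_pass)   # negative binomial with r = 1
geo <- dgeom(k, prob = p_pass)               # geometric distribution, same quantity
closed <- (1 - p_pass)^k * p_pass            # closed form: 0.5, 0.25, 0.125, 0.0625

print(data.frame(failures_before_pass = k, nb, geo, closed))
```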