Can I compare the frequency of values within one column?

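Question:

Say I have a dataset with a single column: the column is a categorical variable with 5 levels (a, b, c, d, e). How can I compare the frequency of each level? Is there a way to do this? Thanks.

I tried, but couldn't work it out.

r compare proportions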

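Close, but not quite! Moreover, this does not tell us which group is overrepresented. For that, we can do a very dumb brute-force permutation test: randomly sample the group variable for as many trials as there are in the original data, 1000 times, and count how often each group's simulated count is greater than its observed count. If randomization frequently gives a given group a count greater than the one seen in the actual data, that group is probably not overrepresented.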
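The code that follows relies on a data frame df and a frequency table df_counts that are not shown in this part of the answer. A minimal sketch of how they might be built, assuming a column named x (a hypothetical name) and group sizes copied from the Freq column of the printed output further down:

```r
# Rebuild a data set whose group sizes match the Freq column in the output
# (the column name `x` is an assumption made for illustration)
df <- data.frame(x = rep(c("a", "b", "c", "d", "e"), times = c(2, 4, 4, 3, 10)))

# Tabulate the groups; as.data.frame() on a one-way table
# produces the columns Var1 (a factor) and Freq
df_counts <- as.data.frame(table(df$x))

# Observed proportion of each group
df_counts$prop <- df_counts$Freq / sum(df_counts$Freq)
```

Because the original setup is not visible here, rerunning the simulation on this reconstruction may not reproduce the printed numbers exactly.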

# initialize permutation count columns
df_counts$n_greater <- rep(0, nrow(df_counts))
df_counts$n_lesser <- rep(0, nrow(df_counts))
set.seed(123)  # for reproducible "randomness"
# simulate 1000 random apportionments of group memberships to the observed number of trials
n_permut <- 1000
for(i in 1:n_permut) {
  # random "draw" of group variables
  # draw as many trials as were observed (same as nrow(df) for the original data)
  sim <- sample(df_counts$Var1, sum(df_counts$Freq), replace = TRUE)
  sim_df <- as.data.frame(table(sim))
  # for each group, was the number of randomized calls greater or lesser than observed?
  # increment counters accordingly
  df_counts$n_greater <- df_counts$n_greater + as.numeric(sim_df$Freq > df_counts$Freq)
  df_counts$n_lesser <- df_counts$n_lesser + as.numeric(sim_df$Freq < df_counts$Freq)
}
# the permutation test p-values are simply the proportion of simulations with greater or lesser counts
df_counts$p_greater <- df_counts$n_greater/n_permut
df_counts$p_lesser <- df_counts$n_lesser/n_permut
# we will use Bonferroni correction on the p-values, because of the multiple comparisons that we've performed
df_counts$p_greater <- p.adjust(df_counts$p_greater, method='bonferroni', n=nrow(df_counts) * 2)
df_counts$p_lesser <- p.adjust(df_counts$p_lesser, method='bonferroni', n=nrow(df_counts) * 2)
print(df_counts)

  Var1 Freq       prop n_greater n_lesser p_greater p_lesser
1    a    2 0.08695652       867       49      1.00     0.49
2    b    4 0.17391304       521      287      1.00     1.00
3    c    4 0.17391304       514      292      1.00     1.00
4    d    3 0.13043478       672      157      1.00     1.00
5    e   10 0.43478261         1      990      0.01     1.00

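So, by this fairly basic approach, group e has a significant p-value for overrepresentation (Bonferroni-adjusted p = 0.01), while the other groups are not significant in either direction.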
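The same simulation can be sketched more compactly with rmultinom, which draws all the random count vectors at once. This is a variation on the loop above, not the answer's own code: the counts are copied from the Freq column, and the names observed, n_sim, p_greater and p_lesser are chosen here for illustration.

```r
# Observed counts, copied from the Freq column of the printed output
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)
n_trials <- sum(observed)
k <- length(observed)

set.seed(123)    # for reproducibility
n_sim <- 10000   # number of simulated data sets (an arbitrary choice)

# Each column of `sims` is one simulated set of k group counts,
# drawn with equal probability for every group
sims <- rmultinom(n_sim, size = n_trials, prob = rep(1 / k, k))

# Per group: share of simulations with a count strictly greater / strictly
# less than observed (`observed` is recycled down each column)
p_greater <- rowMeans(sims > observed)
p_lesser <- rowMeans(sims < observed)

# Bonferroni correction across all 2 * k one-sided comparisons
p_greater_adj <- p.adjust(p_greater, method = "bonferroni", n = 2 * k)
p_lesser_adj <- p.adjust(p_lesser, method = "bonferroni", n = 2 * k)

print(data.frame(group = names(observed), observed = as.vector(observed),
                 p_greater_adj, p_lesser_adj))
```

The strict comparisons mirror the loop above. Counting ties as well (>= and <=) is the more conventional and more conservative definition of a simulated p-value, and it gives larger p-values, so it is worth checking whether the conclusion survives that change.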

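You can call chisq.test(df_counts$Freq) to perform a chi-squared test comparing your distribution with a theoretical uniform distribution, but this will only tell you whether your distribution is non-uniform (or, if you test it against another known distribution, whether the two are likely different). It won't tell you which group is driving the difference, and I'm not 100% sure which test would tell you that. This may be a question for Cross Validated, the Stack Exchange site that focuses on statistics.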
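One common follow-up, offered here as a sketch and not as the answer's own method, is to inspect the standardized residuals that chisq.test returns in its stdres component. The counts are copied from the Freq column of the printed output, and the variable names are illustrative.

```r
# Observed counts, copied from the Freq column of the printed output
observed <- c(a = 2, b = 4, c = 4, d = 3, e = 10)

# Goodness-of-fit test against equal probabilities (the default for `p`).
# The expected count per group is 23 / 5 = 4.6, which is below 5, so R warns
# that the chi-squared approximation may be inaccurate; the warning is
# suppressed here only to keep the output short.
fit <- suppressWarnings(chisq.test(observed))

print(fit$statistic)  # chi-squared statistic
print(fit$p.value)    # p-value of the overall test

# Standardized residuals: (observed - expected) scaled by its standard error
print(fit$stdres)
```

For these counts the overall p-value comes out at roughly 0.07, just short of the usual 0.05 threshold. Residuals beyond about ±2 are often read as the groups contributing most to the departure from uniformity, but that is a rule of thumb, not a formal per-group test.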

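0 votes, C. Murtaugh, 6/30/2023

Actually, one can do a permutation test that asks how the observed frequency of a given group compares with the frequencies obtained at random over the same number of trials. The way I've done it is very dumb and simplistic, but it also doesn't make many assumptions about the distribution of your data. I'll edit my answer accordingly.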
