Displaying top 5 keywords for every category using group_by function

Asked by: kartik trivedi | Asked: 11/6/2023 | Last edited by: Mark | Updated: 11/7/2023 | Views: 46

Q:

I am trying to find the top 5 keywords in the reviews for each product category. I have the following code:

# Group by category and count keyword frequencies
keyword_counts <- filtered_data %>%
  group_by(category, keyword) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# Find the top 5 keywords in each category
top_keywords_by_category <- keyword_counts %>%
  group_by(category) %>%
  top_n(5, wt = n) %>%
  ungroup()  # Ungroup the data

# Print the table
print(top_keywords_by_category)

which gives this output:

   category                                                        keyword     n
   <chr>                                                           <chr>   <int>
 1 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… product   354
 2 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… cable     277
 3 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… chargi…   200
 4 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… quality   179
 5 Computers&Accessories|Accessories&Peripherals|Cables&Accessori… nice      147
 6 Electronics|WearableTechnology|SmartWatches                     watch     129
 7 Electronics|Mobiles&Accessories|Smartphones&BasicMobiles|Smart… phone     127
 8 Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions   tv        117
 9 Electronics|WearableTechnology|SmartWatches                     product   102
10 Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions   product    80

whereas the result I want is:

Category Computers&Accessories
Keyword             n
1 Product          354
2 Cable            277
3 Chargi...        200
4 Quality          179
5 Nice             147
r dplyr group-by tokenize summarize

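Comments

1 upvote jpsmith 11/6/2023
Could you edit your question to provide a sample of filtered_data? For example, include dput(head(filtered_data, 25)).
1 upvote r2evans 11/6/2023
Kartiktrivedi, you have been a member for 2 months, and on most of your questions someone has commented asking you to add representative sample data, either by linking to minimal reproducible examples (you can also read stackoverflow.com/q/5963269, among many other discussions/examples) or by explicitly suggesting the use of dput(.). Please note that spending time trying to find your data, or parsing it into something we can actually use, is a tangible time sink that often keeps me (and perhaps others) from even trying to help. Good luck.
0 upvotes r2evans 11/6/2023
Also, what does the electron tag have to do with this question? If you hover over it, it describes itself as relating to a "framework ... to write cross-platform desktop applications using HTML, CSS and Javascript", which does not seem to match anything in your question. Stack's tag recommendation system is imperfect; please always review what it recommends and limit the tags to the relevant ones.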

A:

0 upvotes r2evans 11/6/2023 #1

The data here is uninteresting, but it should show you how to use tidyr::separate_rows.

quux <- structure(list(category = c("Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Computers&Accessories|Accessories&Peripherals|Cables&Accessori…", "Electronics|WearableTechnology|SmartWatches", "Electronics|Mobiles&Accessories|Smartphones&BasicMobiles|Smart…", "Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions",  "Electronics|WearableTechnology|SmartWatches", "Electronics|HomeTheater,TV&Video|Televisions|SmartTelevisions"),
                       keyword = c("product", "cable", "chargi…", "quality", "nice", "watch", "phone", "tv", "product", "product"), 
                       n = c(354L, 277L, 200L, 179L, 147L, 129L, 127L, 117L, 102L, 80L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

library(dplyr)
quux %>%
  tidyr::separate_rows(category, sep = "\\|") %>%
  count(category, keyword) %>%
  arrange(desc(n))
# # A tibble: 32 × 3
#    category                keyword     n
#    <chr>                   <chr>   <int>
#  1 Electronics             product     2
#  2 Accessories&Peripherals cable       1
#  3 Accessories&Peripherals chargi…     1
#  4 Accessories&Peripherals nice        1
#  5 Accessories&Peripherals product     1
#  6 Accessories&Peripherals quality     1
#  7 Cables&Accessori…       cable       1
#  8 Cables&Accessori…       chargi…     1
#  9 Cables&Accessori…       nice        1
# 10 Cables&Accessori…       product     1
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows

From here, you can filter to the top 5 and pivot:

quux %>%
  tidyr::separate_rows(category, sep = "\\|") %>%
  count(category, keyword) %>%
  slice_max(n = 5, order_by = n, with_ties = FALSE) %>%
  tidyr::pivot_wider(names_from = category, values_from = n, values_fill = list(n = 0))
# # A tibble: 4 × 3
#   keyword Electronics `Accessories&Peripherals`
#   <chr>         <int>                     <int>
# 1 product           2                         1
# 2 cable             0                         1
# 3 chargi…           0                         1
# 4 nice              0                         1
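
Note that count() returns an ungrouped result, so the slice_max() call above picks the top 5 rows of the whole table rather than 5 per category. If the goal is the per-category layout shown in the question, one possible approach is to group by category before slicing and then split the result into one small table per category. The following is only a sketch under stated assumptions: the keyword_counts data is made up for illustration, and the names tables_by_category and category_name are hypothetical.

```r
library(dplyr)

# Made-up stand-in for the asker's keyword_counts
# (one row per category/keyword pair); values are illustrative only.
keyword_counts <- data.frame(
  category = c(rep("Cables", 6), rep("SmartWatches", 3)),
  keyword  = c("product", "cable", "quality", "nice", "good", "fast",
               "watch", "product", "battery"),
  n        = c(354L, 277L, 200L, 179L, 147L, 90L, 129L, 102L, 40L)
)

# Keep the 5 most frequent keywords within each category.
top_keywords_by_category <- keyword_counts %>%
  group_by(category) %>%
  slice_max(order_by = n, n = 5, with_ties = FALSE) %>%
  ungroup()

# Split into one keyword/n table per category.
tables_by_category <- split(
  top_keywords_by_category[, c("keyword", "n")],
  top_keywords_by_category$category
)

# Print each category name followed by its own table.
for (category_name in names(tables_by_category)) {
  cat("Category", category_name, "\n")
  print(as.data.frame(tables_by_category[[category_name]]), row.names = FALSE)
  cat("\n")
}
```

With with_ties = FALSE each category yields at most 5 rows; a category with fewer than 5 keywords simply returns all of them.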