How to merge multiple variables and create a new data set?

Asked by: Sri Sreshtan  Asked: 4/16/2020  Updated: 4/16/2020  Views: 166

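Q:

https://www.kaggle.com/nowke9/ipldata contains the IPL data.

This is an exploratory study of the IPL data set (link to the data above). After merging the two files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I want to combine these newly created variables into a single data frame. When I assign them to one variable, batsman_aggregate, and select only the required columns, I get an error message.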

    library(tidyverse)
    deliveries_tbl <- read.csv("deliveries_edit.csv")
    matches_tbl <- read.csv("matches.csv")

    combined_matches_deliveries_tbl <- deliveries_tbl %>%
    left_join(matches_tbl, by = c("match_id" = "id"))

    # Add team score and team extra columns for each match, each inning.
    total_score_extras_combined <- combined_matches_deliveries_tbl%>%
    group_by(id, inning, date, batting_team, bowling_team, winner)%>%
    mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
    mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
    group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
    select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
    distinct(total_score, total_extras)%>%
    glimpse()%>%
    ungroup()


    # Batsman Aggregate (Runs Balls, fours, six , Sr)
    # Batsman score in each match
    batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
        group_by(id, inning, batting_team, batsman)%>%
        mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
        distinct(total_batsman_runs)%>%
        glimpse()%>%
        ungroup()

    # Number of deliveries played.
    balls_faced <- combined_matches_deliveries_tbl %>%
        filter(wide_runs == 0)%>%
        group_by(id, inning, batsman)%>%
        summarise(deliveries_played = n())%>%
        ungroup()

    # Number of 4 and 6s by a batsman in each match.
    fours_hit <- combined_matches_deliveries_tbl %>%
        filter(batsman_runs == 4)%>%
        group_by(id, inning, batsman)%>%
        summarise(fours_hit = n())%>%
        glimpse()%>%
        ungroup()

    sixes_hit <- combined_matches_deliveries_tbl %>%
        filter(batsman_runs == 6)%>%
        group_by(id, inning, batsman)%>%
        summarise(sixes_hit = n())%>%
        glimpse()%>%
        ungroup()

    batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
        select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)

The error message is:

Error: `select()` doesn't handle lists.

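The desired output is a data set built from the newly constructed variables.

r list select dplyr

Comments

0 votes Edward 4/16/2020
The last command, where you use c, is the problem. You probably want to join them in some way.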

A:

1 vote Edward 4/16/2020 #1

You have to join these four tables rather than combine them with c.
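To see why select() complains, it may help to look at what c() does with data frames: it concatenates their columns into one plain list instead of matching rows. The sketch below uses two made-up tibbles (runs_tbl and balls_tbl are hypothetical stand-ins for the tables in the question, not the real data):

```r
library(dplyr)

# Hypothetical toy tables standing in for batsman_score_in_a_match and balls_faced
runs_tbl  <- tibble(id = c(1, 1), batsman = c("A", "B"), total_batsman_runs = c(14, 40))
balls_tbl <- tibble(id = c(1, 1), batsman = c("A", "B"), deliveries_played = c(8, 31))

# c() flattens both tibbles into a single list of columns; it is no longer a data frame
combined <- c(runs_tbl, balls_tbl)
class(combined)   # "list"
names(combined)   # id and batsman now appear twice

# A join keeps the result a tibble and lines rows up on the shared key columns
joined <- left_join(runs_tbl, balls_tbl, by = c("id", "batsman"))
class(joined)
```

Because combined is a bare list, dplyr verbs such as select() have nothing tabular to work on, which matches the error in the question.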

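The join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.

I've omitted by because dplyr will assume you want c("id", "inning", "batsman"), the only 3 columns common to all four data sets.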

batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)

# A tibble: 11,335 x 7
      id inning batsman       total_batsman_runs deliveries_played fours_hit sixes_hit
   <int>  <int> <fct>                      <int>             <dbl>     <dbl>     <dbl>
 1     1      1 DA Warner                     14                 8         2         1
 2     1      1 S Dhawan                      40                31         5         0
 3     1      1 MC Henriques                  52                37         3         2
 4     1      1 Yuvraj Singh                  62                27         7         3
 5     1      1 DJ Hooda                      16                12         0         1
 6     1      1 BCJ Cutting                   16                 6         0         2
 7     1      2 CH Gayle                      32                21         2         3
 8     1      2 Mandeep Singh                 24                16         5         0
 9     1      2 TM Head                       30                22         3         0
10     1      2 KM Jadhav                     31                16         4         1
# ... with 11,325 more rows
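If relying on dplyr to infer the join keys feels too implicit, the keys can be spelled out. The following is only a sketch on made-up data: scores, fours, key_cols and toy_aggregate are hypothetical names, and replace_na() from tidyr is used here as one possible alternative to replace(is.na(.), 0), not as the method used in this answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tables; batsman "C" batted but hit no fours
scores <- tibble(id = 1L, inning = 1L, batsman = c("A", "B", "C"),
                 total_batsman_runs = c(14L, 40L, 0L))
fours  <- tibble(id = 1L, inning = 1L, batsman = c("A", "B"),
                 fours_hit = c(2L, 5L))

# The three columns shared by all the summary tables
key_cols <- c("id", "inning", "batsman")

toy_aggregate <- scores %>%
  left_join(fours, by = key_cols) %>%            # explicit keys, no guessing by dplyr
  mutate(fours_hit = replace_na(fours_hit, 0L))  # unmatched batsmen get 0 instead of NA

toy_aggregate
```

Naming the keys also guards against an accidental join on an extra column if one of the tables later gains another shared column name.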

There are also 2 batsmen who didn't face any deliveries:

batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
     id inning batsman        total_batsman_runs deliveries_played fours_hit sixes_hit
  <int>  <int> <fct>                       <int>             <dbl>     <dbl>     <dbl>
1   482      2 MK Pandey                       0                 0         0         0
2  7907      1 MJ McClenaghan                  2                 0         0         0

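One of them apparently scored 2 runs! So I think the batsman_runs column has some errors. The record of that match clearly shows that in the second-last over of the first innings, 2 wides were scored, not runs to the batsman.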
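One way to look for this kind of inconsistency is to filter for deliveries recorded as wides that also credit runs to the batsman. This is a sketch on invented rows: toy_deliveries and suspect_rows are hypothetical, the values are not from the real file, and only the column names wide_runs and batsman_runs come from the question.

```r
library(dplyr)

# Hypothetical ball-by-ball rows using the column names from the question
toy_deliveries <- tibble(
  id           = c(1L, 1L, 1L),
  batsman      = c("P", "P", "Q"),
  wide_runs    = c(1L, 1L, 0L),
  batsman_runs = c(1L, 1L, 4L)
)

# Wides count as extras, so runs credited to the batsman on a wide are suspect
suspect_rows <- toy_deliveries %>%
  filter(wide_runs > 0, batsman_runs > 0)

suspect_rows
```

Running the same filter on combined_matches_deliveries_tbl should show whether the row above is an isolated data-entry error or a recurring one.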