Manipulate data from nested df and lists

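Question:

I want to study some stocks or financial indices. I use the yf_get function from the yfR package to download data from Yahoo Finance. This function returns a df with a large number of variables. I want to select some of them and create a df with only the variables I need. Here is my problem: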

library(yfR)
library(tidyverse)

Symbols <- c("^GSPC", "^FTSE")

StartDate <- "2010-01-01"
EndDate <- "2019-12-31"

RawData <- yf_get (Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily" ,do_complete_data = TRUE)

# Here is the initial structure of the RawData df

str(RawData)

tibble [5,039 × 11] (S3: tbl_df/tbl/data.frame)
 $ ticker                : chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
 $ ref_date              : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ price_open            : num [1:5039] 5413 5500 5522 5530 5527 ...
 $ price_high            : num [1:5039] 5500 5536 5536 5552 5549 ...
 $ price_low             : num [1:5039] 5411 5481 5498 5500 5495 ...
 $ price_close           : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ volume                : num [1:5039] 7.51e+08 1.15e+09 9.98e+08 1.16e+09 1.01e+09 ...
 $ price_adjusted        : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ ret_adjusted_prices   : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
 $ ret_closing_prices    : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
 $ cumret_adjusted_prices: num [1:5039] 1 1 1.01 1 1.01 ...
 - attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ ticker              : chr [1:2] "^FTSE" "^GSPC"
  ..$ dl_status           : chr [1:2] "OK" "OK"
  ..$ n_rows              : int [1:2] 2524 2515
  ..$ perc_benchmark_dates: num [1:2] 0.982 1
  ..$ threshold_decision  : chr [1:2] "KEEP" "KEEP"

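Note that the indices or stocks we want may have different lengths (different numbers of observations) because of national holidays and the like. So here we have 2515 observations for GSPC and 2524 for FTSE. Suppose I want to keep the columns ref_date, price_adjusted, and ticker (the last one to use later as some kind of filter). I piped up to a certain point, like this: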

Returns <- RawData %>% 
    select(ref_date, price_adjusted, ticker) %>% 
    rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)


# And we end up with this 

str(Returns)

tibble [5,039 × 3] (S3: tbl_df/tbl/data.frame)
 $ Date  : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Price : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ Ticker: chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
 - attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ ticker              : chr [1:2] "^FTSE" "^GSPC"
  ..$ dl_status           : chr [1:2] "OK" "OK"
  ..$ n_rows              : int [1:2] 2524 2515
  ..$ perc_benchmark_dates: num [1:2] 0.982 1
  ..$ threshold_decision  : chr [1:2] "KEEP" "KEEP"

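Here is my problem. I want the final product to be a df with 4 columns (Date_Stock1, Price_Stock1, Date_Stock2, Price_Stock2). If I had 3 stocks and 3 variables, the final product would be a df with 9 columns (Date_Stock1, Price_Stock1, Volume_Stock1, Date_Stock2, Price_Stock2, Volume_Stock2, Date_Stock3, Price_Stock3, Volume_Stock3).

I tried using filter and subset from tidyr, but I failed. My best attempt used pivot_wider, which gave me a df with 4 columns and 1 row, where each cell holds a list of values, and I don't know how to turn those back into a df.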

Returns <- RawData %>%
  select(ref_date, price_adjusted, ticker) %>% 
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker) %>% 
  pivot_wider(names_from = "Ticker", values_from = c(Date, Price))

# Also received this warning
Warning message:
Values from `Date` and `Price` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(Ticker) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 

str(Returns)

tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
 $ Date_^FTSE :List of 1
  ..$ : Date[1:2524], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Date_^GSPC :List of 1
  ..$ : Date[1:2515], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Price_^FTSE:List of 1
  ..$ : num [1:2524] 5500 5522 5530 5527 5534 ...
 $ Price_^GSPC:List of 1
  ..$ : num [1:2515] 1133 1137 1137 1142 1145 ...

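How can I reach my end goal? Some kind of mutate, a for loop, or maybe map, I don't know. I don't know how to handle these functions. I have only watched tutorials and could not get them to work. Any ideas?

r dplyr tidyr data-manipulation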

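I understand what you are saying, but think of it this way. From Monday to Thursday both the London and New York exchanges are open, but Friday happens to be July 4, so the New York exchange is closed and I will have no observation for that date. The London exchange, however, is open. So even if the overall difference is small and 99% of the observations share the same dates, you cannot just drop the other 1%. Each variable must have its own Date column. The only way to have common dates is if all the stocks come from the same country.

0 votes Nir Graham 11/3/2023

You would have the date, July 4; London would have an entry there and New York would be NA. Nothing is ignored. New York has no data for July 4, and that will be reflected in your data... I simply don't understand what you intend other than this. Do the rows/observations in your original concept have any meaning, or are they arbitrary? A typical row/observation should convey some information about what its values have in common.

0 votes Hans Larsen 11/4/2023

Yes, you are right. You can have a common date, but is it a good idea? I think inserting NA values would break my subsequent analysis.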
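The shared-date layout discussed in these comments can be tried on a small made-up example. This is only a sketch: the `toy` data and its prices are invented for illustration and are not Yahoo Finance output. It shows that keeping Date as the row identifier gives one value per cell, and that rows with a closed market can be dropped afterwards if NA values are a concern.

```r
library(dplyr)
library(tidyr)

# Invented toy data: GSPC has no row for 2019-07-04 (US holiday)
toy <- tibble(
  Date   = as.Date(c("2019-07-03", "2019-07-04", "2019-07-05",
                     "2019-07-03", "2019-07-05")),
  Price  = c(7609, 7604, 7553, 2996, 2990),
  Ticker = c("FTSE", "FTSE", "FTSE", "GSPC", "GSPC")
)

# Date stays as the row identifier, so each cell holds exactly one value
shared <- toy %>%
  pivot_wider(names_from = Ticker, values_from = Price,
              names_prefix = "Price_") %>%
  arrange(Date)

# Optional: keep only the dates on which every market traded
complete_days <- drop_na(shared)
```

Here `shared` has one row per calendar date that appears in any series, with NA where a market was closed, while `complete_days` keeps only the rows without NA.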

Answers:

0 votes Mox 11/3/2023 #1

To convert a list containing values back into a data frame, do the following:

bro.df<-do.call(rbind.data.frame, data_list)
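
The one-liner above does not say what `data_list` contains. As a minimal sketch, assume it is a named list of data frames with identical columns (that is an assumption, and the data below is invented). Under that assumption the call stacks the data frames row-wise, so the result is in long format rather than the wide layout the question asks for.

```r
# Hypothetical data_list: one data frame per ticker, same columns in each
data_list <- list(
  A = data.frame(Date  = as.Date(c("2019-07-03", "2019-07-05")),
                 Price = c(2996, 2990)),
  B = data.frame(Date  = as.Date(c("2019-07-03", "2019-07-04", "2019-07-05")),
                 Price = c(7609, 7604, 7553))
)

# rbind.data.frame() is applied to all list elements at once,
# appending the rows of B below the rows of A
stacked <- do.call(rbind.data.frame, data_list)
```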

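0 votes thiagomagero 11/4/2023 #2

The challenge you face involves reshaping a dataset from long to wide format while handling series of different lengths, since each stock/index has a different number of observations. This is a common problem with time series data of this kind. The pivot_wider function from the tidyverse is indeed a good choice for the task. However, as you observed, applying pivot_wider directly produces a list in each cell: once Date is moved into values_from, no column is left to identify individual rows, so all values for a ticker collapse into a single cell.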
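
One way to keep pivot_wider and still avoid list-columns is to give it something that identifies each row. The sketch below is not part of the original answer: it adds a per-ticker row counter (a helper column named `row`, chosen here for illustration) to invented toy data, pivots, and then drops the counter. Shorter series are padded with NA at the bottom, and each ticker keeps its own Date column.

```r
library(dplyr)
library(tidyr)

# Invented toy data: GSPC has one observation fewer than FTSE
toy <- tibble(
  Date   = as.Date(c("2019-07-03", "2019-07-04", "2019-07-05",
                     "2019-07-03", "2019-07-05")),
  Price  = c(7609, 7604, 7553, 2996, 2990),
  Ticker = c("FTSE", "FTSE", "FTSE", "GSPC", "GSPC")
)

wide <- toy %>%
  group_by(Ticker) %>%
  mutate(row = row_number()) %>%   # 1, 2, 3, ... within each ticker
  ungroup() %>%
  pivot_wider(names_from = Ticker, values_from = c(Date, Price)) %>%
  select(-row)                     # the counter was only needed as an id
```

With the default settings the columns come out as Date_FTSE, Date_GSPC, Price_FTSE, Price_GSPC. Recent tidyr versions accept `names_vary = "slowest"` in pivot_wider to group the columns by ticker instead. Note that rows in this layout are aligned by position, not by calendar date.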

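A good way to solve this is to split the data for each stock/index into separate data frames, make sure they all have the same number of observations (filling missing dates with NA where necessary), and then bind them together column-wise. Here is how you can do it: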

library(yfR)
library(tidyverse)

# Define the symbols and date range
Symbols <- c("^GSPC", "^FTSE")
StartDate <- "2010-01-01"
EndDate <- "2019-12-31"

# Get the raw data from Yahoo Finance
RawData <- yf_get(Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily", do_complete_data = TRUE)

# Select and rename the columns of interest
Returns <- RawData %>% 
  select(ref_date, price_adjusted, ticker) %>% 
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)

# Split the data into separate data frames for each stock/index
list_of_dfs <- split(Returns, Returns$Ticker)

# Ensure each data frame has the same number of observations by filling missing dates with NA
# First, create a sequence of dates that covers the entire range for both stocks
# (taking min/max on the combined Date column keeps the Date class;
#  sapply() over the list would return plain numbers instead of Dates)
all_dates <- seq.Date(min(Returns$Date),
                      max(Returns$Date),
                      by = "day")

# Now, for each stock/index data frame, ensure it has a row for each date in all_dates
list_of_dfs <- lapply(list_of_dfs, function(df) {
  df <- df %>%
    full_join(data.frame(Date = all_dates), by = "Date") %>%
    arrange(Date)
  return(df)
})

# Now bind the separate data frames together column-wise
result <- NULL
for (i in seq_along(list_of_dfs)) {
  df <- list_of_dfs[[i]]
  # Create column names based on the stock/index ticker
  # (taken from the list names, because padded rows have Ticker = NA)
  colnames(df)[2] <- paste0("Price_", names(list_of_dfs)[i])
  colnames(df)[1] <- paste0("Date_", names(list_of_dfs)[i])
  df$Ticker <- NULL  # remove the Ticker column as it's no longer needed
  if (is.null(result)) {
    result <- df
  } else {
    result <- bind_cols(result, df)
  }
}

# View the resulting data frame
str(result)

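In this snippet:

We first use split to break Returns into a list of data frames, one per stock/index.
We then create a date sequence, all_dates, covering the full date range of both stocks/indices.
We use lapply to go over each data frame in list_of_dfs, and for each one we use full_join to make sure it has a row for every date in all_dates, filling in NA for the missing dates.
We then use a for loop over list_of_dfs and bind the data frames column-wise into the final result with bind_cols.
We also adjust the column names inside the loop to reflect the stock/index ticker, as you requested.

This gives you a data frame, result, with separate date and price columns for each stock/index and the same number of rows for each, with NA filled in for the missing dates. Because all_dates is built with by = "day", weekends and holidays also appear as rows with NA prices.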
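
The same split, pad, and bind steps can be tried without downloading anything. The following is a compact sketch on invented toy data, not a replacement for the code above: it uses purrr::imap so that each ticker name is available while renaming, and the names `toy`, `padded`, and `result_toy` are chosen here for illustration.

```r
library(dplyr)
library(purrr)

# Invented toy data: GSPC has no row for 2019-07-04
toy <- tibble(
  Date   = as.Date(c("2019-07-03", "2019-07-04", "2019-07-05",
                     "2019-07-03", "2019-07-05")),
  Price  = c(7609, 7604, 7553, 2996, 2990),
  Ticker = c("FTSE", "FTSE", "FTSE", "GSPC", "GSPC")
)

# Every calendar day between the first and last observation
all_dates <- tibble(Date = seq(min(toy$Date), max(toy$Date), by = "day"))

# One padded data frame per ticker, with the ticker appended to column names
padded <- imap(split(toy, toy$Ticker), function(df, ticker) {
  all_dates %>%
    left_join(select(df, Date, Price), by = "Date") %>%
    rename_with(~ paste0(.x, "_", ticker))
})

# All pieces have the same number of rows, so they can be bound side by side
result_toy <- bind_cols(unname(padded))
```

In `result_toy` the date columns of all tickers are identical by construction, which is the trade-off of this approach: the ticker-specific calendars are replaced by one padded calendar.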
