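Q:

I have a csv file in which one column contains a numpy array. When the csv file is read, the resulting column has character type, because the whole array is wrapped in a string. I would like to parse it into a separate data frame to analyze the data.

Input data

As csv: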

first_column,second_column
a,"[[1,2],[3,4]]"
b,"[[5,6],[7,8]]"
c,"[[9,10],[11,12]]"

As a data frame:

df <- data.frame(first_column  = c("a","b","c"),
                 second_column = c("[[1,2],[3,4]]","[[5,6],[7,8]]","[[9,10],[11,12]]"))

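What I have tried

Since I am not aware of any direct parsing function that extracts the array from the string, I started with string manipulation.

Removing the outer [] characters: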

> df %>% mutate(second_column = str_replace_all(second_column, c("^\\[" = "","]$" = "")))
  first_column  second_column
1            a    [1,2],[3,4]
2            b    [5,6],[7,8]
3            c [9,10],[11,12]

However, I do not know how to proceed from here.

Expected output

The final data frame should look like this:

  col_1 col_2
1     1     2
2     3     4
3     5     6
4     7     8
5     9    10
6    11    12

Note that the actual data frame has more columns and rows.

Tags: r, arrays, dataframe, csv, nested-lists

df <- data.frame(first_column  = c("a","b","c"),
                 second_column = c("[[1,2],[3,4]]","[[5,6],[7,8]]","[[9,10],[11,12]]"))

library(tidyverse)

df %>% 
  mutate(second_column = str_replace_all(second_column, c("^\\[" = "","]$" = "")),
         second_column = gsub("\\[|\\]", "", second_column)) %>% 
  separate(second_column, into = c("col_1", "col_2", "col_3", "col_4"), sep = ",") %>% 
  pivot_longer(-first_column) %>% 
  mutate(name = case_when(name == "col_3" ~ "col_1",
                          name == "col_4" ~ "col_2", 
                          .default = name)) %>% 
  select(-first_column) %>% 
  pivot_wider(names_from = name, values_from = value, values_fn = list) %>% 
  unnest(cols = c(col_1, col_2))
  
#> # A tibble: 6 × 2
#>   col_1 col_2
#>   <chr> <chr>
#> 1 1     2    
#> 2 3     4    
#> 3 5     6    
#> 4 7     8    
#> 5 9     10   
#> 6 11    12
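The pipeline above hard-codes four values per string. As a minimal sketch that does not depend on that count, and assuming tidyr 1.3.0 or later is installed (for separate_longer_delim and separate_wider_delim), each string can first be split into one row per inner array and then into columns. The names res, col_1 and col_2 are only illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- data.frame(first_column  = c("a", "b", "c"),
                 second_column = c("[[1,2],[3,4]]", "[[5,6],[7,8]]", "[[9,10],[11,12]]"))

res <- df %>%
  # strip the leading "[[" and the trailing "]]"
  mutate(second_column = str_remove_all(second_column, "^\\[\\[|\\]\\]$")) %>%
  # one row per inner array; the delimiter is matched as a fixed string
  separate_longer_delim(second_column, delim = "],[") %>%
  # one column per value (assumes every inner array has exactly two values)
  separate_wider_delim(second_column, delim = ",", names = c("col_1", "col_2")) %>%
  mutate(across(c(col_1, col_2), as.numeric))

res
```

first_column is kept here, so each parsed row can still be traced back to its source row; it can be dropped with select(-first_column) if it is not needed.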

1 upvote  Andre Wildberg  3/8/2023  #2

A base R approach that handles any number of rows in the given column.

setNames(
  data.frame(Vectorize(\(x) as.numeric(x))(
    data.frame(do.call(rbind, 
      sapply(lapply(strsplit(df$second_column, "\\],\\["), 
          gsub, pattern="\\[|\\]", replacement=""), strsplit, ","))))), 
  c("col_1", "col_2"))
  col_1 col_2
1     1     2
2     3     4
3     5     6
4     7     8
5     9    10
6    11    12
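The strings in this example happen to be valid JSON, so a JSON parser could stand in for the nested strsplit/gsub calls. This is a different technique from the answer above, sketched under the assumption that the jsonlite package is installed and that every cell is a well-formed array of equal-length numeric arrays. The names mats and res are only illustrative.

```r
library(jsonlite)

df <- data.frame(first_column  = c("a", "b", "c"),
                 second_column = c("[[1,2],[3,4]]", "[[5,6],[7,8]]", "[[9,10],[11,12]]"))

# fromJSON() simplifies an array of equal-length arrays to a matrix,
# with one matrix row per inner array
mats <- lapply(df$second_column, fromJSON)

# stack the per-row matrices and name the columns
res <- setNames(as.data.frame(do.call(rbind, mats)), c("col_1", "col_2"))

res
```

This would fail on cells that are not valid JSON (for example NumPy's space-separated print format), so it only applies when the file was written with comma-separated brackets as shown in the question.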

3 upvotes  G. Grothendieck  3/8/2023  #3

Replace each occurrence of ],[ with a newline, replace the square brackets with spaces, and use read.table to read the values.

df$second_column |>
  gsub("\\],\\[", "\n", x = _) |>
  chartr("[]", "  ", x = _) |>
  read.table(text = _, sep = ",")

giving:

  V1 V2
1  1  2
2  3  4
3  5  6
4  7  8
5  9 10
6 11 12

# Replace each of the characters [ ] , with a space
. <- gsub("[][,]", " ", df$second_column)
#. <- chartr("[],", "   ",  df$second_column) # Alternative

# Split at "   " (three spaces, left where "],[" was) and unlist the result
. <- unlist(strsplit(., "   ", fixed=TRUE))
#. <- sub("   ", "\n", ., fixed=TRUE) # Alternative

#use read.table to get columns
read.table(text = .)
#  V1 V2
#1  1  2
#2  3  4
#3  5  6
#4  7  8
#5  9 10
#6 11 12

Or using trimws:

. <- trimws(df$second_column, whitespace = "[][]")

. <- unlist(strsplit(., "],[", fixed=TRUE))
#. <- sub("],[", "\n", ., fixed=TRUE) # Alternative

# header = FALSE so that the first pair is read as data, not as column names
read.csv(text = ., header = FALSE)
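Since the real data frame has more columns, it may be useful to keep track of which source row each parsed pair came from. A minimal base R sketch of the same split-then-read idea, assuming every inner array has exactly two values; the names inner, parsed and result are only illustrative.

```r
df <- data.frame(first_column  = c("a", "b", "c"),
                 second_column = c("[[1,2],[3,4]]", "[[5,6],[7,8]]", "[[9,10],[11,12]]"))

# strip the outer "[[" and "]]", then split each string into its inner arrays
inner <- strsplit(gsub("^\\[\\[|\\]\\]$", "", df$second_column), "],[", fixed = TRUE)

# read all inner arrays as comma-separated rows
parsed <- read.table(text = unlist(inner), sep = ",", col.names = c("col_1", "col_2"))

# repeat each key once per inner array so every parsed row keeps its origin
result <- cbind(first_column = rep(df$first_column, lengths(inner)), parsed)

result
```

Because lengths(inner) is computed per source row, this also works when different rows hold a different number of inner arrays.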


How to parse array of arrays from string in column of csv/dataframe
