Error reading csv files larger than 5 GB in 'R' and duckdb

Asked by: Aaron Asked: 11/2/2023 Updated: 11/2/2023 Views: 73

Question:

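I am loading several datasets into duckdb, each larger than 5 GB, and I need a little help. I start R in the VS Code editor. After a few minutes, R stops and shows a message asking me to reload the window. I am left with an empty example.wal file, and the duckdb database file is only 12 kB. The dataset has 3 columns with a header row.

Thanks for your help.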

# Add libraries
library(duckdb)
library(dplyr)
library(DBI)

# write to disk as "Example", other defaults to in memory
con <- DBI::dbConnect(duckdb::duckdb(), "Example")
duckdb::duckdb_read_csv(
    conn = con, name = "Example_csv", files = "data/more/Example-2022.csv",
    header = TRUE, delim = ",", na.strings = "NA"
)

DBI::dbListTables(con)

When I use a smaller dataset, I get the following error message:

Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Error in file "example.csv", on line 3: expected 1 values per row, but got more. (  file=example.csv
  delimiter=','
  quote='"'
  escape='"' (default)
  header=1
  sample_size=20480
  ignore_errors=0
  all_varchar=0)
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
  line 5 appears to contain embedded nulls
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  embedded nul(s) found in input
7: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> Error: Invalid Input Error: Error in file "example.csv", on line 3: expected 1 values per row, but got more.

Some rows from the dataset:

DateTime,Beta,Alpha
01/02/2022 22:03:13.151,0.83987,0.84129
01/02/2022 22:05:03.942,0.83959,0.84143
01/02/2022 22:05:09.121,0.83982,0.84124
01/02/2022 22:05:09.286,0.83978,0.8412
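
Rows that look clean when pasted can still hold NUL bytes on disk, and the "embedded nul(s)" warnings in the error output suggest that may be the case here. One possible cause (an assumption, not something the question confirms) is a UTF-16 encoded file. The base R sketch below builds a small hypothetical demo file in UTF-16LE from the sample rows, inspects its first bytes, and then reads it with an explicit encoding. `demo_path`, `head_bytes`, `nul_count`, and `df` are illustrative names only.

```r
# Hypothetical demo file: the question's sample rows written as UTF-16LE,
# an encoding that puts a NUL byte after every ASCII character.
rows <- c(
  "DateTime,Beta,Alpha",
  "01/02/2022 22:03:13.151,0.83987,0.84129",
  "01/02/2022 22:05:03.942,0.83959,0.84143"
)
demo_path <- tempfile(fileext = ".csv")
text <- paste0(paste(rows, collapse = "\n"), "\n")
text_bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(text_bytes, demo_path)

# Read only the first bytes so the check stays cheap on multi-GB files.
head_bytes <- readBin(demo_path, what = "raw", n = 64)
print(head_bytes)

# Many 00 bytes between ordinary characters point to a 16-bit encoding.
nul_count <- sum(head_bytes == as.raw(0))
print(nul_count)

# Assumption: the file really is UTF-16LE. Base R can then decode it
# while reading; nrows keeps the read small.
df <- read.csv(demo_path, fileEncoding = "UTF-16LE", nrows = 10)
str(df)
```

On the real file, only the `readBin()` step is needed to see whether NUL bytes are present; the encoding passed to `read.csv()` would have to match whatever the bytes actually show.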
r duckdb

Comments

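0 votes r2evans 11/2/2023
The sample CSV does not trigger that error. Perhaps share the first 5 lines (or so) of the CSV? The error clearly says that line 3 is not what was expected...
0 votes Aaron 11/2/2023
Those are the first 5 lines of the csv file.
1 vote r2evans 11/2/2023
None of that changes your core problem: one of your CSVs is corrupt, invalid, or not really comma-delimited.
1 vote margusl 11/2/2023
Can you include those 5 lines as raw vectors? For example, the output of dput(readr::read_lines_raw("example.csv", n_max = 5))
1 vote r2evans 11/2/2023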
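If those 5 sample lines are saved to a new CSV and reading it in with duckdb::duckdb_read_csv(conn = duck, name = "Example_csv", files = "newfile.csv", header = TRUE, delim = ",", na.strings = "NA") produces the error, then I suggest first trying a quick read.csv("newfile.csv", nrows=10) to read the first 10 data rows and see whether it is arrow or something else. If read.csv does not error, try adding nrows=10 to the duckdb_read_csv expression (it should be passed through). If that fails, try reinstalling arrow (not R).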
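
The isolation test described in the last comment could be sketched roughly as follows. This is a minimal version under a few assumptions: the file goes to a temporary directory rather than the working directory, the duckdb step runs only if the package is installed, and an in-memory database is used. The `nrows=10` argument is left out of the `duckdb_read_csv` call because whether it is forwarded cleanly is not verified here.

```r
# Save the five sample lines from the question to a fresh file
# (named "newfile.csv" as in the comment, but placed in a temp directory).
sample_lines <- c(
  "DateTime,Beta,Alpha",
  "01/02/2022 22:03:13.151,0.83987,0.84129",
  "01/02/2022 22:05:03.942,0.83959,0.84143",
  "01/02/2022 22:05:09.121,0.83982,0.84124",
  "01/02/2022 22:05:09.286,0.83978,0.8412"
)
new_path <- file.path(tempdir(), "newfile.csv")
writeLines(sample_lines, new_path)

# Step 1: base R only, at most the first 10 data rows.
base_df <- read.csv(new_path, nrows = 10)
str(base_df)

# Step 2: the same file through duckdb, only when the package is available.
if (requireNamespace("duckdb", quietly = TRUE)) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
  duckdb::duckdb_read_csv(
    conn = con, name = "Example_csv", files = new_path,
    header = TRUE, delim = ",", na.strings = "NA"
  )
  print(DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM Example_csv"))
  # Explicit shutdown avoids the "Database is garbage-collected" warning.
  DBI::dbDisconnect(con, shutdown = TRUE)
}
```

If both steps succeed on the clean copy, the original file's bytes (encoding, delimiter, stray characters) become the more likely culprit than the R or duckdb installation.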

Answers: No answers yet