Asked by: GreenGodot  Asked: 7/27/2023  Last edited by: GreenGodot  Updated: 7/28/2023  Views: 239
Faster way to split a large CSV file evenly by Groups into smaller CSV files?
Q:
I'm sure there is a better way to do this, but I'm drawing a blank. I have a CSV file in this format, and the ID column is sorted so everything is at least grouped together:
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text3, CCCC
this is sample text4, DDDD
this is sample text4, DDDD
this is sample text5, EEEE
this is sample text5, EEEE
this is sample text6, FFFF
this is sample text6, FFFF
What I want to do is quickly split the CSV into X number of smaller CSV files. So if X == 3, then AAAA would go into "1.csv", BBBB would go into "2.csv", CCCC would go into "3.csv", and the next group would loop back around and go into "1.csv".
The groups vary in size, so a hard-coded split by row count won't work here.
Is there a faster way to split them reliably than my current method, which just uses a Pandas groupby in Python to write them out?
file_ = 0
num_files = 3

for name, group in df.groupby(by=['ID'], sort=False):
    file_ += 1
    group['File Num'] = file_
    group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')
    if file_ == num_files:
        file_ = 0
This is a Python-based solution, but I'm open to using awk or bash if it gets the job done.
Edit:
To clarify, I want the groups split across a fixed number of files that I can set.
In this example, 3 (so X = 3). The first group (AAAA) goes into 1.csv, the second group into 2.csv, the third group into 3.csv, and then for the fourth group it loops back around and inserts it into 1.csv, and so on.
Example output 1.csv:
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD
Example output 2.csv:
Text ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE
Example output 3.csv:
Text ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF
A:
With your shown samples, please try the following code. As mentioned, this assumes the last column is sorted, as per the shown samples.
awk -v x="3" '
BEGIN{
count=1
outFile=count".csv"
}
FNR==1{
print
next
}
prev!=$NF && prev{
close(outFile)
count++
outFile=count".csv"
}
{
print >> (outFile)
prev=$NF
}
x==count{ count=1 }
' Input_file
You can use the following awk solution:
awk -v X=3 '
FNR == 1 {                        # save 1st record as header
    hdr = $0
    next
}
p != $NF {                        # ID field changes, move to new output csv file
    close(fn)
    fn = ((n++ % X) + 1)".csv"    # construct new file name
}
!seen[fn]++ {                     # do we need to print header
    print hdr > fn
}
{
    print >> fn                   # append each record to output
    p = $NF                       # save last field in variable p
}' file
Using groupby and factorize with a modulo (% N):
N = 3

for i, g in df.groupby(pd.factorize(df['ID'])[0]%N):
    g.to_csv(f'chunk{i+1}.csv', index=False)
Output files:
# chunk1.csv
Text,ID
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text,AAAA
this is sample text4,DDDD
this is sample text4,DDDD
# chunk2.csv
Text,ID
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text2,BBBB
this is sample text5,EEEE
this is sample text5,EEEE
# chunk3.csv
Text,ID
this is sample text3,CCCC
this is sample text6,FFFF
this is sample text6,FFFF
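For reference, here is a minimal sketch (using a miniature of the sample data, one row per ID is enough) of what the factorize-plus-modulo step computes; groupby then splits on this bucket number:

import pandas as pd

# Hypothetical miniature of the question's data
df = pd.DataFrame({'Text': ['t1', 't2', 't3', 't4', 't5', 't6'],
                   'ID': ['AAAA', 'BBBB', 'CCCC', 'DDDD', 'EEEE', 'FFFF']})

codes = pd.factorize(df['ID'])[0]   # [0 1 2 3 4 5]: one integer per distinct ID, in order of appearance
buckets = codes % 3                 # [0 1 2 0 1 2]: cycles through the 3 output files
print(buckets)

So AAAA and DDDD land in the same chunk, matching the files shown above.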
Timing
Tested on 14 million rows:
15.8 s ± 687 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
of which ~14 s is due to I/O.
Comparison with the other answers (using time in the shell):
# @mozway as a python script including imports and reading the file
real 0m20,834s
# @RavinderSingh13
real 1m22,952s
# @anubhava
real 1m23,790s
# @Ed Morton (updated code, original solution was 2m58,171s)
real 0m8,599s
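The 14-million-row input itself is not part of the question, so the following is only a hypothetical way to build a similarly shaped test file before running each script under time; the ID scheme and group sizes are assumptions, not the data behind the numbers above:

import csv
import itertools
import string

TARGET_ROWS = 14_000_000

# Hypothetical generator: pre-sorted 4-letter IDs with varying group sizes,
# mimicking the layout of the question's sample.
with open('my_file.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Text', 'ID'])
    rows = 0
    ids = (''.join(t) for t in itertools.product(string.ascii_uppercase, repeat=4))
    for i, id_ in enumerate(ids):
        for _ in range((i % 100) + 1):      # group sizes vary between 1 and 100 rows
            w.writerow(['this is sample text', id_])
            rows += 1
        if rows >= TARGET_ROWS:
            break

Each candidate can then be timed with, for example, time python mozway.py or time awk -v X=3 -f tst.awk my_file.csv.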
As a function:
import pandas as pd

def split_csv(filename, N=3, id_col='ID', out_basename='chunk'):
    df = pd.read_csv(filename)
    for i, g in df.groupby(pd.factorize(df[id_col])[0]%N):
        g.to_csv(f'{out_basename}{i+1}.csv', index=False)

split_csv('my_file.csv', N=3)
Comments
Saving this as mozway.py and running python mozway.py fails with:
  File "mozway.py", line 6
    g.to_csv(f'{out_basename}{i+1}.csv', index=False)
SyntaxError: invalid syntax
That was under Python 2.7.15; the f-string syntax requires Python 3.6 or newer.
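An equivalent of that line without f-strings (which also works on Python 2.7) would be, for example:

g.to_csv('{}{}.csv'.format(out_basename, i + 1), index=False)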
Using any awk in any shell on every Unix box:
$ cat tst.awk
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print > out }
$ awk -v X=3 -f tst.awk input.csv
$ head [0-9]*.csv
==> 1.csv <==
Text ID
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text, AAAA
this is sample text4, DDDD
this is sample text4, DDDD
==> 2.csv <==
Text ID
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text2, BBBB
this is sample text5, EEEE
this is sample text5, EEEE
==> 3.csv <==
Text ID
this is sample text3, CCCC
this is sample text6, FFFF
this is sample text6, FFFF
If X is large enough that you exceed your system's limit on concurrently open files and you start getting a "too many open files" error, then you need to use GNU awk, since it handles that internally, or change the code to only have 1 output file open at a time:
NR==1 {
    hdr = $0
    next
}
$NF != prev {
    close(out)
    out = (((blockCnt++) % X) + 1) ".csv"
    if ( blockCnt <= X ) {
        print hdr > out
    }
    prev = $NF
}
{ print >> out }
Or implement your own way of managing how many files are open at the same time.
Edit: here is the suggestion @PaulHodges made in the comments, which would result in a script like this:
NR == 1 {
    for ( i=1; i <= X; i++ ) {
        print > (i ".csv")
    }
    next
}
$NF != prev {
    out = (((NR-1) % X) + 1) ".csv"
    prev = $NF
}
{ print > out }
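Not part of the awk answer itself, but as a rough cross-language illustration, here is the same "only one output file open at a time" idea sketched in Python, assuming the comma-separated layout from the question (output files are appended to, as with >> above, so they should not already exist):

import csv

X = 3                   # number of output files
have_header = set()     # output files that already got the header row

with open('input.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    prev = None
    block_cnt = 0
    out = None
    writer = None
    for row in reader:
        if row[-1] != prev:                        # ID changed: a new block starts
            if out is not None:
                out.close()                        # keep only one output file open at a time
            name = f'{(block_cnt % X) + 1}.csv'    # round-robin: 1.csv, 2.csv, 3.csv, 1.csv, ...
            out = open(name, 'a', newline='')
            writer = csv.writer(out)
            if name not in have_header:            # write the header once per file
                writer.writerow(header)
                have_header.add(name)
            block_cnt += 1
            prev = row[-1]
        writer.writerow(row)
    if out is not None:
        out.close()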
Here

group.to_csv(str(file_) + '.csv', index=False, header=False, mode='a')

you provide a string as the first argument, but the to_csv method also accepts a file-like object as the first argument, in which case you can avoid repeating the file-opening-related work many times; consider the following simple comparison:
import os
import time
import pandas as pd

REPEAT = 1000
df = pd.DataFrame({'col1': range(100)})

t1 = time.time()
for _ in range(REPEAT):
    df.to_csv('file.csv', index=False, header=False, mode='a')
t2 = time.time()

os.remove('file.csv')

t3 = time.time()
with open('file.csv', 'a') as f:
    for _ in range(REPEAT):
        df.to_csv(f, index=False, header=False)
t4 = time.time()

print('Using filename', t2 - t1)
print('Using filehandle', t4 - t3)
which gives the output
Using filename 0.35850977897644043
Using filehandle 0.2669696807861328
Note that the second way takes roughly 75% of the time of the first way, so while it is faster, it is still the same order of magnitude.
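Applied to the loop from the question, that would mean opening the X output files once up front and handing the handles to to_csv; a minimal sketch under that assumption (input file name and column names taken from the question):

import pandas as pd

num_files = 3
df = pd.read_csv('my_file.csv')   # input file name assumed

# Open each output file once and reuse the handles instead of re-opening
# "<n>.csv" in append mode for every group.
handles = [open(f'{i + 1}.csv', 'w', newline='') for i in range(num_files)]
try:
    for i, (name, group) in enumerate(df.groupby('ID', sort=False)):
        group.to_csv(handles[i % num_files], index=False, header=False)
finally:
    for h in handles:
        h.close()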