Asked by: greg  Asked: 11/17/2023  Updated: 11/18/2023  Views: 53
Python pandas read large csv with chunk
Q:
I am trying to optimize my code for reading a large CSV file.
I have seen on several sites that pandas can read in chunks with "chunksize".
I read the csv file with this code:
data = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5], header=None, low_memory=False)
for _, df in data.groupby(data[0].eq("No.").cumsum()):
    dfs = []
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
    dfs.append(df.rename_axis(columns=None))
    date_pattern = '%Y/%m/%d %H:%M'
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
    for each_column in list(df.columns)[2:-1]:
        # other lines using "each_column" ...
        ...
I tried the same code with chunksize, but I get an error.
data = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5], header=None, low_memory=False, chunksize=1000)
for _, df in data.groupby(data[0].eq("No.").cumsum()):
    dfs = []
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
    dfs.append(df.rename_axis(columns=None))
    date_pattern = '%Y/%m/%d %H:%M'
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
    for each_column in list(df.columns)[2:-1]:
        # other lines using "each_column"
Error:
on 0: Process Process-22:9:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/opt/import2grafana/libs/parsecsv.py", line 236, in readcsv_multi_thread
for _, df in data.groupby(data[0].eq("No.").cumsum()):
AttributeError: 'TextFileReader' object has no attribute 'groupby'
Is it possible to use chunksize together with groupby?
Thanks a lot for your help.
A:
1 upvote
Zihao
11/17/2023
#1
With chunksize, read_csv returns a TextFileReader, so iterate over it:
for df in data:
and apply DataFrame methods to each df, for example:
data = pd.read_csv(zf.open(f), skiprows=[0,1,2,3,4,5], header=None, low_memory=False, chunksize=1000)
for df in data:
    groups = df.groupby(df[0].eq("No.").cumsum())
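A minimal sketch of the difference, using a tiny in-memory CSV invented for illustration (io.StringIO stands in for the real zipped file):

import io
import pandas as pd

csv_text = "No.,x\n1,10\n2,20\n"

# without chunksize: a DataFrame, which has .groupby()
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(type(df).__name__)       # DataFrame

# with chunksize: a TextFileReader, which has no .groupby();
# iterating it yields one DataFrame per chunk
reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2)
print(type(reader).__name__)   # TextFileReader
for chunk in reader:
    groups = chunk.groupby(chunk[0].eq("No.").cumsum())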
0 upvotes
Corralien
11/17/2023
#2
You can't use groupby on data. Try something like this:
dfs = []
for chunk in data:
    for _, subdf in chunk.groupby(chunk[0].eq("No.").cumsum()):
        # do stuff here
        dfs.append(subdf)
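As a rough, self-contained illustration of this collect-per-chunk pattern (the sample data below is invented, and io.StringIO replaces the real zf.open(f)):

import io
import pandas as pd

# invented sample: two "No." blocks, read in chunks of 3 rows
csv_text = "No.,time\n1,2023/11/17 10:00\n2,2023/11/17 10:01\nNo.,time\n3,2023/11/18 09:00\n"

data = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=3)

dfs = []
for chunk in data:
    # every row whose first cell is "No." starts a new block
    for _, subdf in chunk.groupby(chunk[0].eq("No.").cumsum()):
        dfs.append(subdf)

print(len(dfs))   # number of pieces collected

Note that the cumsum counter restarts in every chunk, so a block that straddles a chunk boundary ends up as two separate entries in dfs; whether that matters depends on how the pieces are used afterwards.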
0 upvotes
greg
11/18/2023
#3
I found the right way to do it:
chunk_size = 100000000000  # Set your desired chunk size here
data_reader = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None, low_memory=False, chunksize=chunk_size)
for data_chunk in data_reader:
    for _, df in data_chunk.groupby(data_chunk[0].eq("No.").cumsum()):
        dfs = []
        df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
        dfs.append(df.rename_axis(columns=None))
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
        for each_column in list(df.columns)[2:-1]:
            # some stuff lines
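As an aside on the epoch column in the snippet above: the per-row strptime/mktime conversion can also be written in vectorized form with pd.to_datetime. A sketch, assuming the time column holds strings like "2023/11/17 10:00":

import pandas as pd

df = pd.DataFrame({'time': ['2023/11/17 10:00', '2023/11/18 09:30']})  # made-up sample

# vectorized parse of the whole column at once
ts = pd.to_datetime(df['time'], format='%Y/%m/%d %H:%M')

# seconds since 1970-01-01 for naive timestamps; note that time.mktime in the
# snippet above interprets the strings as local time, so the two can differ
# by the local UTC offset
df['epoch'] = (ts - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)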
Thanks for your help.
Comments
0 upvotes
greg
11/19/2023
The run time did not change, so this approach brings no benefit. I don't understand what I can optimize.
Tags: chunksize, pandas.read_csv, TextFileReader