Asked by: Sumanta  Asked: 8/11/2023  Last edited by: SrinivasSumanta  Updated: 8/11/2023  Views: 38
Reading .csv data with an inconsistent pattern
Q:
I have a very large CSV file. I want to read it with PySpark, but I cannot get it to parse correctly.
A sample of the CSV:
"keyvalue","rto","state","maker_model","veh_type","veh_class"
"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"
"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO
", "Private Vehicle", "Car"
"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"
I want to read it like this:
+---------------+-----+---------+--------------+------------------+---------+
|       keyvalue|  rto|    state|   maker_model|          veh_type|veh_class|
+---------------+-----+---------+--------------+------------------+---------+
|hnjsnjncjssssmj| "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|hnjsnjncjssssjj| "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|hnjsnjncjssssmm| "GO"|    "GOA"| "TATA MOTORS"| "Private Vehicle"|    "Bus"|
But PySpark does not parse the second data row correctly; it splits it into two rows:
+--------------------+------+---------+--------------+------------------+---------+
|            keyvalue|   rto|    state|   maker_model|          veh_type|veh_class|
+--------------------+------+---------+--------------+------------------+---------+
|     hnjsnjncjssssmj|  "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|     hnjsnjncjssssjj|  "OD"| "ODISHA"|   "BAJAJ AUTO|              null|     null|
|", "Private Vehicle"| "Car"|     null|          null|              null|     null|
|     hnjsnjncjssssmm|  "GO"|    "GOA"| "TATA MOTORS"| "Private Vehicle"|    "Bus"|
+--------------------+------+---------+--------------+------------------+---------+
I have tried various options in Spark's CSV reader, but nothing has worked so far. Can anyone guide me?
A:
0 votes
I. Rawlinson
8/11/2023
#1
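If the file is not too large, you can use a regular expression to fix the stray line breaks and then read the fixed file with Spark.
import re

# Matches a newline whose preceding character is not a double quote,
# i.e. a line break that falls inside a quoted field
pattern = r'(?<=[^"])\n'

with open('data.csv', 'r') as data_file:
    content = data_file.read()

# Remove those line breaks so that each record sits on a single line
fixed = re.sub(pattern, '', content)

with open('fixed.csv', 'w') as out_file:
    out_file.write(fixed)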
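As a quick sanity check, the same pattern can be tried on the sample from the question held in a string. This is only a sketch: the `sample` text and the `fix_broken_lines` helper are names introduced here for illustration. Note that the pattern joins any line break not preceded by a double quote, so it assumes every complete record ends with a closing quote.

```python
import re

# Same pattern as above: a newline preceded by any character other than "
PATTERN = re.compile(r'(?<=[^"])\n')


def fix_broken_lines(text: str) -> str:
    """Join lines that were split inside a quoted field (hypothetical helper)."""
    return PATTERN.sub('', text)


# The sample rows from the question, including the record broken across two lines
sample = (
    '"keyvalue","rto","state","maker_model","veh_type","veh_class"\n'
    '"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"\n'
    '"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO\n'
    '", "Private Vehicle", "Car"\n'
    '"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"\n'
)

fixed = fix_broken_lines(sample)
fixed_lines = fixed.splitlines()

print(len(fixed_lines))  # header plus three records
print(fixed_lines[2])    # the previously broken record, now on one line
```

Unlike the multi-line read, this approach also drops the embedded newline from the field value itself.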
1 vote
Shubham Sharma
8/11/2023
#2
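Spark provides some useful options for reading CSV files. In your case, we can use:
df = (
    spark.read
    .option('header', True)
    .option('multiline', True)  # let a quoted field span several lines
    .option("ignoreLeadingWhiteSpace", True)  # skip the space that follows each comma
    .csv('data.csv')
)
df.show()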
+---------------+---+------+------------+---------------+---------+
|       keyvalue|rto| state| maker_model|       veh_type|veh_class|
+---------------+---+------+------------+---------------+---------+
|hnjsnjncjssssmj| OD|ODISHA|  BAJAJ AUTO|Private Vehicle|      Car|
|hnjsnjncjssssjj| OD|ODISHA|BAJAJ AUTO\n|Private Vehicle|      Car|
|hnjsnjncjssssmm| GO|   GOA| TATA MOTORS|Private Vehicle|      Bus|
+---------------+---+------+------------+---------------+---------+
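To see roughly what the two options are doing, here is an analogue using Python's standard `csv` module instead of Spark's parser. That substitution is an assumption made for illustration, and the two parsers are not identical. `skipinitialspace=True` plays the role of `ignoreLeadingWhiteSpace`: once the space after each comma is skipped, the following quote is recognised as opening a field, and a quoted field is then allowed to contain a line break.

```python
import csv
import io

# The sample rows from the question, including the record broken across two lines
sample = (
    '"keyvalue","rto","state","maker_model","veh_type","veh_class"\n'
    '"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"\n'
    '"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO\n'
    '", "Private Vehicle", "Car"\n'
    '"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"\n'
)

# skipinitialspace=True ignores whitespace right after the delimiter,
# so the quote that follows it opens a quoted field
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
rows = list(reader)

print(len(rows))         # header plus three records
print(repr(rows[2][3]))  # the newline survives inside the field value
```

As in the Spark output above, the record stays whole but the newline is still part of the `maker_model` value.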
Comments
0 votes
Sumanta
8/12/2023
This works, thanks. What should I do if I want to remove the \n from my df?