Asked by: Rituraj Golawar Asked: 10/30/2019 Updated: 10/31/2019 Views: 686
Using SED or AWK to remove all quotes in a specific CSV column
Q:
I have a file with a bunch of CSV lines, with quoted and unquoted values, like this:
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
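The fifth column is a text column containing escaped or unescaped double quotes. I am trying to remove all quotes inside this column, so that it looks like this: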
"123","456",,17,"hello, how are you this, fine, highly caffeinated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
Any ideas on how to achieve this using SED or AWK or any other unix tool? Much appreciated!
A:
Try this regex:
,\d{2}\,(.*),\"\S{25}\",\"\w{3}"
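It is built from your example. The goal is just to capture the fifth column. As @Jerry Jeremiah suggested, the key is to anchor on the date, which will always be 25 characters long. To prevent some mismatches, I also take into account the 2 digits before the fifth column and the 3 letters/digits after the date. (regex101 v1)
We can also use a "stronger" regex by looking for an exact date match: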
,\d{2}\,(.*),\"\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}-\d{2}:\d{2}\",\"\w{3}"
With these regexes you can extract the fifth column through a capture group. To go further into your problem, you can do this in bash:
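# Groups: 1 = everything up to the fifth column's opening quote,
#         2 = fifth column content, 3 = closing quote through end of line.
regex='^(.*,[0-9]{2},")(.*)(","[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}","[a-zA-Z]{3}".*$)'
while IFS= read -r line
do
    if [[ $line =~ $regex ]]
    then
        before=${BASH_REMATCH[1]}
        fifth=${BASH_REMATCH[2]}
        after=${BASH_REMATCH[3]}
        # Delete every double quote inside the fifth column.
        reworked_fifth=${fifth//\"/}
        # Quote the expansion so whitespace and glob characters survive intact.
        printf '%s\n' "${before}${reworked_fifth}${after}"
    else
        echo "Line didn't match the regex" >&2
    fi
done < /my/file/path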
I had to change the regex because my bash did not accept \d and \w. There is no need for sed or awk here; Bash can handle it alone.
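With awk, you can do something like this and avoid a very complex regex. Only the fifth column is broken, the columns before it contain no commas, and we know there is a fixed number of columns, which makes it easy to fix:
Edited to use gsub, as suggested by Ed Morton
Edited for portability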
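awk '
BEGIN { FS = OFS = "," }
{
    # Glue the pieces of the broken fifth column back together.
    for (i = 6; i <= NF - 3; i++) {
        $5 = $5 FS $i
    }
    # Remove every double quote from the rebuilt column
    # (gsub takes at most three arguments: regex, replacement, target).
    gsub(/"/, "", $5)
    print $1, $2, $3, $4, "\"" $5 "\"", $(NF-2), $(NF-1), $NF
}
' file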
Output:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
If you want to escape the quotes instead, you can use this:
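awk '
BEGIN { FS = OFS = "," }
{
    # Glue the pieces of the broken fifth column back together.
    for (i = 6; i <= NF - 3; i++) {
        $5 = $5 FS $i
    }
    # Drop the enclosing quotes, backslash-escape the inner ones,
    # then put the enclosing quotes back.
    gsub(/^"|"$/, "", $5)
    gsub(/"/, "\\\"", $5)
    $5 = "\"" $5 "\""
    print $1, $2, $3, $4, $5, $(NF-2), $(NF-1), $NF
}
' file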
Output:
"123","456",,17,"hello,\" how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a \"meeting\", unprepared while trying to be \"awake","2018-05-29T18:58:10-05:00","ACD",
Your question is hard to answer in a general way. Take this example:
"a","b","c","d"
How should this be interpreted once the quotes are removed from the field of interest? Any of the following is possible:
"a","b","c","d" (4 fields)
"a,b","c","d" (3 fields, $1 messed up)
"a","b,c","d" (3 fields, $2 messed up)
"a","b","c,d" (3 fields, $3 messed up)
"a,b,c","d" (2 fields, $1 messed up)
"a,b","c,d" (2 fields, $1 and $2 messed up)
"a","b,c,d" (2 fields, $2 messed up)
"a,b,c,d" (1 field , $1 messed up)
The only way to solve this is to know the following:
- how many fields the CSV has
- that at most one field is messed up
- which field is messed up
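As a quick sanity check before any repair, a naive comma split can show which lines are affected once the expected field count is known. This is only a sketch: `report_bad_lines` is a hypothetical helper name, not part of the answer, and it assumes that well-formed lines contain no commas inside their fields.

```shell
# Hypothetical helper (not from the answer): list lines whose naive
# comma-split field count differs from the expected count (default 8).
report_bad_lines() {
    awk -F, -v expected="${1:-8}" 'NF != expected { print NR ": " NF " fields" }'
}

# The second sample line from the question splits into 9 pieces instead of 8.
printf '%s\n' '"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",' |
    report_bad_lines 8
```

Lines that are reported need fixing; lines that are not reported already split into the expected number of fields.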
The following awk program will help you fix it:
$ awk 'BEGIN{ere="[^,]*|\042[^\042]*\042"}   # one field: unquoted, or fully quoted
       { head=tail=""; mid=$0 }
       # extract the head which is correct
       (n>1) {
         ere_h="^"
         for(i=1;i<n;++i) ere_h = ere_h (ere_h=="^" ? "" : ",") "(" ere ")"
         match(mid,ere_h); head=substr(mid,RSTART,RLENGTH)
         mid = substr(mid,RLENGTH+1)
       }
       # extract the tail which is correct
       (nf>n) {
         ere_t="$"
         for(i=n+1;i<=nf;++i) ere_t = "(" ere ")" (ere_t=="$" ? "" : ",") ere_t
         match(mid,ere_t); tail=substr(mid,RSTART,RLENGTH)
         mid = substr(mid,1,RSTART-1)
       }
       # correct the mid part
       { gsub(/\042/,"",mid)
         mid = (mid ~ /^,/) ? ( ",\042" substr(mid,2) ) : ( "\042" mid )
         mid = (mid ~ /,$/) ? ( substr(mid,1,length(mid)-1) "\042," ) : (mid "\042" )
       }
       # print the stuff
       { print head mid tail }' n=5 nf=8 file
Here n is the number of the messed-up field and nf is the total number of fields. The sample lines end with a comma, so the empty last field counts and nf is 8.
Using GNU awk for the 3rd arg to match(), and assuming you know how many fields there should be in each line:
$ cat tst.awk
BEGIN {
numFlds = 8
badFldNr = 5
}
match($0,"^(([^,]*,){"badFldNr-1"})(.*)((,[^,]*){"numFlds-badFldNr"})",a) {
gsub(/"/,"",a[3])
print a[1] "\"" a[3] "\"" a[4]
}
$ awk -f tst.awk file
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
With other awks you can do the same thing using a couple of calls to match() and variables instead of the array.
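A minimal sketch of that variant for a POSIX awk might look like the following. The wrapper name `fix_bad_field` and its two positional parameters are assumptions made for illustration. The regexps are built with loops so that no `{n}` interval support is required, and, as in the GNU awk version, every field other than the bad one is assumed to contain no commas.

```shell
# Hypothetical wrapper (not from the answer): strip double quotes inside one
# known-bad field. $1 = total number of fields, $2 = number of the bad field.
fix_bad_field() {
    awk -v numFlds="${1:-8}" -v badFldNr="${2:-5}" '
    BEGIN {
        # Head: the badFldNr-1 leading fields, each followed by a comma.
        headRe = "^"
        for (i = 1; i < badFldNr; i++) headRe = headRe "[^,]*,"
        # Tail: the numFlds-badFldNr trailing fields, each preceded by a comma.
        tailRe = ""
        for (i = badFldNr; i < numFlds; i++) tailRe = tailRe ",[^,]*"
        tailRe = tailRe "$"
    }
    match($0, headRe) {
        head = substr($0, 1, RLENGTH)
        rest = substr($0, RLENGTH + 1)
        if (match(rest, tailRe)) {
            tail = substr(rest, RSTART)
            mid  = substr(rest, 1, RSTART - 1)
            gsub(/"/, "", mid)               # drop every quote in the bad field
            print head "\"" mid "\"" tail    # re-wrap it in one pair of quotes
        }
    }'
}

# Usage with the first sample line from the question (8 fields, 5th is bad).
printf '%s\n' '"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",' |
    fix_bad_field 8 5
```

Lines that do not match the expected layout are silently skipped in this sketch; printing them unchanged or reporting them on stderr would be an equally reasonable choice.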