Asked by: Ivan | Asked: 6/2/2023 | Last edited by: Mark Rotteveel, Ivan | Updated: 6/11/2023 | Views: 69
Bash script to count characters in a newspaper text file and create CSV output
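Q:
I have a text file of about 80K lines and 5.6 MB that consists of several newspaper articles, about 900 of them, of varying sizes. The file is encoded in UTF-8 and the language is French (accented characters).
I would like to count the number of characters in each article and store the result in a CSV file formatted as follows: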
article1;numchar1
article2;numchar2
...
By characters I mean letters, digits, punctuation and spaces. I do not want to count \n (the file was generated on a Linux machine).
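Because the text is French and UTF-8 encoded, characters and bytes are not the same thing: an accented letter such as é occupies two bytes. The following is a minimal sketch of the difference, assuming the standard wc and tr utilities are available; count_bytes and count_chars are hypothetical helper names, not something from the question.

```bash
#!/usr/bin/env bash
# Sketch: characters vs. bytes for accented text.
# count_bytes and count_chars are hypothetical helper names.

# "été" written as explicit UTF-8 bytes, so the example does not depend
# on the encoding of this script file.
word=$'\xc3\xa9t\xc3\xa9'

# Bytes: wc -c counts bytes in every locale. printf '%s' adds no trailing
# newline, so no \n is counted. tr strips the padding some wc versions add.
count_bytes() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Characters: wc -m follows the current locale, so the result is the real
# character count only when that locale is UTF-8 (e.g. fr_FR.UTF-8).
count_chars() {
    printf '%s' "$1" | wc -m | tr -d ' '
}

count_bytes "$word"   # prints 5
count_chars "$word"   # 3 in a UTF-8 locale
```

The same locale dependence applies to ${#line} in bash and to length() in awk, so it may be worth checking the output of the locale command before trusting the counts.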
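A new article starts with a line containing the newspaper's name (say news01, news02, etc.). I can list them (there are about 15 different ones). I want to ignore that line plus the 1 or 2 lines that follow it (a fixed number for each newspaper).
There are also some disclaimer lines that I do not want included in the count. The disclaimer is 2 or 3 lines long (always the same for a given newspaper).
I am thinking of building an array containing the newspaper name, the number of header lines and the number of disclaimer lines. Like this: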
declare -A INFOS
INFOS[news01N]="Newspaper 1 name"
INFOS[news01H]=1
INFOS[news01D]=2
INFOS[news02N]="Newspaper 2 name"
INFOS[news02H]=2
INFOS[news02D]=1
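where newsXX is the newspaper's ID, newsXXN is its name, newsXXH is the number of header lines and newsXXD is the number of disclaimer lines. These lengths are fixed for each newspaper, so that should be fine.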
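As an illustration of how such a flat key scheme could be read back, here is a small sketch. It assumes bash 4 or later (associative arrays); paper_info is a hypothetical helper name, and the values are the ones from the array above.

```bash
#!/usr/bin/env bash
# Sketch: look up per-newspaper settings stored in one associative array.
# The key scheme (ID plus an N/H/D suffix) follows the question; the
# helper name paper_info is hypothetical.
declare -A INFOS=(
    [news01N]="Newspaper 1 name" [news01H]=1 [news01D]=2
    [news02N]="Newspaper 2 name" [news02H]=2 [news02D]=1
)

# Print "name;header_lines;disclaimer_lines" for a newspaper ID,
# or return 1 if the ID is not in the array.
paper_info() {
    local id=$1
    [[ -n ${INFOS[${id}N]+set} ]] || return 1
    printf '%s;%s;%s\n' "${INFOS[${id}N]}" "${INFOS[${id}H]}" "${INFOS[${id}D]}"
}

paper_info news02   # prints: Newspaper 2 name;2;1
```

Testing for the N key is also a cheap way to decide whether a line of the file is a known newspaper ID.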
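The remaining text (i.e. between the newspaper name and the next disclaimer) is considered to be the article. This should be accurate enough for my research.
Here is a sample file: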
news01
this line to be ignored too
this is the
first article
of the file
disclaimer to
be ignored
news02
these lines to be
ignored too
this is the
second article
of the file
disclaimer to be ignored
This should output:
news01;37
news02;38
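Some lines may start or end with a random number of spaces or tabs. I would like each such run reduced to at most one space character (see above).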
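One possible reading of that rule, sketched with sed: every run of spaces or tabs, wherever it occurs in the line, becomes a single space, and the characters that remain are counted. Whether a leading space should be kept or dropped is not stated here, so this sketch keeps it; squeeze_ws and count_line are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch: collapse each run of spaces/tabs into a single space, then count
# the characters left on the line (the newline itself is not counted).
# squeeze_ws and count_line are hypothetical helper names.
squeeze_ws() {
    sed -E 's/[[:space:]]+/ /g'
}

count_line() {
    local squeezed
    # Command substitution strips the trailing newline but keeps spaces.
    squeezed=$(printf '%s\n' "$1" | squeeze_ws)
    printf '%s\n' "${#squeezed}"
}

count_line $'this is the \t  '   # prints 12
```

With this reading, "this is the" followed by any amount of trailing whitespace always counts as 12 characters.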
Could I ask for your help implementing a bash script to do the counting?
I am really new to bash scripting, and this is what I have so far:
#! /bin/bash
while IFS= read -r line; do
if [[ "$line" != *"news01"* ]]; then
echo ${#line};
fi
done < $1
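This was inspired by a post I saw here. It is still far from my goal, but it is all I have as an MWE.
A:
Whenever you need to do complicated (read: anything non-trivial) text processing, sed and awk are your best friends.
Below is an awk script that works for your example. It ignores all blank lines, so when you specify the number of lines to skip, it assumes that the header starts with a non-blank line and the disclaimer ends with a non-blank line.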
#!/usr/bin/awk -f
# Run with:
#   ./newscount.awk DATA_FILE
# Debug output:
#   awk -v debug=1 -f newscount.awk DATA_FILE
BEGIN {
    # Initialization.
    name[1] = "news01"
    hdrs[name[1]] = 1  # Header lines to skip.
    disc[name[1]] = 2  # Disclaimer lines to skip.
    chrs[name[1]] = 0  # Chars in articles from this newspaper.
    name[2] = "news02"
    hdrs[name[2]] = 2
    disc[name[2]] = 1
    chrs[name[2]] = 0
    paper = ""         # Current newspaper name.
}
# Ignore lines that are blank or contain only whitespace.
# I.e. zero or more (`*`) whitespace chars from the start (`^`) to the end (`$`).
/^[[:space:]]*$/ { next }
# Start of a new article: the entire line (`$0`) is one of
# the keys in the `hdrs[]` array. (We could have used any of
# the arrays; the choice is arbitrary.)
$0 in hdrs {
    update_chars(paper)
    paper = $0     # Current newspaper name.
    line = 0       # Line number in current article.
    delete counts  # Array: `counts[line]` contains # of chars in line `line`.
    if (debug) {
        printf "Article for %s\n", paper
    }
    next
}
# Save the character count for this line.
paper != "" {
    # Number of chars in this line, with whitespace compressed.
    # Replace every run of one or more (`+`) whitespace chars
    # with a single space (`" "`).
    text = $0
    gsub(/[[:space:]]+/, " ", text)
    # Remove the leading space, if any. A single trailing space
    # is kept and counted.
    sub(/^ /, "", text)
    counts[line] = length(text)
    if (debug) {
        printf "%s %d: %d <%s>\n", paper, line, counts[line], text
    }
    line++
}
END {
    # When we reach the end of the file, add the chars from the last article.
    update_chars(paper)
    # Print the totals in the order the newspapers were declared;
    # `for (p in chrs)` would visit them in an unspecified order.
    for (n = 1; (n in name); n++) {
        printf "%s;%d\n", name[n], chrs[name[n]]
    }
}
# Add the number of chars in this article, ignoring headers and disclaimers.
# `h`, `d` and `i` are extra parameters so that they are local to the function.
function update_chars(paper,    h, d, i) {
    if (paper == "") {
        return
    }
    h = hdrs[paper]
    d = disc[paper]
    if (line < h + d) {
        printf "Error for %s: cannot skip %d headers and %d disclaimers " \
               "in article with %d lines.\n", paper, h, d, line > "/dev/stderr"
        return
    }
    # Sum the chars from the first line after the headers
    # through the last line before the disclaimers.
    for (i = h; i < line - d; i++) {
        chrs[paper] += counts[i]
    }
}
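The slice that update_chars() sums can be easier to see in isolation. The sketch below does the same index arithmetic on a bash array: it drops the first h lines and the last d lines of one article and prints what is left. body_lines is a hypothetical helper name, and the arguments are the news01 lines from the sample, assuming a single trailing space where the debug output shows one.

```bash
#!/usr/bin/env bash
# Sketch of the slice summed by update_chars(): drop the first h lines
# (headers) and the last d lines (disclaimers) of one article.
# body_lines is a hypothetical helper name.
body_lines() {
    local h=$1 d=$2
    shift 2
    local lines=("$@")
    local n=${#lines[@]}
    if (( n < h + d )); then
        echo "cannot skip $h headers and $d disclaimers in $n lines" >&2
        return 1
    fi
    # Nothing is left between the headers and the disclaimers.
    (( n > h + d )) || return 0
    # Elements h .. n-d-1, i.e. n-h-d elements starting at offset h.
    printf '%s\n' "${lines[@]:h:n-h-d}"
}

# news01 from the sample: 1 header line, 2 disclaimer lines.
# Prints the three article lines.
body_lines 1 2 "this line to be ignored too" \
    "this is the " "first article " "of the file" \
    "disclaimer to " "be ignored"
```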
Output for the sample file:
$ awk -v debug=1 -f ./newscount.awk news.txt
Article for news01
news01 0: 27 <this line to be ignored too>
news01 1: 12 <this is the >
news01 2: 14 <first article >
news01 3: 11 <of the file>
news01 4: 14 <disclaimer to >
news01 5: 10 <be ignored>
Article for news02
news02 0: 17 <these lines to be>
news02 1: 11 <ignored too>
news02 2: 12 <this is the >
news02 3: 15 <second article >
news02 4: 11 <of the file>
news02 5: 24 <disclaimer to be ignored>
news01;37
news02;38
$ ./newscount.awk news.txt
news01;37
news02;38
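Since the question asked for bash, the same logic can also be sketched in plain bash. This is an illustration under stated assumptions rather than part of the answer above: it needs bash 4 or later, every name in it is hypothetical, and the per-newspaper numbers are the ones from the sample. Unlike the awk script, it also accepts a newspaper name line that carries surrounding whitespace.

```bash
#!/usr/bin/env bash
# Sketch: the same counting logic in plain bash (bash 4+ for associative
# arrays). All names below are hypothetical; the per-newspaper numbers are
# the ones from the sample.
shopt -s extglob                          # enables the +(...) pattern

declare -A HDRS=([news01]=1 [news02]=2)   # header lines to skip
declare -A DISC=([news01]=2 [news02]=1)   # disclaimer lines to skip
declare -A CHRS=()                        # total chars per newspaper
PAPERS=(news01 news02)                    # output order
BUF=()                                    # lines of the current article

# Add the buffered lines of newspaper $1 to its total, skipping the
# header lines at the start and the disclaimer lines at the end.
flush_article() {
    local paper=$1 i
    [[ -n $paper ]] || return 0
    for ((i = ${HDRS[$paper]}; i < ${#BUF[@]} - ${DISC[$paper]}; i++)); do
        CHRS[$paper]=$(( ${CHRS[$paper]:-0} + ${#BUF[i]} ))
    done
}

# Read file $1 and print "newspaper;chars" for each known newspaper.
count_articles() {
    local paper="" line key
    BUF=()
    CHRS=()
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line//+([[:space:]])/ }    # squeeze whitespace runs
        line=${line# }                    # drop a leading space
        [[ -n $line ]] || continue        # ignore blank lines
        key=${line% }                     # name line without trailing space
        if [[ -n ${HDRS[$key]+set} ]]; then
            flush_article "$paper"        # close the previous article
            paper=$key
            BUF=()
        elif [[ -n $paper ]]; then
            BUF+=("$line")
        fi
    done < "$1"
    flush_article "$paper"                # close the last article
    for paper in "${PAPERS[@]}"; do
        printf '%s;%d\n' "$paper" "${CHRS[$paper]:-0}"
    done
}
```

After sourcing the file, running count_articles news.txt > counts.csv would write the CSV. On an 80K-line file a bash read loop is likely to be noticeably slower than awk, which is one reason to prefer the awk script.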
Comments
awk '{print $1 ":" length($0)}' file