How can I extract the value of html TH tags that occur multiple times in a text file using a bash script?

Q:

I have a text file that contains HTML markup. I want to extract the values from this section:

<th scope="col" class="text-center">158</th>
<th scope="col" class="text-center">139 (87.97%)</th>
<th scope="col" class="text-center">18 (11.39%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>
<th scope="col" class="text-center">1 (0.63%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>

These values will change from time to time, but there will always be exactly 6 of these th tags. I have tried doing something like this:

text="$(cat email_resp.txt | grep -n '<th scope="col" class="text-center">' | sort)"

I have also tried this:

text2="$(sed -n '/<th scope="col" class="text-center">/,/<\/th>/p' email_resp.txt)"

But what I get back is like a "blob" of text that I cannot iterate over. This is the output when I use the grep command:

689:                        <th scope="col" class="text-center">158</th>
690:                        <th scope="col" class="text-center">139 (87.97%)</th>
691:                        <th scope="col" class="text-center">18 (11.39%)</th>
692:                        <th scope="col" class="text-center">0 (0.00%)</th>
693:                        <th scope="col" class="text-center">1 (0.63%)</th>
694:                        <th scope="col" class="text-center">0 (0.00%)</th>

This is the output when I use the sed command:

<th scope="col" class="text-center">158</th>
<th scope="col" class="text-center">139 (87.97%)</th>
<th scope="col" class="text-center">18 (11.39%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>
<th scope="col" class="text-center">1 (0.63%)</th>
<th scope="col" class="text-center">0 (0.00%)</th>

Ideally, what I would like to do is extract these values between the <th> tags into an array or variables so that I can use them elsewhere.

regex bash sed grep

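-P is for PCRE and is only available in GNU grep. There is nothing PCRE-specific in the regexp being used, it is just a plain old BRE as grep uses by default, so you can remove -P to improve performance and portability. The rule for quoting in shell is to use single quotes until/unless you need double quotes - if you follow that rule then you don't need to escape the double quotes in the regexp, e.g. grep '<th scope="col" class="text-center">' - but you also don't need grep at all when you're using awk, and you don't need cat when you're using either.

1 upvote Ed Morton 9/19/2023

To demonstrate that last point - the pipeline of multiple commands

cat email_resp.txt | grep -P "<th scope=\"col\" class=\"text-center\">" | awk 'BEGIN { FS = "<|>"; } { print $3; }'

can be written as the single command

awk 'BEGIN { FS = "<|>"; } /<th scope="col" class="text-center">/{ print $3; }' email_resp.txt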
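A minimal sketch of that single-command form, assuming a small made-up sample file in place of email_resp.txt. With FS set to "<|>", a matching line splits into the text before the tag ($1), the tag's own contents ($2) and the value ($3), which is why $3 is printed. The names sample and extracted are only for illustration.

```shell
#!/bin/bash
# Hypothetical sample input standing in for email_resp.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
    <th scope="col" class="text-center">158</th>
    <th scope="col" class="text-center">139 (87.97%)</th>
    <th scope="col">ignored</th>
</tr>
EOF

# One awk process both selects the matching lines and prints field 3,
# so neither cat nor grep is needed.
extracted=$(awk 'BEGIN { FS = "<|>" } /<th scope="col" class="text-center">/ { print $3 }' "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

Only the two lines carrying the full attribute string are selected; the row tags and the th without class="text-center" are skipped.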

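0 upvotes Reilas 9/15/2023 #2

"...Ideally, what I would like to do is extract these values between the th tags into an array or variables so I can use them elsewhere. ..."

You can use the "--perl-regexp" and "--only-matching" switches with grep.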

grep -Po '(?<=<th scope="col" class="text-center">).+(?=</th>)' data.txt

158
139 (87.97%)
18 (11.39%)
0 (0.00%)
1 (0.63%)
0 (0.00%)
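To get from that output to the array the question asks for, one option is bash's mapfile builtin. This is only a sketch: it assumes bash 4 or later and a grep built with PCRE support, and it uses a made-up three-line sample in place of data.txt. The array name values is an illustration, not something from the answer.

```shell
#!/bin/bash
# Hypothetical sample input standing in for data.txt.
data=$(mktemp)
cat > "$data" <<'EOF'
    <th scope="col" class="text-center">158</th>
    <th scope="col" class="text-center">139 (87.97%)</th>
    <th scope="col" class="text-center">18 (11.39%)</th>
EOF

# mapfile -t stores each output line as one array element, so values
# containing spaces such as "139 (87.97%)" stay in one piece.
mapfile -t values < <(grep -Po '(?<=<th scope="col" class="text-center">).+(?=</th>)' "$data")
rm -f "$data"

echo "count: ${#values[@]}"
for i in "${!values[@]}"; do
    printf '%d: %s\n' "$i" "${values[i]}"
done
```

Reading through process substitution rather than a pipe keeps mapfile in the current shell, so the array is still set after the command finishes.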

0 upvotes ashish_k 9/15/2023 #3

Using GNU sed:

You can capture the output in a variable as below:

var=$(sed -rn 's#.*class="text-center">(.*)</th>#\1#p' file_name)

Explanation:

-r        use extended regular expressions in the script.
-n        suppress automatic printing of pattern space
'#' is used as the delimiter of the s command; the '()' captures only the required field, and \1 prints that first captured group.

Output:

echo "$var"
158
139 (87.97%)
18 (11.39%)
0 (0.00%)
1 (0.63%)
0 (0.00%)
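Since var holds one value per line, it can be split into separate variables by reading it line by line from a here-string. The sketch below assumes GNU sed (for -r), a made-up three-line sample in place of file_name, and placeholder variable names (first, second, third) that are not from the answer.

```shell
#!/bin/bash
# Hypothetical sample input standing in for file_name.
file_name=$(mktemp)
cat > "$file_name" <<'EOF'
<th scope="col" class="text-center">158</th>
<th scope="col" class="text-center">139 (87.97%)</th>
<th scope="col" class="text-center">18 (11.39%)</th>
EOF

var=$(sed -rn 's#.*class="text-center">(.*)</th>#\1#p' "$file_name")
rm -f "$file_name"

# Each read consumes one line of $var. The brace group runs in the
# current shell, so the variables are still set afterwards.
{
    IFS= read -r first
    IFS= read -r second
    IFS= read -r third
} <<< "$var"

echo "first=$first second=$second third=$third"
```

With the six tags from the question, the same pattern would simply use six read lines.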

1 upvote Chaitanya 9/15/2023 #4

If you have GNU Awk version 4 or above, you can do this:

$ sed 's/.*>\(.*\)<.*/\1/' markup.txt | 
    awk '
         BEGIN{
             PROCINFO["sorted_in"] = "@ind_num_asc"
         }
         { arr[NR]=$0 }
         END{
             for (i in arr) print i, arr[i]
         }
    '

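You don't need sed when you're using awk. With GNU awk,

sed 's/.*>\(.*\)<.*/\1/' file | awk '{arr[NR]=$0}'

is equivalent to

awk '{arr[NR]=gensub(/.*>(.*)<.*/,"\\1",1)}' file

It's not obvious what the intent of your awk script is, though - it prints the input it gets from sed preceded by line numbers, but cat -n or similar would do that, and even in awk it would just be

awk '{print NR, $0}'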

1 upvote ufopilot 9/15/2023 #5

#!/bin/bash


source <(
        awk -F'<th scope="col" class="text-center">|</th>' '
                BEGIN{print "declare -a myArr1=(" }
                NF==3{print "\047"$2"\047"}
                END{print ")"}
        ' file
)

declare -a myArr2="(
        $(
                awk -F'<th scope="col" class="text-center">|</th>' '
                     NF==3{print "\047"$2"\047"}
                ' file
        )
)"

declare -p myArr1
declare -p myArr2

declare -a myArr1=([0]="158" [1]="139 (87.97%)" [2]="18 (11.39%)" [3]="0 (0.00%)" [4]="1 (0.63%)" [5]="0 (0.00%)")
declare -a myArr2=([0]="158" [1]="139 (87.97%)" [2]="18 (11.39%)" [3]="0 (0.00%)" [4]="1 (0.63%)" [5]="0 (0.00%)")
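Both forms above generate shell code from the data, so they rely on the values never containing a single quote. As a different technique, not the answer's method, here is a pure-bash sketch that fills an array with the [[ =~ ]] operator and BASH_REMATCH, so the values are never evaluated as code. The sample file, the array name myArr3 and the value with an apostrophe are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample input standing in for the file being parsed.
input=$(mktemp)
cat > "$input" <<'EOF'
<tr>
    <th scope="col" class="text-center">158</th>
    <th scope="col" class="text-center">it's 1 (0.63%)</th>
</tr>
EOF

# Keeping the pattern in a variable avoids quoting problems inside [[ =~ ]].
re='<th scope="col" class="text-center">(.*)</th>'
myArr3=()
while IFS= read -r line; do
    # BASH_REMATCH[1] holds the text captured by the parenthesised group.
    if [[ $line =~ $re ]]; then
        myArr3+=("${BASH_REMATCH[1]}")
    fi
done < "$input"
rm -f "$input"

declare -p myArr3
```

This reads the file line by line in the shell, which is slower than awk on large inputs but is unlikely to matter for a handful of tags.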
