Q:

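I have a web page containing a very simple HTML table. The page's HTML code is shown below. The HTML will always stay the same; only the data values will change.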

<!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Demo Page Results</title>
        <link rel="stylesheet" href="/static/styles.css">
      </head>
      <body id="particles">
          <div class="container">
            <h1>Demo Page Results</h1>
            <table>
              <tr>
                <td>
                  Examiner:
                </td>
                <td class="marked">
                  Person
                </td>
              </tr>

              <tr>
                <td>
                  Qualification:
                </td>
                <td class="marked">
                  Teacher
                </td>
              </tr>
              <tr>
                <td>
                  Nation:
                </td>
                <td class="marked">
                  Any
                </td>
              </tr>

            </table>

          </div>

        <script src="/static/jrf.min.js"></script>
        <script src="/static/scripts.js"></script>
      </body>
    </html>

The page renders as follows:

Demo Page Results

Examiner: Person

Qualification: Teacher

Nation: Any

I need a bash script that extracts the contents of the table, like this:

Examiner: Person

Qualification: Teacher

Nation: Any

I am using the following bash script:

#!/bin/bash
# URL of the webpage
# containing the HTML
# table
URL="http://thewebpageurl"
# Use curl to fetch the
# HTML content of the
# webpage
HTML=$(curl -s "$URL")
#echo $HTML
# Use grep and sed to
# extract the table
# content
TABLE_HTML=$(echo "$HTML" | grep -o '<table[^>]*>.*</table>' | sed 's/<\/table.*//')
echo $TABLE_HTML
# Use awk to parse and
# print the table data
echo "$TABLE_HTML" | awk -F"</tr>" '{ for (i=1; i<=NF;i++) { sub(/.*<td[^>]*>/, "", $i); sub(/<\/td>.*/, "", $i); printf "%s ", $i; } printf "\n"; }'

Unfortunately, this does not display any results.

Please help and guide me to fix the bash script. Thanks, Neha

Linux Bash curl sed grep

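awk processes its input line by line (more precisely, record by record, and you can define the record separator). You don't know the line structure of the HTML file; the same page could, for example, be written as one long line containing all the HTML code. awk is definitely not the right tool for this task. I would search for an HTML parser instead.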
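To make the point above concrete, here is a minimal sketch that feeds the same table row to awk and grep in two layouts: one tag per line, and everything on a single line. The variable and function names are made up for illustration, and it assumes a POSIX shell with awk plus a grep that supports -o (GNU and BSD grep both do).

```shell
#!/bin/sh
# Hypothetical sample data: the same table in two layouts.
multi_line='<table>
<tr>
<td>
Examiner:
</td>
<td class="marked">
Person
</td>
</tr>
</table>'
single_line='<table><tr><td>Examiner:</td><td class="marked">Person</td></tr></table>'

# How many records (lines, by default) does awk see?
count_records() {
    printf '%s\n' "$1" | awk 'END { print NR }'
}

# Line-oriented rule: print the line right after <td class="marked">.
value_after_marked() {
    printf '%s\n' "$1" |
        awk 'found { print; found = 0 } $0 == "<td class=\"marked\">" { found = 1 }'
}

# The grep pattern from the question: it can only match within one line.
grep_table() {
    printf '%s\n' "$1" | grep -o '<table[^>]*>.*</table>'
}

count_records "$multi_line"                   # 10
count_records "$single_line"                  # 1
value_after_marked "$multi_line"              # Person
value_after_marked "$single_line"             # nothing: the tag never sits alone on a line
grep_table "$multi_line" || echo "no match"   # no match
grep_table "$single_line"                     # the whole table, on one line
```

Each tool works on one layout and silently returns nothing on the other, which is consistent with the empty output reported in the question, where the table spans many lines.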

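0 upvotes Bodo 9/14/2023

Are you sure the HTML you want to process is always formatted the same way, i.e. one tag per line, with each value on its own line, throughout the table-related code? What about HTML entities: should they be kept as-is or displayed as the character they represent? Please edit your question to clarify this. Alternatively: lynx --dump input.html | fgrep :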

0 upvotes dawg 9/14/2023

Obligatory: You. Can't. Parse. HTML. With. A. Regex.

0 upvotes Neha 9/14/2023

How can I process it line by line using awk? Kindly guide.

A:

1 upvote Shawn 9/14/2023 #1

You can use xmllint in HTML mode with an XPath expression to extract the contents of the table cells, then massage them into the desired format:

$ xmllint --html --xpath '//div[@class="container"]/table/tr/td/text()' input.html | perl -0777 -pe 's/:\s+/: /g; s/^\s+//mg'
Examiner: Person
Qualification: Teacher
Nation: Any

0 upvotes dawg 9/14/2023 #2

Ruby has a great XML/HTML parser called Nokogiri.

You can do something like this:

ruby -r nokogiri -e 'Nokogiri::HTML($<.read).at("table").search("tr").
each{ |tr|
    cells = tr.search("th, td")
    puts cells.map{|cell| cell.text.strip }.join(" ")
}' file

Prints:

Examiner: Person
Qualification: Teacher
Nation: Any

With some hit and trial I got: curl -s myurl | awk '/<table>/,/<\/table>/' | sed -n '/<tr>/,/<\/tr>/p' | sed -e 's/<[^>]*>//g' | tr -d ' \n' but the output is a single line containing all the table data values, i.e. Examiner:PersonQualification:TeacherNation:Any. How can I add a # after each row? For example: Examiner:Person#Qualification:Teacher#Nation:Any
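One way to get the # separator asked for here is to stop deleting the newlines and let awk decide where a row ends. The sketch below keeps the tag-stripping steps from the pipeline above and assumes, as the question states, one tag or value per line, and that only label lines end in a colon. join_rows and sample_page are illustrative names, not part of the original thread.

```shell
#!/bin/sh
# Hypothetical stand-in for the output of: curl -s myurl
sample_page='<h1>Demo Page Results</h1>
<table>
  <tr>
    <td>
      Examiner:
    </td>
    <td class="marked">
      Person
    </td>
  </tr>
  <tr>
    <td>
      Qualification:
    </td>
    <td class="marked">
      Teacher
    </td>
  </tr>
  <tr>
    <td>
      Nation:
    </td>
    <td class="marked">
      Any
    </td>
  </tr>
</table>'

join_rows() {
    awk '/<table>/,/<\/table>/' |   # keep only the table
        sed -e 's/<[^>]*>//g' |     # strip the tags
        tr -d ' ' |                 # delete spaces, but keep the newlines
        awk '
            NF {                                # skip the now-empty lines
                printf "%s%s", sep, $0
                sep = (/:$/ ? "" : "#")         # "#" only after a value line
            }
            END { print "" }                    # final newline
        '
}

printf '%s\n' "$sample_page" | join_rows
# Examiner:Person#Qualification:Teacher#Nation:Any
```

Like the original pipeline, this removes spaces inside values too, and it would misplace the separator if a value itself ended in a colon.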

0 upvotes Ed Morton 9/15/2023 #3

Assuming your HTML format never changes, as you say, then using any POSIX awk:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    { gsub(/^[[:space:]]+|[[:space:]]+$/,"") }
    $0 == "</td>"                   { inTag=inVal=0 }
    inTag { tag = $0 }
    inVal { print tag, $0 }
    $0 == "<td>"                    { inTag=1 }
    $0 == "<td class=\"marked\">"   { inVal=1 }
' "${@:--}"

$ ./tst.sh file
Examiner: Person
Qualification: Teacher
Nation: Any

If you really want a blank line after each output line, as shown in the expected output in the question, just change "print tag, $0" to "print tag, $0 ORS".
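As a quick check of that variant, the sketch below wraps the same awk program, with the ORS change applied, in a shell function that reads standard input, so it could also sit at the end of a curl pipeline. The function name and the sample rows are illustrative, and the input is assumed to keep the one-tag-per-line layout from the question.

```shell
#!/bin/sh
print_rows_spaced() {
    awk '
        { gsub(/^[[:space:]]+|[[:space:]]+$/, "") }
        $0 == "</td>"                 { inTag = inVal = 0 }
        inTag                         { tag = $0 }
        inVal                         { print tag, $0 ORS }   # ORS adds the blank line
        $0 == "<td>"                  { inTag = 1 }
        $0 == "<td class=\"marked\">" { inVal = 1 }
    '
}

# Hypothetical sample: two rows in the layout from the question.
sample_rows='<tr>
  <td>
    Examiner:
  </td>
  <td class="marked">
    Person
  </td>
</tr>
<tr>
  <td>
    Qualification:
  </td>
  <td class="marked">
    Teacher
  </td>
</tr>'

printf '%s\n' "$sample_rows" | print_rows_spaced
```

With a real page, the last line would become something like curl -s "$URL" | print_rows_spaced, with URL set as in the question's script.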

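@Reino, thanks, but that doesn't reproduce the OP's desired result either. However, -e '//tr/normalize-space()' is indeed equivalent to my first solution. My second one was a response to the first comment the OP made on Ed Morton's solution; with a small modification you can use -e 'replace(join(//tr/normalize-space(),"#")," ","")'

0 upvotes Reino 9/17/2023

I got very different results with paste and sed, but they may not be up to date. However, if //tr/normalize-space(td) and //tr/normalize-space() don't give you the same result, then your xidel binary is not up to date. Also, if you want to remove all whitespace, you can drop normalize-space() altogether: -e 'join(//tr/replace(td,"\s+",""),"#")'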


Bash Script To Get HTML Table Data
