How to use awk to capture multiple patterns in a text file into several blocks of text and print each block to a new file

Asked by: humbleStrength  Asked: 11/2/2023  Last edited by: humbleStrength  Updated: 11/3/2023  Views: 53

Q:

I have this BIND DNS stats text file, sample_data.txt:

+++ Statistics Dump +++ (1698804161)
++ Incoming Requests ++
            34199522 QUERY
                   2 STATUS
                  12 UPDATE
++ Incoming Queries ++
                   2 RESERVED0
            19539834 A
              203203 NS
              239215 CNAME
               25636 SOA
              235650 PTR
                  96 HINFO
              922800 MX
              616897 TXT
                   5 RP
                  13 AFSDB
                   8 SIG
                   7 KEY
             9112095 AAAA
                  15 LOC
                  18 EID
              339894 SRV
                  75 NAPTR
                   7 KX
                  11 CERT
                 232 A6
                  55 DNAME
                   5 APL
                2172 DS
                  14 SSHFP
                   6 IPSECKEY
                  35 RRSIG
                 183 NSEC
              135429 DNSKEY
                   3 DHCID
                   8 NSEC3
                   6 NSEC3PARAM
                 196 TLSA
                  27 TYPE53
                  21 HIP
                  28 TYPE59
                  20 TYPE60
                  28 TYPE61
                   3 TYPE62
                  73 TYPE63
                 156 TYPE64
             2815625 TYPE65
                2297 SPF
                   7 TYPE108
                  11 TYPE109
                 752 AXFR
                1115 ANY
                   4 DLV
                5530 Others
++ Outgoing Queries ++
[View: default]
[View: _bind]
++ Name Server Statistics ++
            34199536 IPv4 requests received
            33035183 requests with EDNS(0) received
                1433 requests with TSIG received
               74232 TCP requests received
            20645922 auth queries rejected
                4604 recursive queries rejected
                 730 transfer requests rejected
                  12 update requests rejected
            34199536 responses sent
               71843 truncated responses sent
            33035183 responses with EDNS(0) sent
                1433 responses with TSIG sent
            24625387 queries resulted in successful answer
            33852582 queries resulted in authoritative answer
              135913 queries resulted in non authoritative answer
              135913 queries resulted in referral answer
             3911181 queries resulted in nxrrset
                   2 queries resulted in SERVFAIL
             5316014 queries resulted in NXDOMAIN
              210273 other query failures
++ Zone Maintenance Statistics ++
                 234 IPv4 notifies sent
++ Resolver Statistics ++
[Common]
[View: default]
[View: _bind]
++ Cache DB RRsets ++
[View: default]
[View: _bind (Cache: _bind)]
++ Socket I/O Statistics ++
                  27 UDP/IPv4 sockets opened
                   3 TCP/IPv4 sockets opened
                  25 UDP/IPv4 sockets closed
               74330 TCP/IPv4 sockets closed
               74338 TCP/IPv4 connections accepted
                  42 TCP/IPv4 recv errors
++ Per Zone Query Statistics ++
[sampledomain1.com]
             1898118 auth queries rejected
                  77 recursive queries rejected
                  16 transfer requests rejected
                  12 update requests rejected
             5125667 queries resulted in successful answer
            10890351 queries resulted in authoritative answer
               79163 queries resulted in non authoritative answer
               79163 queries resulted in referral answer
             2997088 queries resulted in nxrrset
             2767596 queries resulted in NXDOMAIN
[sampledomain2.com]
            18026742 auth queries rejected
                1945 recursive queries rejected
                  10 transfer requests rejected
            18773892 queries resulted in successful answer
            20863228 queries resulted in authoritative answer
               56644 queries resulted in non authoritative answer
               56644 queries resulted in referral answer
              778332 queries resulted in nxrrset
             1311004 queries resulted in NXDOMAIN
--- Statistics Dump --- (1698804161)

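What I'm trying to do is use awk to capture the block of text between each `[anydomainname]` record separator, excluding the separator itself, and output that block to a new file. So the new files file1.txt and file2.txt would contain:

file1.txt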

             1898118 auth queries rejected
                  77 recursive queries rejected
                  16 transfer requests rejected
                  12 update requests rejected
             5125667 queries resulted in successful answer
            10890351 queries resulted in authoritative answer
               79163 queries resulted in non authoritative answer
               79163 queries resulted in referral answer
             2997088 queries resulted in nxrrset
             2767596 queries resulted in NXDOMAIN

file2.txt

            18026742 auth queries rejected
                1945 recursive queries rejected
                  10 transfer requests rejected
            18773892 queries resulted in successful answer
            20863228 queries resulted in authoritative answer
               56644 queries resulted in non authoritative answer
               56644 queries resulted in referral answer
              778332 queries resulted in nxrrset
             1311004 queries resulted in NXDOMAIN

respectively.

Here is what I have so far:

 awk '/^\[[[:lower:]]/ {p=1; next};
     /^\[[[:lower:]]/ {p=0};
     {if (p==1) {print last} {last=$0}}' sample_data.txt | tail -n+2

Which gets me this:

             1898118 auth queries rejected
                  77 recursive queries rejected
                  16 transfer requests rejected
                  12 update requests rejected
             5125667 queries resulted in successful answer
            10890351 queries resulted in authoritative answer
               79163 queries resulted in non authoritative answer
               79163 queries resulted in referral answer
             2997088 queries resulted in nxrrset
             2767596 queries resulted in NXDOMAIN
            18026742 auth queries rejected
                1945 recursive queries rejected
                  10 transfer requests rejected
            18773892 queries resulted in successful answer
            20863228 queries resulted in authoritative answer
               56644 queries resulted in non authoritative answer
               56644 queries resulted in referral answer
              778332 queries resulted in nxrrset
             1311004 queries resulted in NXDOMAIN

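But as you can see, I have two problems.

  1. I still need to split each block into its respective domain section.
  2. Then I need to output each block of text to a new file.

Can I do this by extending my current awk command, using `BEGIN`, `for` and conditionals, and then printing to a file for each block? I just want to know whether awk can do this the way I'm thinking about it. TIA.

Edit: Expanding my question to also include how to output a third file containing the block of text before the `++ Per Zone Query Statistics ++` line. That block sits before the first pattern match, `[anydomain]`, which now becomes the second entry point, for the second file's block of text.

file3.txt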

+++ Statistics Dump +++ (1698804161)
++ Incoming Requests ++
            34199522 QUERY
                   2 STATUS
                  12 UPDATE
++ Incoming Queries ++
                   2 RESERVED0
            19539834 A
              203203 NS
              239215 CNAME
               25636 SOA
              235650 PTR
                  96 HINFO
              922800 MX
              616897 TXT
                   5 RP
                  13 AFSDB
                   8 SIG
                   7 KEY
             9112095 AAAA
                  15 LOC
                  18 EID
              339894 SRV
                  75 NAPTR
                   7 KX
                  11 CERT
                 232 A6
                  55 DNAME
                   5 APL
                2172 DS
                  14 SSHFP
                   6 IPSECKEY
                  35 RRSIG
                 183 NSEC
              135429 DNSKEY
                   3 DHCID
                   8 NSEC3
                   6 NSEC3PARAM
                 196 TLSA
                  27 TYPE53
                  21 HIP
                  28 TYPE59
                  20 TYPE60
                  28 TYPE61
                   3 TYPE62
                  73 TYPE63
                 156 TYPE64
             2815625 TYPE65
                2297 SPF
                   7 TYPE108
                  11 TYPE109
                 752 AXFR
                1115 ANY
                   4 DLV
                5530 Others
++ Outgoing Queries ++
[View: default]
[View: _bind]
++ Name Server Statistics ++
            34199536 IPv4 requests received
            33035183 requests with EDNS(0) received
                1433 requests with TSIG received
               74232 TCP requests received
            20645922 auth queries rejected
                4604 recursive queries rejected
                 730 transfer requests rejected
                  12 update requests rejected
            34199536 responses sent
               71843 truncated responses sent
            33035183 responses with EDNS(0) sent
                1433 responses with TSIG sent
            24625387 queries resulted in successful answer
            33852582 queries resulted in authoritative answer
              135913 queries resulted in non authoritative answer
              135913 queries resulted in referral answer
             3911181 queries resulted in nxrrset
                   2 queries resulted in SERVFAIL
             5316014 queries resulted in NXDOMAIN
              210273 other query failures
++ Zone Maintenance Statistics ++
                 234 IPv4 notifies sent
++ Resolver Statistics ++
[Common]
[View: default]
[View: _bind]
++ Cache DB RRsets ++
[View: default]
[View: _bind (Cache: _bind)]
++ Socket I/O Statistics ++
                  27 UDP/IPv4 sockets opened
                   3 TCP/IPv4 sockets opened
                  25 UDP/IPv4 sockets closed
               74330 TCP/IPv4 sockets closed
               74338 TCP/IPv4 connections accepted
                  42 TCP/IPv4 recv errors

Tags: bash awk sed grep

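Comments

2 upvotes Barmar 11/2/2023
Every time you start a new domain section, set a variable to the output filename. Then use `> variable` in the `print` line to print to that file.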
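
A minimal sketch of what this comment describes, run on a shortened, made-up input. The file name `demo.txt`, the `zoneN.txt` output names and the variables `out` and `n` are assumptions for illustration and do not come from the original post:

```shell
# Build a small stand-in for the "Per Zone" part of the stats file.
cat > demo.txt <<'EOF'
++ Per Zone Query Statistics ++
[sampledomain1.com]
    77 recursive queries rejected
    16 transfer requests rejected
[sampledomain2.com]
    1945 recursive queries rejected
EOF

awk '
/^\[/ {                        # a [domain] line starts a new section
    if (out != "") close(out)  # finish the previous output file, if any
    out = "zone" (++n) ".txt"  # set the output filename for this section
    next                       # do not copy the [domain] line itself
}
out != "" { print > out }      # copy every other line to the current file
' demo.txt
```

Lines seen before the first `[domain]` line are skipped because `out` is still empty at that point. This sketch does not stop at the closing `--- Statistics Dump ---` line of the real file, so that case would still need handling.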

A:

1 upvote anubhava 11/2/2023 #1

This `awk` should work for you:

awk -v hdr="file3.txt" '
/^\+\+ Per Zone Query Statistics/ {
   hdr = ""                # stop writing the header file at this line
}
hdr {
   print > hdr             # while hdr is set, copy each record to file3.txt
}
/^\[[[:lower:]]/ {         # indicates start domain [...]
   close(fn)
   fn = "file" ++f ".txt"  # construct output filename `fn`
   next
}
/^[^[:blank:]]/ {          # indicates end of block
   fn = ""
}
fn {
   print > fn              # prints each record to fn
}' file

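Comments

0 upvotes humbleStrength 11/2/2023
This works! Could you extend it to also output another file that has the block of text before the first domain match? Or is that more complicated? So everything above the `++ Per Zone Query Statistics ++` record separator?
0 upvotes anubhava 11/2/2023
It can be done. Could you edit your question and show the expected content of this additional file?
0 upvotes anubhava 11/2/2023
Check my updated answer.
0 upvotes humbleStrength 11/2/2023
Works! Will accept as the solution. Would you mind explaining the difference between how awk prints the text above the matched pattern `/^\+\+ Per Zone Query Statistics/` (as in the file3 output) and how it prints the text after the matched pattern `/^\[[[:lower:]]/` (as in the file1 and file2 outputs)? I'm trying to understand this better. I don't see how awk captures the text above in one case, while for the other parts it captures the text that follows the matched pattern.
1 upvote anubhava 11/2/2023
Because we set the variable `hdr` to `file3.txt` on the command line and keep printing until we hit the `++ Per Zone Query Statistics` line. In the other case, we start printing when we find a line matching the pattern `^\[[[:lower:]]` and stop printing when we find a line that starts with a non-blank character.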
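
The two behaviours can be sketched side by side on a tiny, invented input. The names `flags.txt`, `head.txt` and `body.txt`, the `STOP` marker and the `[x]` section below are assumptions chosen for illustration only:

```shell
cat > flags.txt <<'EOF'
alpha
beta
STOP
[x]
 one
 two
END
EOF

awk -v hdr="head.txt" '
/^STOP/  { hdr = "" }              # flag starts set, cleared at the marker
hdr      { print > hdr }           # so only lines ABOVE the marker are copied
/^\[/    { fn = "body.txt"; next } # flag starts empty, set at the [x] line
/^[^ ]/  { fn = "" }               # cleared at the next unindented line
fn       { print > fn }            # so only lines AFTER [x] are copied
' flags.txt
```

Here `head.txt` receives `alpha` and `beta`, while `body.txt` receives the two indented lines. The only difference between the two cases is whether the flag variable is already set when awk reads the first line.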

0 upvotes potong 11/3/2023 #2

This might work for you (GNU csplit):

csplit -f file -b '%d.txt' --sup file '/^\[\w\+\.\w\+\]$/' '{*}'

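Split the file at lines that begin with `[`, end with `]`, and contain a `.`; `--sup` (an abbreviation of `--suppress-matched`) drops the matched lines from the output.

Name the files `filen.txt`, with `n` starting from 0.

Note that the first file (`file0.txt`) will contain all the lines before the first domain.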
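
A small sketch of the same command on a shortened, made-up input, assuming GNU csplit is installed. The input name `zones.txt` and the `part` prefix are placeholders, and `-s` is added here only to silence the byte counts csplit normally prints:

```shell
cat > zones.txt <<'EOF'
++ Per Zone Query Statistics ++
[sampledomain1.com]
    77 recursive queries rejected
[sampledomain2.com]
    1945 recursive queries rejected
EOF

# --suppress-matched drops the [domain] lines themselves;
# '{*}' repeats the split for every further match.
csplit -s -f part -b '%d.txt' --suppress-matched zones.txt '/^\[\w\+\.\w\+\]$/' '{*}'
```

This leaves the header line in `part0.txt` and one domain block each in `part1.txt` and `part2.txt`. On the full stats file, the last piece would also keep the closing `--- Statistics Dump ---` line, because nothing after the last domain matches the pattern.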