Follow up question to: find characters present in two different lines if they satisfy a positional relationship

Asked by: Jalan  Asked: 10/13/2022  Updated: 10/13/2022  Views: 35

Q:

This is a follow-up to an earlier question, outlined below. I have the following three strings (ignore the lines starting with >):


>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKG
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKG
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFH

I want to find the character positions of every R and D/E in the three chains that satisfy the following relationships:

Ri (chain A) - Di+2 (chain B)
Ri (chain B) - Di+2 (chain C)
Ri (chain C) - Di+5 (chain A)

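Explanation: iterate over every R in chain A (at position i) and check whether position i+2 of chain B contains D or E. If it does, output the character positions of each such R and D/E pair. Do the same for chains B+C and chains C+A.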

The catch: when determining the relationship, dashes must be counted. When printing the positions, dashes must be ignored.
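The relationship check itself can be sketched on raw positions, dashes included. The following is a minimal illustration, not the script from the original question: `pair_hits` is a hypothetical helper name, and the two toy strings are made up for the example.

```shell
# pair_hits FIRST SECOND OFFSET   (hypothetical helper, for illustration only)
# Prints "<raw pos of R> <raw pos of D/E> <D or E>" for every R in FIRST whose
# partner position (i + OFFSET) in SECOND holds D or E. Dashes count as positions.
pair_hits() {
  awk -v a="$1" -v b="$2" -v off="$3" 'BEGIN {
      for (i = 1; i <= length(a); i++) {
          if (substr(a, i, 1) != "R") continue      # only R in the first chain
          c = substr(b, i + off, 1)                 # partner character in the second chain
          if (c == "D" || c == "E") print i, i + off, c
      }
  }'
}

pair_hits "AR-AR" "--ADAAE" 2
# prints:
# 2 4 D
# 5 7 E
```

The positions printed here are still raw positions; converting them to dash-free positions is a separate step.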

Using the script posted in the original question, I get the following output:

B-C 187 R E

What the output should be:

B-C 175-188 R E
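The difference between the two outputs is the dash count. In the chains shown above, 12 dashes (9 + 2 + 1) precede raw position 187 of chain B, so the R is residue 175 once dashes are ignored; one dash precedes raw position 189 of chain C, so the E is residue 188. A minimal sketch of that conversion, using a hypothetical helper `ungapped_pos` and a made-up toy string:

```shell
# ungapped_pos SEQ RAW   (hypothetical helper, for illustration only)
# Prints the position of raw index RAW in SEQ once dashes are ignored.
ungapped_pos() {
  awk -v s="$1" -v p="$2" 'BEGIN {
      prefix = substr(s, 1, p)          # everything up to and including RAW
      n = gsub(/-/, "", prefix)         # gsub returns the number of dashes removed
      print p - n
  }'
}

ungapped_pos "--AB-C" 6   # prints 3 (C is the third residue)
ungapped_pos "--AB-C" 3   # prints 1 (A is the first residue)
```

The key point is that the dash count has to cover the whole prefix of the original chain, not just the most recent chunk.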

I modified the code posted in the original question to include a correction:

awk '
    { chain_id[++c]=$2                                     # save chain id, eg, "A", "B", "C"
      getline                                              # read next line from input file
      chains[c]=$0                                         # save associated chain
    }

END { i_char="R"                                           # character to search for in 1st chain

      for (i=1;i<=c;i++) {                                 # loop through list of chains
          j= (i==c ? 1 : i+1)                              # determine index of 2nd chain
          offset= (i==c ? 5 : 2)                           # +2 for A-B, B-C; +5 for C-A

          chain_i=chains[i]                                # copy chains as we are going to cut them up as we process them
          chain_j=chains[j]
         
          
          
          chain_pair= chain_id[i] "-" chain_id[j]          # build output label, eg, "A-B"
          pos=0                                            # reset position

          while (length(chain_i)>0) {

                n=index(chain_i,i_char)                    # look for i_char ("R")

                if (n==0) break                            # if not found we are done with this chain pair so break out of loop else ...
                pos=pos+n                                  # update our position in the chain; pos is the raw position (dashes counted)
                j_char=substr(chain_j,n+offset,1)          # find character from 2nd chain at location n+offset

                if (j_char ~ /D|E/) {                      # if 2nd chain character is one of "D" or "E" then ...
                    corr_i=substr(chain_i,1,n)
                    corr=gsub(/-/,"",corr_i)               # count dashes in the current chunk
                    corr_pos=pos-corr
                    print chain_pair,corr_pos,i_char,j_char   # print our finding
                }

                chain_i=substr(chain_i,n+1)                # strip off 1st n characters
                chain_j=substr(chain_j,n+1)
          }
      }
    }
' file

But this did not help; the output is still incorrect:

B-C 187 R E
string bash awk sequence

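Comments

0 votes RavinderSingh13 10/13/2022
Could you also post the expected sample output in your question to make it clearer? Thank you.
0 votes Nic3500 10/13/2022
"What the output should be: B-C 175-188 R E". How? If I look at position 175 of string B, it is L, not R.
0 votes Jalan 10/13/2022
@Nic3500: It is 175-188 if you ignore the dashes "-", which is what the output should do. When determining the relationship, dashes must be counted. When printing the positions of R and E, dashes must be ignored.
0 votes Jalan 10/13/2022
@RavinderSingh13: The expected output is B-C 175-188 R E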

A:

1 vote markp-fuso 10/13/2022 #1

Add some logic to keep a running count of dashes:

awk '
    { chain_id[++c]=$2; getline; chains[c]=$0 }
END { i_char="R"
      for (i=1;i<=c;i++) {

          j= (i==c ? 1 : i+1)
          offset= (i==c ? 5 : 2)

          chain_i=chains[i]
          chain_j=chains[j]

          chain_pair= chain_id[i] "-" chain_id[j]
          pos=dash_cnt_i=dash_cnt_j=0

          while (length(chain_i)>0) {

                n=index(chain_i,i_char)
                if (n==0) break

                pos=pos+n

                head_i = substr(chain_i,1,n)                    # copy everything up to matching character
                head_j = substr(chain_j,1,n)                    # copy everything up to matching character

                dash_cnt_i += gsub(/-/,"",head_i)               # add count of dashes in head_i; gsub() returns number of substitutions which in this case is also the number of dashes in head_i
                dash_cnt_j += gsub(/-/,"",head_j)               # add count of dashes in head_j

                j_char=substr(chain_j,n+offset,1)

                if (j_char ~ /E|D/)
                   print chain_pair,(pos-dash_cnt_i) "-" (pos+offset-dash_cnt_j) ,i_char,j_char

                chain_i=substr(chain_i,n+1)
                chain_j=substr(chain_j,n+1)
          }
      }
    }
' file.txt

This generates:

A-B 355-357 R E
A-B 390-392 R E
A-B 597-599 R D
A-B 781-783 R E
A-B 917-919 R D
A-B 968-970 R D
A-B 1063-1065 R E
A-B 1516-1518 R D
A-B 1638-1640 R E
B-C 175-188 R E                 # OP's expected result
B-C 346-364 R D
B-C 355-373 R E
B-C 396-414 R D
B-C 500-519 R D
B-C 585-602 R D
B-C 917-963 R E
B-C 1063-1108 R E
B-C 1173-1218 R D
B-C 1516-1562 R D
C-A 334-321 R E
C-A 400-389 R E
C-A 471-459 R E
C-A 740-706 R D
C-A 1228-1190 R E
C-A 1589-1552 R E
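
As an illustration of the same idea, the dash bookkeeping can also be done up front by recording the dash-free position of every raw index before scanning. This is a sketch under assumptions, not the code above: `scan_pair`, `ua` and `ub` are hypothetical names, and the inputs are toy strings. Unlike the running counters, this variant also subtracts dashes that fall between the R and the D/E in the second chain, which may or may not be what is wanted.

```shell
# scan_pair LABEL FIRST SECOND OFFSET   (hypothetical helper, for illustration only)
# Matches on raw positions (dashes counted), prints dash-free positions.
scan_pair() {
  awk -v label="$1" -v a="$2" -v b="$3" -v off="$4" 'BEGIN {
      # ua[i] / ub[i]: dash-free position of raw index i in each chain
      for (i = 1; i <= length(a); i++) ua[i] = ua[i-1] + (substr(a, i, 1) != "-")
      for (i = 1; i <= length(b); i++) ub[i] = ub[i-1] + (substr(b, i, 1) != "-")

      for (i = 1; i + off <= length(b); i++) {
          if (substr(a, i, 1) != "R") continue      # only R in the first chain
          c = substr(b, i + off, 1)                 # partner character, raw offset
          if (c ~ /^[DE]$/) print label, ua[i] "-" ub[i + off], "R", c
      }
  }'
}

scan_pair "B-C" "--R-A" "A-AAE" 2   # prints: B-C 1-4 R E
scan_pair "A-B" "R--"   "A-D"   2   # prints: A-B 1-2 R D
```

In the second call the dash sits inside the offset window of the second chain, so the D is reported as residue 2 rather than 3.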