如何在 bash 中解析文件，awk...以更有效的方式链接程序和子程序-解网

问：

我需要为我们实际使用的程序和子程序列表生成一个树视图。

我们现在有一个工作脚本（KSH），但需要很长时间才能实现（3 到 4 小时之间），我很确定它可以更高效/快速地完成

实际工作脚本使用两个文件： 1：我们实际使用的程序列表（2926行） 2：与此程序关联的所有程序和子程序的列表（23922行）

为了更好地理解，这是文件 2 的表示：

A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X

程序 A 调用子程序 B、C、D 和 E 程序 B（由 A 调用）调用子程序 F 和 G 程序 C（由 A 调用）调用子程序 J、K 和 L 等。。。

从这两个文件中，我们需要编写一个 CSV 文件，它代表我们在第一个文件中使用的所有程序的树视图。

这是上一个表示的预期输出：

A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;E

为了实现这一目标，我们实际上是这样做的：

主脚本的一部分调用csv_tree.sh：

#!/bin/ksh
cat program_list | awk -F "|" '{print($2)}' | sort | uniq >program_list.tmp
while read -r line; do
    begin=$(echo "$line" | awk -F "|" '{print $1}')
    sh csv_tree.sh "$begin" 2>/dev/null
done <$HOME/FISCALITE/OSCAR/FICHIER_IMPORT/FIC_PRG_TREE.tmp
#delete temporary file
rm program_list.tmp

csv_tree.sh ：

#!/bin/ksh
#set -x
entry_file="T_SS_PRG_N1"
output_csv_file="${1}_tree.csv"
: >"$output_csv_file"
echo $1 >>"$output_csv_file"
count=0
rank=0
chaine=$(grep -E "^$1" $entry_file | awk '{$1=""}1')
#function loop
loop() {
    rank=$((rank + 1))
    for prg in $2; do
        count=$((count + 1))
        if [[ -n $(echo ${prg} | grep ${chaine}) ]]; then
            rank=1
        fi
        #   calculate the number of ";" to print in the csv file
        separator=$(for ((i = 0; i < rank; ++i)); do echo -n ";"; done)
        echo "${separator}${prg}" >>"$output_csv_file"
        if [ $(echo ${prg} | egrep "^(X|Y|Z)") != "" ] && [ $(grep "^${prg}" ${entry_file} | awk -F\| '{print $2}' | tr -d ' ') != "" ]; then
            echo "${separator};$(grep "^${prg}" ${entry_file} | awk -F\| '{print $2}')" >>"$output_csv_file"
        else
            loop "${prg}_$count" "$(grep -E "^$prg" $entry_file | awk '{$1=""}1')"
            rank=$((rank - 1))
        fi
    done
}

#begin
loop "$1" "$chaine"

我很确定我们可以通过纯粹的 awk 来实现这一目标，但我并不是真的喜欢它：/

如果有人有一些技巧或其他方法可以做得更好，我会敞开心扉^^

bash 循环 awk 树 ksh

你能在输入中有一个循环，例如E调用A，即使输入是错误的吗？如果是这样，应该如何显示？A 和 B 可以同时呼叫 C 吗？如果是这样，应该如何显示？你说“我们需要写一个 csv 文件......”但是，您显示的预期输出不是 CSV（不是逗号分隔的，并且每行上的字段数量不同），因此请编辑您的问题以解决该矛盾。

1赞 Ed Morton 9/13/2023

通常，您应该在 awk 脚本中使用递归下降函数，请参阅 stackoverflow.com/a/32020697/1745001 和/或 stackoverflow.com/a/42736174/1745001 和/或 stackoverflow.com/a/47834902/1745001，至少了解如何执行您想要的操作。

0赞 Ed Morton 9/13/2023

还请澄清一下“我不是真的喜欢它”是什么意思，在您的陈述中“我很确定我们可以通过纯粹的 awk 实现这一目标，但我并不是真的喜欢它：/” - 这是否意味着您不知道如何使用 awk 但愿意学习，或者您不想要使用 awk 或其他东西的解决方案？如果这意味着您不希望使用解决方案，那么您应该从您的问题中删除它的标签，并说明您不想使用的其他工具。

1赞 user1934428 9/13/2023

只是好奇：递归程序是如何表示的？如？P:P

1赞 user1934428 9/13/2023

既然你关心性能，我就不写了这个。哪个程序调用哪个信息称为调用图，我会用一种语言编写它，这使得处理有向图变得方便，例如 Ruby 或 Java 或 C++。您的方法中的主要性能损失来自 IMO 来自每个输入的许多子进程 - 这太疯狂了，您可以很高兴它只运行数小时而不是数天。bash

答：

0赞 Kaz 9/13/2023 #1

在 TXR Lisp 中：

$ cat file1
A
B
C
E
L
$ cat file2
A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X
$ txr deps.tl file1 file2
A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;D
;E
B
;F
;G
C
;J
;K
;;Z
;;;Y
;;;X
;L
;;M
;;N
E
L
;M
;N

代码如下所示：

(defstruct mod ()
  name
  children)

(defvarl mh (hash))

(defun resolve-names (m)
  (when (and m.children
             (stringp (first m.children)))
    (set m.children
         (append-each ((c m.children))
           (iflet ((cm [mh c]))
             (list cm))))))

(defun print-tree (m : path (prefix ""))
  (put-line `@prefix@{m.name}`)
  (unless (member m path)
    (let ((nprefix `;@prefix`)
          (npath (cons m path)))
      (each ((c m.children))
        (print-tree c npath nprefix)))))

(tree-bind (file1 file2) *args*
  (let ((used (file-get-lines "file1"))
        (deps (file-get-lines "file2")))
    ;; populate object hash with each definition
    (each ((d deps))
      (match `@left|@right` d
        ;; look up or create new node for left side
        (let ((m (or [mh left]
                     (set [mh left] (new mod name left))))
              (ch (tok #/\S+/ right)))
          ;; ensure node exists for each child
          (each ((c ch))
            (or [mh c] (set [mh c] (new mod name c))))
          ;; append children to left hand node
          (set m.children (append m.children ch)))))
    ;; convert child names to objects
    (dohash (n m mh)
      (resolve-names m))
    ;; print tree view for each used object
    (each ((u used))
      (iflet ((m [mh u]))
        (print-tree m)))))

对循环引用进行检查，但对重复项（菱形引用）不进行检查。print-tree

我们只为根程序生成树（根程序==其他程序未调用的程序）
OP 有多个根程序
OP 提到了两个文件 - 和 - 但在实际代码中使用了其他名称File1File2
我猜File2 == program_list
我们没有看到内容，所以我继续，好像不存在;OP 可能需要根据File1File1File1

我们将复制 OP 的示例数据以包含 2 个根程序（和）：AA1

$ cat program_list
A|B C D E
B|F G
C|J K L
E|
K|Z
L|M N
Z|Y X
A1|B1 C1 D1 E1
B1|F1 G1
C1|J1 K1 L1
E1|
K1|Z1
L1|M1 N1
Z1|Y1 X1

虽然其他语言将具有更好的性能，但这里有一种方法应该比 OP 的当前代码具有更好的性能（如果仅仅是因为它不调用任何子 shell）：bash

$ cat build_tree.bash
#!/bin/bash

unset   pc c p r                                 # just in case script is sourced
declare -A c p r                                 # associative arrays for p(arent), c(hild) and r(oot) nodes

#######
# parse data into p[], c[] and parent_child ( pc_<parent>[] ) arrays

while read -r line
do
    IFS='[ |]' read -ra a <<< "${line}"          # a[0]==parent, a[1-n]==children

    parent="${a[0]}"

    p[$parent]=1
    declare -n pc=pc_"${parent}"                 # nameref to create pc_<parent>=( children ) arrays, eg: pc_A=( B C D E )
    pc=()

    for ((i=1 ; i<${#a[@]} ; i++))               # loop through child nodes
    do
        child="${a[$i]}"
        c[$child]=1                              # add as index in c[] array; c[] used in next loop to determine which nodes are roots
        pc+=( "$child" )                         # add child to pc == pc_<parent>[] array
    done
done < program_list                              # process File2

#######
# determine root nodes

for node in "${!p[@]}"                           # loop through p[] array indices
do
    [[ -z "${c[$node]}" ]] && r[$node]=1         # if not an index in the c[] array then add as index in the r[] array
done

unset child                                      # no longer needed so go ahead and release memory

#######
# recursive function
#
# for a given parent:
# - loop through array of children
# - print a child and if the child has children then make recursive call

recurse() {
    local parent pfx pc child

    parent="$1"                                  # eg: A, B, D1
    pfx="$2"                                     # eg: ';', ';;', ';;;'

    declare -n pc=pc_"${parent}"                 # nameref for pc_<parent>[] array

    for child in "${pc[@]}"                      # loop through child nodes
    do
        echo "${pfx}${child}"                    # print child node

        # if child is also a parent then make recursive call

        [[ -n "${p[$child]}" ]] && recurse "${child}" "${pfx}${sep}"
    done
}

#######
# main step: loop through root nodes and make top-level recurse() calls

sep=';'

for root in "${!r[@]}"
do
    echo "########### new tree:"
    echo "${root}"
    recurse "${root}" "${sep}"
done

#######
# release memory ... just in case script is sourced

for parent in "${!p[@]}"
do
    declare -n pc=pc_"${parent}"                 # nameref for pc_<parent>[] array
    unset pc
done

unset c p r

笔记：

需要 NameRef （）支持bash 4.2+declare -n

试驾：

$ ./build_tree.bash
########### new tree:
A
;B
;;F
;;G
;C
;;J
;;K
;;;Z
;;;;Y
;;;;X
;;L
;;;M
;;;N
;D
;E
########### new tree:
A1
;B1
;;F1
;;G1
;C1
;;J1
;;K1
;;;Z1
;;;;Y1
;;;;X1
;;L1
;;;M1
;;;N1
;D1
;E1

awk 'function getpath(name){return (name in arr)?sprintf("%-10s%-10s",getpath(arr[name]),name):sprintf("%-10s",name)}function printasrecord(path){patsplit(path,parts,/.{10}/);depth=length(parts);$0="";$depth=parts[depth];print}BEGIN{FS="[| ]";PROCINFO["sorted_in"]="@ind_str_asc";OFS=";"}/[|]./{for(i=2;i<=NF;i+=1){arr[$i]=$1}}END{for(i in arr){paths[getpath(i)]};for(j in paths){printasrecord(j)}}' file.txt

给出输出

;B         
;;F         
;;G         
;C         
;;J         
;;K         
;;;Z         
;;;;X         
;;;;Y         
;;L         
;;;M         
;;;N         
;D         
;E

解释：我处理并创建数组，其中 key 是子数组，value 是它的父数组。然后我使用最近的递归函数将所有路径作为其他数组的键，这些数组以固定宽度格式呈现，我选择了值 10，但是如果您有比这更长的值，请相应地调整它。这允许使用字母顺序排序来获得所需的输出。我通过设置 to 来通知 GNU，需要这样的排序（有关可能的值，请参阅控制扫描）。我以这种方式遍历路径数组，对于每个键（路径），我使用 patsplit 将其解析为多个部分，然后我将当前行（）设置为空字符串并将 th 字段设置为路径第 th 部分的值，这是最后一部分。请注意，输出与所呈现的期望输出略有不同，因为没有（没有父级），并且相同深度的顺序由字母排序决定，因此在之前。file.txtarrgetpathAWKPROCINFO["sorted_in"]@ind_str_asc$0depthdepthAXY

免责声明：我的解决方案假设您正在使用有向、无环、连接的图，其中每个元素最多有 1 个直接祖先

（在 GNU Awk 5.1.0 中测试）

如何在 bash 中解析文件，awk...以更有效的方式链接程序和子程序

how to parse a file in bash, awk... for linking program and subprogram in a more efficient way

评论

评论

评论

评论