提问人:Anon 提问时间:7/22/2023 最后编辑:Anon 更新时间:7/27/2023 访问量:133
在多分辨率文件上迭代运行软件
Iteratively run a software on multi-resolution files
问:
.mcool 文件包含多个分辨率的矩阵。
一个 .mcool 文件的冷却器:
cooler ls ./../input/A001C007.hg38.nodups.pairs.mcool
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000
(EagleC)
对于 ./input/*.mcool 中的每个文件,如果 cooler ls “$mcool_file” 在最后一个 / 之后以 5000、10000 或 50000 结尾,我想从 EagleC 运行 predictSV。
如存储库所示,单个 .mcool 文件
predictSV --hic-5k SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000 \
--hic-10k SKNAS-MboI-allReps-filtered.mcool::/resolutions/10000 \
--hic-50k SKNAS-MboI-allReps-filtered.mcool::/resolutions/50000 \
-O SK-N-AS -g hg38 --balance-type CNV --output-format full \
--prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
但是,对于文件列表,我需要编写一个 for 循环来迭代运行。predictSV
我的尝试:
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
# iterate over ids emitted from cooler ls for this file
hic5k_num=; hic10k_num=; hic50k_num=
while IFS= read -r id; do
id_suffix=${id##*/}
case $id_suffix in
5000) hic5k_num=$id_suffix;;
10000) hic10k_num=$id_suffix;;
50000) hic50k_num=$id_suffix;;
esac
done < <(cooler ls "$mcool_file")
echo predictSV \
${hic5k_num:+ --hic-5k "$hic5k_num"} \
${hic10k_num:+ --hic-10k "$hic10k_num"} \
${hic50k_num:+ --hic-50k "$hic50k_num"} \
-g hg38 \
-O "${mcool_file%%.*}" \
--balance-type CNV \
--output-format full \
--prob-cutoff-5k 0.8 \
--prob-cutoff-10k 0.8 \
--prob-cutoff-50k 0.99999
done
但是,我的输出是:
predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
预期输出:
predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
编辑:
此外,我想将部分从以下部分更改为以下内容: + 文件中括号后面的子字符串。例如。-O
"${mcool_file%%.*}"
EagleC_output/
/
mcool_file
EagleC_output/A001C007
答:
你可以使用 GNU Parallel 来简化你的 bash 脚本。
使用以下命令修改 bash 脚本:parallel
编辑:修改了bash脚本,包括基于解析输入文件名的动态选项-O
grep
.mcool
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
OUTPUT=$(echo "$mcool_file" | grep -oE "[A-Z0-9]*\.hg38" | cut -d '.' -f 1)
cooler ls "$mcool_file" | grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' \
| tr '\n' '\t' \
| parallel --dry-run --colsep '\t' --link \
predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} \
-O 'EagleC_output/'$OUTPUT -g hg38 --balance-type CNV --output-format full \
--prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999 \
| tr -d "'"
done
它提供:
predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C007 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C008 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
假设有两个文件命名为:
A001C007.hg38.nodups.pairs.mcool
A001C008.hg38.nodups.pairs.mcool
其中包含显示您在问题中提供的每行分辨率的输出。cooler ls
(注意:只需从同一个 bash 脚本中删除该选项,然后运行它,即可在第一次执行“试运行”后实际执行命令列表,以确保所有命令都将以正确的语法执行。还有一个非常有用的选项。--dry-run
--joblog
详:
为了帮助说明和分解上述 bash 脚本的工作原理:
我制作了一个名为“test.mcool”的测试文件,其中包含:
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000
以模拟 的输出。cooler ls
使用给出:bash grep
$ grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
然后可以通过管道传输给:parallel
grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool | tr '\n' '\t' | parallel --dry-run --colsep '\t' --link predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} [...]
请参阅上面提供的 bash 脚本输出,其中还考虑了 的输出选项,并且该命令已设置为轻松使用您提供的 bash 脚本。predictSV
选项说明parallel
--dry-run
:仅打印出命令的执行方式(删除此选项将实际并行运行命令))--link
:告诉拆分,分隔(由选项指定;我曾经将换行符转换为制表符),并按照指示(使用大括号中包含的整数数字)将管道输入到要并行执行的命令模式模板中进行分区。parallel
--colsep
tr
(额外注意:进一步的改进是另外并行遍历所有 ./*.mcool
文件(即运行嵌套的并行
单个命令)。但是,只有当您关心通过更充分地实现 predictSV
命令的并行化来改善分析的整体计算时间时,这才有意义。
评论
predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
./../input/something/something
input/something/something
500
10000
cooler ls
echo "$mcool_file"
、 等与 、 等相对。看我刚刚分配了一个变量,但 echo $variable
显示了其他内容。在这种特殊情况下,如果设置为包含存在于我们正在处理的变量中的字符(like 或 ),它就会显现出来;还有与通球相关的副作用(可以显现或不显现,不仅取决于变量的内容,还取决于 [...]ls "$mcool_file"
echo $mcool_file
ls $mcool_file
IFS
:
/
鉴于您正在尝试执行的操作,变量名称中的后缀具有误导性。您不想存储数字;要存储 ID。因此,您应该存储 -- 或者,使用一个不那么容易误导的变量名称,而不是设置 :_num
hic5k_num=$id_suffix
hic5k_num=$id
hic5k=$id
#!/usr/bin/env bash
for mcool_file in input/*.mcool; do
# iterate over ids emitted from cooler ls for this file
hic5k_num=; hic10k_num=; hic50k_num=
while IFS= read -r id; do # id=./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
id_suffix=${id##*/} # id_suffix=200
id_part1=${id%%::*} # id_part1=./../input/A001C007.hg38.nodups.pairs.mcool
id_part2=${id##*::} # id_part2=/resolutions/200
# id_part1_basename=A001C007.hg38.nodups.pairs.mcool
id_part1_basename=${id_part1##*/}
# id_reassembled=A001C007.hg3.nodeps.pairs.mcool::/resolutions/200
id_reassembled=$id_part1_basename::$id_part2
case $id_suffix in
5000) hic5k=$id_reassembled;;
10000) hic10k=$id_reassembled;;
50000) hic50k=$id_reassembled;;
esac
done < <(cooler ls "$mcool_file")
mcool_basename=${mcool_file##*/}
printf '%q ' predictSV \
${hic5k:+ --hic-5k "$hic5k"} \
${hic10k:+ --hic-10k "$hic10k"} \
${hic50k:+ --hic-50k "$hic50k"} \
-g hg38 \
-O "EagleC_output/${mcool_basename%%.*}" \
--balance-type CNV \
--output-format full \
--prob-cutoff-5k 0.8 \
--prob-cutoff-10k 0.8 \
--prob-cutoff-50k 0.99999
printf '\n'
done
评论
-O
"${mcool_file%%.*}"
EagleC_output/
/
mcool_file
EagleC_output/A001C007
mcool_basename
| xargs -d $'\n' -P "$(nproc)" -n 1 -- bash -c
nprocs
predictSV
评论
hic5k_num=$id_suffix
应该这样你保存名称,而不仅仅是后缀。hic5k_num=$id
::/resolutions/5000
::/resolutions/10000
::/resolutions/50000
cooler ls
)hic5k_num
hic10k_num
hic15k_num
hic5k_num=$id_suffix
hic5k_num=$id