Asked by: harpers29 Asked: 11/15/2023 Last edited by: Timur Shtatland, harpers29 Updated: 11/16/2023 Views: 85
How should I go about implementing conditional string replacements in a fasta file?
Q:
I have a large fasta file with various bacterial species names in each sequence header, like so:
file.fasta
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
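What I would like to do is search each header for a single species, Streptomyces. If it is listed, replace the entire header with just "Streptomyces"; otherwise, replace the entire header with "Not Streptomyces":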
new_file.fasta
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
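My first instinct is to use something like awk or sed for this replacement, but I'm having trouble figuring out how to replace the entire string.
How should I go about it?
A:
0 votes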
dawg
11/15/2023
#1
In any awk, you can do:
awk '/^>/{
s="Not Streptomyces"
n=split($0,fields,";")
for(i=1;i<=n;i++) if (fields[i]=="Streptomyces") s="Streptomyces"
$0=">" s
} 1
' file
Or, with GNU awk, using a word-boundary regex:
gawk '/^>/ {
    if ($0~/\<Streptomyces\>/)
        $0=">Streptomyces"
    else
        $0=">Not Streptomyces"
}
1
' file
Or, more tersely:
gawk '/^>/ { $0=">" ($0~/\<Streptomyces\>/ ? "" : "Not ") "Streptomyces" }1' file
Or, if you can rely on the header always starting with >Bacteria; and the line always ending with ; (as in your example), then you can do (in any awk):
awk '/^>/ { $0=">" ($0~/;Streptomyces;/ ? "" : "Not ") "Streptomyces" } 1' file
Ruby:
ruby -lpe 'if /^>/ then $_ = /\bStreptomyces\b/ ? ">Streptomyces" : ">Not Streptomyces" end' file
Any of these prints:
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
1 vote
markp-fuso
11/15/2023
#2
Assumptions:
- the species will always be bookended by a pair of semicolons (;)
One awk idea:
awk '
/^>/ { if ($0 ~ /;Streptomyces;/) # if header line and contains Streptomyces then ...
$0 = ">Streptomyces" # redefine current line
else # else ...
$0 = ">Not Streptomyces" # redefine current line
}
1 # print current line
' fasta.dat
Another awk idea, using a shell variable to dynamically define the species to search for:
spec='Streptomyces' # shell variable assignment
awk -v species="${spec}" ' # set awk variable "species" to value of shell variable "spec"
/^>/ { if ($0 ~ ";" species ";") # if header contains our species then ...
$0 = ">" species
else
$0 = ">Not " species
}
1
' fasta.dat
Both of these generate:
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
1 vote
potong
11/16/2023
#3
This might work for you (GNU sed):
sed -E 's/^>.*\b(Streptomyces)\b.*/>\1/I;t;s/^>.*/>Not Streptomyces/' file
If a line begins with > and contains the word Streptomyces, replace it with >Streptomyces.
Otherwise, if the line begins with >, replace it with >Not Streptomyces.
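The `\b` word boundary and the `I` flag are GNU sed extensions. Where GNU sed is not available, a sketch along the following lines should behave similarly. It assumes (unlike the command above) that the genus always appears as a complete `;Streptomyces;` field and that case-insensitive matching is not needed. The function name `relabel_sed` is made up for illustration.

```shell
# Hypothetical helper: portable sed variant of the relabeling.
# Each command is its own -e so that "b" and "}" are parsed the same way
# by GNU and BSD sed.
#   /^>/!b                  non-header lines jump to the end and are printed as-is
#   /;Streptomyces;/{ ... } headers with the field are rewritten, then "b" skips the rest
#   last s command          any header that reaches it is labeled as a non-match
relabel_sed() {
    sed -e '/^>/!b' \
        -e '/;Streptomyces;/{' \
        -e 's/.*/>Streptomyces/' \
        -e 'b' \
        -e '}' \
        -e 's/.*/>Not Streptomyces/'
}

printf '%s\n' \
    '>Bacteria;Actinobacteria;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;' \
    'TTGGCAGT' \
    '>Bacteria;Actinobacteria;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;' \
    'TTGGCAGT' |
    relabel_sed
```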
0 votes
ufopilot
11/16/2023
#4
$ awk -F';' -v spec=Streptomyces '/^>/{print($0~spec ? ">"spec : ">Not "spec); next}1' file
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
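In the one-liner above, `-F';'` is set but the fields are never used, and `$0~spec` is an unanchored regex match, so any header that merely contains the text `Streptomyces` (for instance only inside a longer species name) is labeled as a match. A minimal sketch that puts the field separator to work and compares each field exactly is shown below. The function name `relabel_fields` and the second sample header are invented for illustration.

```shell
# Hypothetical helper: label a header as a match only when one whole
# ;-separated field equals the species name.
# Usage: relabel_fields SPECIES < in.fasta
relabel_fields() {
    awk -F';' -v spec="$1" '
        /^>/ {
            label = ">Not " spec
            for (i = 1; i <= NF; i++) {
                f = $i
                if (i == 1) sub(/^>/, "", f)   # drop the leading ">" from the first field
                if (f == spec) { label = ">" spec; break }
            }
            print label
            next
        }
        1   # sequence lines pass through unchanged
    '
}

# The second header is made up: it contains "Streptomyces" only as part of
# a longer field, so an exact field comparison does not count it
printf '%s\n' \
    '>Bacteria;Actinobacteria;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;' \
    'TTGGCAGT' \
    '>Bacteria;Actinobacteria;Example;Streptomyces_like_sp;' \
    'TTGGCAGT' |
    relabel_fields Streptomyces
```

Whether a header like the second one should count as Streptomyces depends on the data; if substring matches are intended, the original one-liner already does that.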
Comments
awk '/^>/{$0=">"($0~q?"":"Not ")q}1' q=Streptomyces in.fasta >out.fasta
awk '/^>/{ $0 = ">" ( $0 ~ "(>|;)" q "(;|$)" ? "" : "Not " ) q } 1' q=Streptomyces in.fasta >out.fasta
?Not Streptomyces
Not_Streptomyces