提问人:Marco 提问时间:7/9/2023 最后编辑:Marco 更新时间:7/10/2023 访问量:125
如何过滤一个 fastq 文件,发现一个退化模式?[已结束]
How to filter a fastq file grepping a degenerate pattern? [closed]
问:
我想过滤一个 fastq 文件,以便仅输出呈现特定模式的序列 (“..........华促会.....GTGG..........“,该点对应于具有相关质量的任何核苷酸A,C,G,T)。
即输入文件
@Reads1
AGCATTTGATATCAAATTTGGTGGATTGGTGTTGTGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATTATCACCAGGGCAACAAAAGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads3
ATTATCAAAAAAAAACCCTTGGTGGCCATGCATTGAGA
+
AAFFFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJ
输出文件:
@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ
答:
1赞
kvantour
7/10/2023
#1
有大量工具可用于专门针对处理 FASTQ 或 FASTA 文件进行调整。但是,这是一个标准的 awk 程序。
我们假设 FASTQ 格式采用以下约定。
FASTQ 记录每个序列有四个以行分隔的字段:
- 字段 1 以“@”字符开头,后跟序列标识符和可选说明(如 FASTA 标题行)。
- 字段 2 是原始序列字母。
- 字段 3 以“+”字符开头,并可选择地后跟相同的序列标识符(和任何描述)。
- 字段 4 对字段 2 中序列的质量值进行编码,并且必须包含与序列中的字母相同数量的符号。
我们可以使用 awk 来重新创建它:
awk 'BEGIN{name=1;seq=2;comm=3;qual=0}
{ a[FNR%4]=$0; if (FNR%4) next }
# At this point we have the full record
# Perform sequence check and print if accepted:
match(a[seq],"..........CAA.....GTGG..........") {
print a[1]; print substr(a[2],RSTART,RLENGTH);
print a[3]; print substr(a[0],RSTART,RLENGTH)
}' file.fastq
评论
0赞
Marco
7/10/2023
它不起作用,因为它不会截断输出序列以仅具有模式。
0赞
kvantour
7/10/2023
@Marco 我已经更新了答案
1赞
Kaz
7/10/2023
#2
TXR 中的解决方案。
$ txr filter.txr data
@Reads1
CATTTGATATCAAATTTGGTGGATTGGTGTTG
+
FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@Reads2
ATCACCAGGGCAACAAAAGTGGCCATGCATTG
+
FFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJ
请注意,基于要提取每个部分中第二个核苷酸序列的相应子序列的假设,我们计算的值与预期数据之间存在差异:
$ diff <(txr filter.txr data) expected
4c4
< FFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
---
> AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJ
目前还不清楚是否预计会有多个匹配项,第二个序列会相应地进行修剪。
法典:
@(repeat)
@@@heading
@before@{match /..........CAA.....GTGG........../}@nil
+
@other
@ (output)
@@@heading
@match
+
@{other [(len before)..(+ (len before) (len match))]}
@ (end)
@(end)
评论
bash
python
grep
sed