提问人:Drosera_capensis 提问时间:10/6/2023 更新时间:10/10/2023 访问量:35
在 Snakemake 中管理包装器
Manage the wrappers in Snakemake
问:
我正在尝试使用配置文件config.yaml制作Snakemake管道。
---
samples:
- Sample_1: reads/raw_reads_1
- Sample_2: reads/raw_reads_2
- Sample_3: reads/raw_reads_3
reference:
- "path/to/reference.fasta"
...
遵循 Snakemake 包装器存储库中给出的良好做法,我尽可能多地使用包装器命令。https://snakemake-wrappers.readthedocs.io/en/stable/但是,我观察到不同包装器的输入/输出不匹配。 有没有办法在不修改包装器的情况下管理规则之间的文件名以连接包装器?
import os
import snakemake.io
import glob
configfile: "config.yaml"
rule all:
input:
expand("mapped/{sample}.bam", sample=config["samples"])
rule fastqc:
input:
"reads/{sample}.fastq"
output:
html="qc/fastqc/{sample}.html",
zip="qc/fastqc/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params:
extra = "--quiet"
log:
"logs/fastqc/{sample}.log"
threads: 1
resources:
mem_mb = 1024
wrapper:
"v2.6.0/bio/fastqc"
rule bwa_index:
input:
"{genome}.fasta",
output:
idx=multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
log:
"logs/bwa_index/{genome}.log",
params:
algorithm="bwtsw",
wrapper:
"v2.6.0/bio/bwa/index"
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
idx=multiext("genome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
output:
"mapped/{sample}.bam",
log:
"logs/bwa_mem/{sample}.log",
params:
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sorting="none", # Can be 'none', 'samtools' or 'picard'.
sort_order="queryname", # Can be 'queryname' or 'coordinate'.
sort_extra="", # Extra args for samtools/picard.
threads: 8
wrapper:
"v2.6.0/bio/bwa/mem"
答:
1赞
skanderbeg
10/10/2023
#1
包装器的目的只是包装如何做某事的逻辑。没有神奇的方法可以自动将规则的输入连接到输出。您的工作是为您的目的定义输入和输出值。包装器中的值不匹配,因为它们仅用于测试目的。
这就是全部答案,但是我将尝试提供一些示例,但是您对示例的配置不正确。
def get_sample_names():
# return list of sample names
rule all:
input:
bamqc=expand("mapped/{sample}.bam", sample=get_sample_names()),
fastqc=expand("qc/fastqc/{sample}_{pair}.html", sample=get_sample_names(), pair=["1","2"]),
rule bwa_index:
input:
"{prefix_reference}.fasta",
output:
idx=multiext("{prefix_reference}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
log:
"{prefix_reference}/logs/bwa_index.log",
params:
algorithm="bwtsw",
wrapper:
"v2.6.0/bio/bwa/index"
rule bwa_mem:
input:
reads=[
"reads/{sample}.1.fastq.gz",
"reads/{sample}.2.fastq.gz",
],
idx=multiext(
os.path.splitext(config["reference"]),
".amb",
".ann",
".bwt",
".pac",
".sa",
),
output:
"mapped/{sample}.bam",
...
rule fastqc:
input:
"reads/{sample}.{pair}.fastq"
output:
html="qc/fastqc/{sample}_{pair}.html",
zip="qc/fastqc/{sample}_{pair}_fastqc.zip"
params:
extra = "--quiet"
log:
"logs/fastqc/{sample}_{pair}.log"
wrapper:
"v2.6.0/bio/fastqc"
在此示例中,我假设您的所有数据都在 .示例名称可以由配置提供,也可以为目录中的所有示例提供 glob。管理输入读取的另一种更复杂和更高级的方法是使用 pepfiles。
我建议进一步检查现有的工作流程,例如从目录中检查最新的工作流程reads/
reads
评论
0赞
Drosera_capensis
10/11/2023
谢谢@skanderbeg。我确实认为所有包装器都具有相同的变量,但这是由于我对 snakemake 的不熟悉。我想我正在尝试从头开始设置一个管道来了解 snakemake 的工作原理。但是,使用和分叉现有工作流确实更合理。使用 pepfile 可以很好地解决我遇到的问题,即将样本 ID 输入到我的原始读取文件中,该文件的名称不相关。我想使用 pepfiles 是可能的。再次感谢。
评论