Ratatosk configuration

Configuration parser

ratatosk uses a YAML config parser that enforces sections and subsections, treating everything below that level as lists, dicts, or variables (see google app for nicely structured config files). An example is shown here:

section:
  subsection:
    options:
      - option1
      - option2

The parser maps everything below ‘options’ to regular Python objects (a list in this case). An option is retrieved via the function RatatoskConfigParser.get(). Every section heading maps to a ratatosk module, and every subsection heading maps to a task in that module. For instance, the module ratatosk.lib.align.bwa has a task Aln that can be configured as follows:

ratatosk.lib.align.bwa:
  Aln:
    options:
      - -e 2
      - -l 40

When the task is executed, it will run the command bwa aln -e 2 -l 40 ....
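The mapping from configured options to a command line can be sketched as follows. This is a minimal illustration, not ratatosk's implementation: the dictionary stands in for the parsed YAML, and `get_option` and `build_command` are hypothetical helpers (the real signature of RatatoskConfigParser.get() is not shown here and may differ).

```python
import shlex

# Stand-in for the parsed YAML above: section -> subsection -> options.
config = {
    "ratatosk.lib.align.bwa": {
        "Aln": {"options": ["-e 2", "-l 40"]},
    },
}


def get_option(config, section, subsection, name, default=None):
    """Hypothetical lookup helper; NOT the RatatoskConfigParser.get() signature."""
    return config.get(section, {}).get(subsection, {}).get(name, default)


def build_command(executable, subcommand, options):
    """Split each option string into arguments and prepend the program name."""
    argv = [executable, subcommand]
    for opt in options:
        argv.extend(shlex.split(opt))
    return argv


options = get_option(config, "ratatosk.lib.align.bwa", "Aln", "options", default=[])
command = build_command("bwa", "aln", options)
print(" ".join(command))
```

Input and output file arguments, which the task would append after the options, are left out of this sketch.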

Working with parent tasks

All tasks have a default requirement, which I call parent_task. In the current implementation, all tasks subclass BaseJobTask, which provides a parent_task class variable. This variable can be changed either at the command line (option --parent-task) or in a configuration file. The parent_task variable is a string naming a class in a Python module, so any importable class of your choice can serve as the parent. As an example, by default all ratatosk.lib.tools.picard tasks have ratatosk.lib.tools.picard.InputBamFile as their parent task. This can easily be modified in the config file:

ratatosk.lib.tools.picard:
  InputBamFile:
    parent_task: ratatosk.lib.tools.samtools.SamToBam
  HsMetrics:
    parent_task: ratatosk.lib.tools.picard.SortSam
    targets: targets.interval_list
    baits: targets.interval_list
  DuplicationMetrics:
    parent_task: ratatosk.lib.tools.picard.SortSam
  AlignmentMetrics:
    parent_task: ratatosk.lib.tools.picard.SortSam
  InsertMetrics:
    parent_task: ratatosk.lib.tools.picard.SortSam

ratatosk.lib.tools.samtools:
  SamToBam:
    parent_task: ratatosk.lib.align.BwaSampe

Note also that ratatosk.lib.tools.picard.InputBamFile has been changed to depend on ratatosk.lib.tools.samtools.SamToBam (default value is ratatosk.lib.files.external.BamFile).
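Because parent_task is just a dotted string, turning it into a class is a matter of importing the module and looking up the attribute. The sketch below shows one common way to do this with the standard library; `load_class` is a hypothetical helper, not ratatosk's own loader, and a standard-library class is used as a stand-in so the example runs without ratatosk installed.

```python
import importlib


def load_class(dotted_path):
    """Return the class named by a 'package.module.ClassName' string."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for a value such as 'ratatosk.lib.tools.samtools.SamToBam'.
cls = load_class("collections.OrderedDict")
print(cls.__name__)
```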

Resolving dependencies

The previous examples have assumed that tasks have one parent task. However, many applications depend on more than one input (Figure 1).

[Figure: dupmetrics_to_printreads]

Figure 1. Excerpt from variant calling pipeline

Therefore, the parent_task variable can also be a list of tasks. For instance, in Figure 1, the dependencies for PrintReads would be defined by the following configuration:

ratatosk.lib.tools.gatk:
  PrintReads:
    parent_task:
      - ratatosk.lib.tools.gatk.DuplicationMetrics
      - ratatosk.lib.tools.gatk.BaseRecalibrator
      - ratatosk.lib.tools.gatk.PicardMetrics

The order is important here. For gatk tasks, the first argument should be a bam/sam file. Since PrintReads also requires output from BaseRecalibrator, the second parent task is ratatosk.lib.tools.gatk.BaseRecalibrator. These are also the default parent tasks. In addition, the task PicardMetrics has been set as a parent task. Whenever you add more dependencies than the defaults, ratatosk will try to load the additional parent and, if that fails, fall back on ratatosk.job.NullJobTask, a task that always succeeds.

Note

Setting additional parent tasks only works if 1) the additional task is a wrapper task that generates its targets only from its own parents, or 2) it uses the same target as the first default task.
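The fallback behaviour for additional parents can be sketched as follows. This is an illustration under stated assumptions, not ratatosk's code: `NullTask` stands in for ratatosk.job.NullJobTask, `load_class` and `resolve_parents` are hypothetical helpers, and re-raising the error when a default parent cannot be loaded is an assumption (the text above only describes what happens for additional parents).

```python
import importlib


class NullTask:
    """Stand-in for ratatosk.job.NullJobTask, a task that always succeeds."""


def load_class(dotted_path):
    """Return the class named by a 'package.module.ClassName' string."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    return getattr(importlib.import_module(module_name), class_name)


def resolve_parents(configured, n_defaults, fallback=NullTask):
    """Load parents in order; only parents beyond the defaults may fall back."""
    parents = []
    for index, path in enumerate(configured):
        try:
            parents.append(load_class(path))
        except (ImportError, AttributeError, ValueError):
            if index < n_defaults:
                raise  # assumption: a broken default parent is an error
            parents.append(fallback)
    return parents


# Two loadable stand-ins for the defaults, plus one additional parent that fails.
parents = resolve_parents(
    ["collections.OrderedDict", "collections.Counter", "no_such_module.Missing"],
    n_defaults=2,
)
print([p.__name__ for p in parents])
```

The list order is preserved, which matters because the first parent is expected to supply the bam/sam input.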

Generating source names

Warning

The current implementation is confusing and will have to be reimplemented. See Source name generation.

Every class has a requires method that returns a list of parent tasks on which the current task depends. ratatosk dynamically loads the classes based on the names in parent_task and generates the required target names for the parent task in the method _make_source_file_name.

The procedure is best explained with an example. Consider Figure 2, which is a simplified representation of Figure 1, but with target file names in the boxes.

[Figure: issue_source_name_generation]

Figure 2. Excerpt from variant calling pipeline with target names. A dummy task has been added to illustrate a case where a parent has a label that should be removed from the child target name (e.g. for read suffixes in paired-end reads).

First, many tasks add labels to their output. Hence, every task has an attribute label. When the source file name is generated, the parent label is removed from the current task target name (for example, file.dup.realign.bam -> file.dup.bam). Second, in cases where there is a dependency on an ancestor task (DuplicationMetrics above), several labels should be removed. This is currently done with the attribute diff_label. Finally, some labels should be removed from parent to child, or added going “upwards”. Hence, the attribute add_label.
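The three attributes can be illustrated with a small string-manipulation sketch. It is a simplified reading of the description above, not the actual _make_source_file_name method: `make_source_name` and `remove_last` are hypothetical helpers, removing the last occurrence of each label is an assumption, and so is inserting add_label just before the file extension.

```python
import os


def remove_last(name, label):
    """Remove the last occurrence of label from name, if present."""
    head, sep, tail = name.rpartition(label)
    return head + tail if sep else name


def make_source_name(target, label="", diff_label=(), add_label=()):
    """Derive a parent (source) file name from a child target name."""
    source = target
    # Strip the task label, then any extra labels separating it from an ancestor.
    for lab in (label, *diff_label):
        if lab:
            source = remove_last(source, lab)
    # Labels that only the parent carries are added going "upwards".
    if add_label:
        root, ext = os.path.splitext(source)
        source = root + "".join(add_label) + ext
    return source


print(make_source_name("file.dup.realign.bam", label=".realign"))
print(make_source_name("file.dup.realign.bam", label=".realign", diff_label=(".dup",)))
print(make_source_name("sample.sort.bam", label=".sort", add_label=("_R1",)))
```

The file names and the `_R1` read suffix in the last call are made up for illustration.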

Confusing? Yes.