# Ratatosk configuration

## Configuration parser

ratatosk uses a YAML config parser that enforces sections and subsections, treating everything below that level as lists, dicts, or variables (see google app for nicely structured config files). An example is shown here:

    section:
      subsection:
        options:
          - option1
          - option2


The parser maps everything below ‘options’ to regular Python objects (a list in this case). An option is retrieved via the function RatatoskConfigParser.get(). Every section heading maps to a ratatosk module, whereas each subsection heading maps to a task in that module. For instance, the module ratatosk.lib.align.bwa has a task Aln that can be configured as follows:

    ratatosk.lib.align.bwa:
      Aln:
        options:
          - -e 2
          - -l 40


When the task is executed, it will run the command bwa aln -e 2 -l 40 ....
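The mapping from configuration to command line can be sketched in a few lines of Python. This is only an illustration: the dictionary stands in for the parsed YAML above, and `get_option` and `build_command` are hypothetical helpers, not the actual signature of RatatoskConfigParser.get().

```python
import shlex

# Parsed form of the YAML example above (what a YAML loader would return).
config = {
    "ratatosk.lib.align.bwa": {
        "Aln": {"options": ["-e 2", "-l 40"]},
    }
}


def get_option(config, section, subsection, option, default=None):
    """Hypothetical lookup helper: section -> subsection -> option."""
    return config.get(section, {}).get(subsection, {}).get(option, default)


def build_command(executable, sub_executable, options):
    """Split each option string into tokens and prepend the program name."""
    args = [executable, sub_executable]
    for opt in options:
        args.extend(shlex.split(opt))
    return args


opts = get_option(config, "ratatosk.lib.align.bwa", "Aln", "options", default=[])
command = build_command("bwa", "aln", opts)
print(" ".join(command))
```

The input and output file arguments are left out here, just as they are elided in the command shown above.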

All tasks have a default requirement, which I call parent_task. In the current implementation, all tasks subclass BaseJobTask, which provides a parent_task class variable. This variable can be changed either on the command line (option --parent-task) or in a configuration file. The parent_task variable is a string naming a class in a Python module, so it can point to any Python code of your choice. As an example, by default all ratatosk.lib.tools.picard tasks have ratatosk.lib.tools.picard.InputBamFile as their parent task. This can easily be modified in the config file:

    ratatosk.lib.tools.picard:
      InputBamFile:
        parent_task: ratatosk.lib.tools.samtools.SamToBam
      HsMetrics:
        targets: targets.interval_list
        baits: targets.interval_list
      DuplicationMetrics:
      AlignmentMetrics:
      InsertMetrics:

    ratatosk.lib.tools.samtools:
      SamToBam:


Note also that ratatosk.lib.tools.picard.InputBamFile has been changed to depend on ratatosk.lib.tools.samtools.SamToBam (default value is ratatosk.lib.files.external.BamFile).
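A minimal sketch of how a configured parent_task could take precedence over a task's default is given below. The function `effective_parent_task` is a hypothetical helper for illustration, not part of ratatosk; the dictionary mimics what a YAML loader would return, including `None` for a subsection that has no entries.

```python
# Parsed configuration; an empty subsection is loaded from YAML as None.
config = {
    "ratatosk.lib.tools.picard": {
        "InputBamFile": {"parent_task": "ratatosk.lib.tools.samtools.SamToBam"},
        "DuplicationMetrics": None,
    }
}


def effective_parent_task(config, section, task, default):
    """Return the configured parent_task, falling back to the task default."""
    # "or {}" guards against missing sections and empty (None) subsections.
    task_conf = (config.get(section) or {}).get(task) or {}
    return task_conf.get("parent_task", default)


# InputBamFile is overridden in the configuration.
print(effective_parent_task(
    config, "ratatosk.lib.tools.picard", "InputBamFile",
    default="ratatosk.lib.files.external.BamFile"))

# DuplicationMetrics has no override, so its default is kept.
print(effective_parent_task(
    config, "ratatosk.lib.tools.picard", "DuplicationMetrics",
    default="ratatosk.lib.tools.picard.InputBamFile"))
```

How a value given with --parent-task on the command line ranks against the configuration file is not covered by this sketch.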

## Resolving dependencies

The previous examples have assumed that tasks have one parent task. However, many applications depend on more than one input (Figure 1).

Figure 1. Excerpt from variant calling pipeline

Therefore, the parent_task variable can also be a list of tasks. For instance, in Figure 1, the dependencies for PrintReads would be defined by the following configuration:

    ratatosk.lib.tools.gatk:
      PrintReads:
        parent_task:
          - ratatosk.lib.tools.gatk.DuplicationMetrics
          - ratatosk.lib.tools.gatk.BaseRecalibrator
          - ratatosk.lib.tools.gatk.PicardMetrics


Note

Setting additional parent tasks works only if 1) the task is a wrapper task that generates its targets solely from its own parents, or 2) it uses the same target as the first (default) parent task.
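Because parent_task may be either a single name or a list of names, code that consumes it has to accept both forms. The helper below, `as_task_list`, is a hypothetical illustration of that normalization; how ratatosk handles it internally is not shown in this section.

```python
def as_task_list(parent_task):
    """Return parent_task as a list, whether one name or several were given."""
    if parent_task is None:
        return []
    if isinstance(parent_task, str):
        return [parent_task]
    return list(parent_task)


single = as_task_list("ratatosk.lib.tools.gatk.BaseRecalibrator")
several = as_task_list([
    "ratatosk.lib.tools.gatk.DuplicationMetrics",
    "ratatosk.lib.tools.gatk.BaseRecalibrator",
    "ratatosk.lib.tools.gatk.PicardMetrics",
])
print(len(single), len(several))
```

Keeping the order of the list matters, since the note above treats the first parent task as the default one.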

## Generating source names

Warning

The current implementation is confusing and will have to be reimplemented. See Source name generation.

Every class has a requires method that returns a list of parent tasks on which the current task depends. ratatosk dynamically loads the classes based on the names in parent_task and generates the required target names for the parent task in the method _make_source_file_name.
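Loading a class from a dotted name, as described above, can be done with the standard library. The following is a sketch using `importlib`; the function name `load_class` is an assumption made for illustration and is not necessarily how ratatosk implements it. A standard-library class is used in the example so that it runs without ratatosk installed.

```python
import importlib


def load_class(dotted_name):
    """Import the module part of dotted_name and return the named attribute."""
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {dotted_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for a parent_task string such as "ratatosk.lib.align.bwa.Aln".
cls = load_class("collections.OrderedDict")
print(cls.__name__)
```

A misspelled module raises `ImportError` and a misspelled class raises `AttributeError`, so errors in a configuration file surface as soon as the name is resolved.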

The procedure is best explained with an example. Consider Figure 2, which is a simplified representation of Figure 1, but with target file names in the boxes.

Figure 2. Excerpt from variant calling pipeline with target names. A dummy task has been added to illustrate a case where a parent has a label that should be removed from the child target name (e.g. for read suffixes in paired-end reads).

First, many tasks add labels to their output. Hence, every task has an attribute label. When the source file name is generated, the parent label is removed from the current task target name (for example, file.dup.realign.bam -> file.dup.bam). Second, in cases where there is a dependency on an ancestor task (DuplicationMetrics above), several labels should be removed. This is currently done with the attribute diff_label. Finally, some labels should be removed going from parent to child, or equivalently added going “upwards”. Hence, the attribute add_label.
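The three attributes can be illustrated with plain string manipulation. This is a toy sketch and not the actual _make_source_file_name implementation: the function names, the assumption that diff_label and add_label are sequences of strings, the choice to insert added labels just before the file extension, and the `_R1` suffix are all made up for the example.

```python
import os


def strip_label(name, label):
    """Remove the last occurrence of label from name, if present."""
    head, _, tail = name.rpartition(label)
    return head + tail


def make_source_name(target, label=None, diff_label=(), add_label=()):
    """Derive a source (parent) file name from a target file name."""
    source = target
    # Remove the label first, then any extra labels listed in diff_label.
    for lab in [label, *diff_label]:
        if lab:
            source = strip_label(source, lab)
    # Going "upwards", labels in add_label are put back before the extension.
    if add_label:
        stem, ext = os.path.splitext(source)
        source = stem + "".join(add_label) + ext
    return source


print(make_source_name("file.dup.realign.bam", label=".realign"))
print(make_source_name("file.dup.realign.bam", label=".realign", diff_label=[".dup"]))
print(make_source_name("file.bam", add_label=["_R1"]))
```

The first call reproduces the file.dup.realign.bam -> file.dup.bam example from the text; the other two show the effect of removing an additional label and of adding one.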

Confusing? Yes.