Here we describe how intermediate steps are handled by ete-build and how to access the intermediate files and data generated by the different tasks and jobs.

ete-build is a wrapper tool that pipelines the execution of multiple external programs. Although a final tree and alignment are reported as the main result of a workflow, many other operations take place in the background.
In order to understand how ete-build works, consider the following facts:
- A workflow is composed of many Tasks.
- Each Task is composed of zero or more Jobs, plus some post-analysis operations (e.g. parsing, cleaning).
- Each Job represents a call to an external program.
  For example, when running a workflow that uses the `mafft_linsi` aligner, this is translated into a Mafft task that calls the `mafft` binary with the arguments and options defined under the `mafft_linsi` configuration block.
- Each Job in a task has a unique hash ID built from its input data, program type and program arguments. A minimal change in one of the options generates a different job ID (see the sketch after this list).
- Similarly, each Task is assigned a unique hash ID based on the configuration of the task and the IDs of its jobs.
- All task and job IDs, as well as the resulting data, are stored in an SQLite database. A unique ID is also assigned to each piece of data generated (e.g. a multiple sequence alignment, a tree or a trimmed alignment).
- Altogether, this system is what permits reusing previous results when resuming an analysis: if a newly registered task or job is already present in the database, its stored output is reused instead of being recomputed.
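
The exact hashing scheme is internal to ete-build, but the idea behind job IDs can be illustrated with a minimal Python sketch: hashing the input data together with the program name and its options means that changing any of them yields a new ID and, therefore, a cache miss. The `job_hash_id` function below is purely illustrative, not ete-build's actual code:

```python
import hashlib

def job_hash_id(input_data: bytes, program: str, args: list) -> str:
    """Hypothetical helper: derive a stable ID from input data, program and options."""
    h = hashlib.md5()
    h.update(input_data)
    h.update(program.encode())
    h.update(" ".join(args).encode())
    return h.hexdigest()

# Same input, different options -> different job IDs, so a previously cached
# result cannot be reused and the job is recomputed.
seqs = b">seq1\nMKTAYIAK\n>seq2\nMKVAYLAK\n"
print(job_hash_id(seqs, "mafft", ["--localpair", "--maxiterate", "1000"]))
print(job_hash_id(seqs, "mafft", ["--localpair", "--maxiterate", "500"]))
```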
There are several ways to access this intermediate data:
After execution, a file called `commands.log` will be present in the output directory. It has the following tab-delimited format:

| TaskType | TaskID | JobName | JobID | command line used (if relevant) |
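
For example, a short Python snippet can list every job recorded in `commands.log`. The output directory name is a placeholder, and the column handling assumes exactly the format described above:

```python
import csv
import os

output_dir = "my_run"  # adjust to your own ete-build output directory

with open(os.path.join(output_dir, "commands.log")) as handle:
    for row in csv.reader(handle, delimiter="\t"):
        # Columns: TaskType, TaskID, JobName, JobID, command line (if relevant)
        task_type, task_id, job_name, job_id = row[:4]
        cmd = row[4] if len(row) > 4 else ""
        print(f"{task_type:<12} {job_name:<15} {job_id}  {cmd}")
```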
All intermediate operations take place in the `tasks/` directory. Within `tasks/`, each Job and some Tasks store and process their intermediate data; the names of the subdirectories correspond to Job or Task IDs.

The `input/` folder is used to dump previously generated data that is stored in the database and required as input for other tasks. If a Job requires data files generated in previous tasks, those files are referred to by their corresponding data IDs and are dumped into the `input/` directory when necessary.
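
Putting this together, once a Job or Task ID is known (for instance, from `commands.log`), its working directory can be located under `tasks/`. A minimal sketch, assuming the directory layout described above (the output directory and the ID are placeholders):

```python
import os

output_dir = "my_run"        # your ete-build output directory
job_id = "d41d8cd98f00b204"  # a Job or Task ID taken from commands.log (placeholder)

job_dir = os.path.join(output_dir, "tasks", job_id)
if os.path.isdir(job_dir):
    # Intermediate files produced by that Job or Task live here.
    print(sorted(os.listdir(job_dir)))
else:
    print("No directory found for this ID.")
```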
All job directories follow the same basic structure (a usage sketch follows this list). They contain:

- `__cmd__`: the command line used to launch the job
- `__stdout__` and `__stderr__`: files capturing the job output
- `__time__`: a file recording the start and finish times of the job
- `__status__`: a file reporting a single-letter status indicating whether the job is (D)one, (R)unning or has (E)rrors
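
As a usage example, the sketch below walks over every subdirectory of `tasks/` and prints each job's status letter together with the command line that was run. It assumes the `__status__` and `__cmd__` files are plain text with the relevant information on their first line, which is an assumption rather than something guaranteed by ete-build:

```python
import os

output_dir = "my_run"  # your ete-build output directory
tasks_dir = os.path.join(output_dir, "tasks")

def first_line(path):
    """Return the stripped first line of a file, or '' if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except OSError:
        return ""

for entry in sorted(os.listdir(tasks_dir)):
    job_dir = os.path.join(tasks_dir, entry)
    if not os.path.isdir(job_dir):
        continue
    status = first_line(os.path.join(job_dir, "__status__"))  # D, R or E
    cmd = first_line(os.path.join(job_dir, "__cmd__"))
    print(f"[{status or '?'}] {entry}: {cmd}")
```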