Here we describe how intermediate steps are handled by ete-build and how to access the intermediate files and data generated by the different tasks and jobs.

ete-build is a wrapper tool that pipelines the execution of multiple external programs. Although a final tree and alignment are reported as the main result of a workflow, many other operations take place in the background.
In order to understand how ete-build works, consider the following facts:
- A workflow is composed of many Tasks.
- Each Task is composed of zero or more Jobs, plus some post-analysis operations (e.g. parsing, cleaning).
- Each Job represents a call to an external program.
  For example, when running a workflow that uses the `mafft_linsi` aligner, this is translated into a Mafft task that calls the `mafft` binary with the arguments and options defined under the `mafft_linsi` configuration block.
- Each Job in a task has a unique hash ID built from its input data, program type and program arguments. A minimal change in one of the options generates a different job ID (see the sketch after this list).
- Similarly, each Task is assigned a unique hash ID based on the configuration of the task and the IDs of its jobs.
- All task and job IDs, as well as the resulting data, are stored in an SQLite database. A unique ID is also assigned to each piece of data generated (e.g. a multiple sequence alignment, a tree or a trimmed alignment).
- Altogether, this system is what permits reusing previous results when resuming an analysis: if a newly registered task or job is already present in the database, its stored output is reused instead of being recomputed.
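
The exact hashing scheme is internal to ete-build, but the idea behind job IDs can be illustrated with a minimal Python sketch: hashing the input data together with the program name and its options means that changing any of them yields a new ID and, therefore, a cache miss. The `job_hash_id` function below is purely illustrative, not ete-build's actual code:

```python
import hashlib

def job_hash_id(input_data: bytes, program: str, args: list) -> str:
    """Hypothetical helper: derive a stable ID from input data, program and options."""
    h = hashlib.md5()
    h.update(input_data)
    h.update(program.encode())
    h.update(" ".join(args).encode())
    return h.hexdigest()

# Same input, different options -> different job IDs, so a previously cached
# result cannot be reused and the job is recomputed.
seqs = b">seq1\nMKTAYIAK\n>seq2\nMKVAYLAK\n"
print(job_hash_id(seqs, "mafft", ["--localpair", "--maxiterate", "1000"]))
print(job_hash_id(seqs, "mafft", ["--localpair", "--maxiterate", "500"]))
```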
There are several ways to access this intermediate data:
After execution, a file called `commands.log` will be present in the output directory. It has the following tab-delimited format:

| TaskType | TaskID | JobName | JobID | command line used (if relevant) |
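
For example, a short Python snippet can list every job recorded in `commands.log`. The output directory name is a placeholder, and the column handling assumes exactly the format described above:

```python
import csv
import os

output_dir = "my_run"  # adjust to your own ete-build output directory

with open(os.path.join(output_dir, "commands.log")) as handle:
    for row in csv.reader(handle, delimiter="\t"):
        # Columns: TaskType, TaskID, JobName, JobID, command line (if relevant)
        task_type, task_id, job_name, job_id = row[:4]
        cmd = row[4] if len(row) > 4 else ""
        print(f"{task_type:<12} {job_name:<15} {job_id}  {cmd}")
```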
All intermediate operations take place in the `tasks/` directory. Within `tasks/`, each Job and some Tasks store and process their intermediate data; the names of the subdirectories correspond to Job or Task IDs.

The `input/` folder is used to dump previously generated data that is stored in the database and required as input for other tasks. If a Job requires data files generated in previous tasks, those files are referred to by their corresponding data IDs and are dumped into the `input/` directory when necessary.
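
Putting this together, once a Job or Task ID is known (for instance, from `commands.log`), its working directory can be located under `tasks/`. A minimal sketch, assuming the directory layout described above (the output directory and the ID are placeholders):

```python
import os

output_dir = "my_run"        # your ete-build output directory
job_id = "d41d8cd98f00b204"  # a Job or Task ID taken from commands.log (placeholder)

job_dir = os.path.join(output_dir, "tasks", job_id)
if os.path.isdir(job_dir):
    # Intermediate files produced by that Job or Task live here.
    print(sorted(os.listdir(job_dir)))
else:
    print("No directory found for this ID.")
```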
All job directories follow the same basic structure (a usage sketch follows this list). They contain:

- `__cmd__`: the command line used to launch the job
- `__stdout__` and `__stderr__`: files capturing the job output
- `__time__`: a file recording the start and finish times of the job
- `__status__`: a file reporting a single-letter status indicating whether the job is (D)one, (R)unning or has (E)rrors
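
As a usage example, the sketch below walks over every subdirectory of `tasks/` and prints each job's status letter together with the command line that was run. It assumes the `__status__` and `__cmd__` files are plain text with the relevant information on their first line, which is an assumption rather than something guaranteed by ete-build:

```python
import os

output_dir = "my_run"  # your ete-build output directory
tasks_dir = os.path.join(output_dir, "tasks")

def first_line(path):
    """Return the stripped first line of a file, or '' if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except OSError:
        return ""

for entry in sorted(os.listdir(tasks_dir)):
    job_dir = os.path.join(tasks_dir, entry)
    if not os.path.isdir(job_dir):
        continue
    status = first_line(os.path.join(job_dir, "__status__"))  # D, R or E
    cmd = first_line(os.path.join(job_dir, "__cmd__"))
    print(f"[{status or '?'}] {entry}: {cmd}")
```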