Pipeline Configuration

This section describes how to create end-user editable configuration files for the parameters, variables and environment of your pipelines. Configuration is a key topic for computational pipelines because a pipeline usually needs to be generalised: to run on different types of input data, with different computational resources, with different sets of reference data or other static resources, or with any of the many other settings that can differ between the contexts in which the pipeline may run. You therefore need a way to specify useful default values while allowing the end user to customise those values easily and flexibly.

Bpipe offers features to achieve this.

Precedence of Configuration Files

Although Bpipe offers several different ways to reference configuration settings, a common pattern determines which value is used when more than one is available to choose from:

  • if a setting is provided on the command line when Bpipe is invoked, this is used
  • otherwise, Bpipe will look for the setting in any configuration loaded from the local directory where the pipeline is running
  • otherwise, Bpipe will try to load the setting from configuration set inside the directory where the pipeline file is located
  • finally, Bpipe will load values from defaults that may be stored in .bpipeconfig in the user's home directory

With this in mind, if you are the pipeline author, the most common pattern is to put configuration settings into files that are stored in the same directory as your pipeline files. This way, the user can override them by adding configuration files into the directory where they are running the pipeline, or specify overrides on the command line.
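
As a minimal sketch of this layering, assume the pipeline author ships a bpipe.config next to the pipeline and the user adds a second bpipe.config in the directory where they run it. The file paths, the THREADS name and all values here are hypothetical, and the use of a parameters block for both files is an assumption made for illustration:

```groovy
// --- /some/path/to/pipeline/bpipe.config (hypothetical; shipped by the pipeline author) ---
parameters {
    REF = '/some/reference/file.fasta'   // default reference data
    THREADS = 4                           // hypothetical default
}

// --- ./bpipe.config (hypothetical; created by the user in the run directory) ---
parameters {
    THREADS = 16   // intended to take priority over the author's default
}
```

Under the order listed above, a value given on the command line (for example with -p THREADS=32) would be preferred over both files, while REF would fall back to the author's default because the user did not override it.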

In general, for user specific global preferences (for example, the user's email address), you would leave this as a setting for them to provide in their ~/.bpipeconfig file.

The Bpipe Config File

The main configuration file for your script's settings should be the bpipe.config file. This file can be placed in the same directory as your pipeline file. In it you can configure both how commands are executed (directly on a server, or on an HPC cluster, for example) and the environment and settings those commands use when they execute.

Tools

You can set the default location of some common tools that are used in computational workflows by setting their locations in the bpipe.config file:

  • groovy
groovy {
    executable='/some/path/to/groovy/binary'
}
  • python
python {
    executable='/some/path/to/python/binary'
}
  • conda (anaconda executable)
conda {
    executable='/some/path/to/conda/binary'
}
  • R
R {
    executable='/some/path/to/Rscript/binary'
}

These configurations are used when the corresponding inline scripting functions are invoked within Bpipe pipelines. In some cases, relevant environment variables are also inferred and set for your commands.
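
For illustration, the sketch below shows a pipeline stage that would rely on the R setting above. It assumes Bpipe's inline R block takes a triple-quoted script inside braces; the stage name and the R code itself are hypothetical:

```groovy
// bpipe.config is assumed to contain, as above:
//   R { executable='/some/path/to/Rscript/binary' }

// Hypothetical stage that runs a small inline R script
summarise = {
    R {"""
        x <- c(1, 2, 3)
        print(mean(x))
    """}
}

run { summarise }
```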

Command Configuration

The resources allocated to any given job often need to be customised to suit either the particular compute environment a pipeline is running in, or the data that is being analysed.

To customise configuration for these settings, you can create a commands section in your bpipe.config file. The options for configuring commands are described in detail in Resource Managers.
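
A minimal sketch of such a section follows. The memory setting uses the same form as the environments example later in this section; the procs and walltime option names and all values are assumptions for illustration, so check Resource Managers for the options your executor supports:

```groovy
// bpipe.config (illustrative values only)
commands {
    // Settings for commands run under the configuration name "bwa"
    bwa {
        procs = 8               // assumed option name for the number of CPUs
        memory = "32g"          // memory requested for the job
        walltime = "04:00:00"   // assumed option name and time format
    }
}
```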

Environment Variables

Environment variables can be set for commands using an env block. The block can be placed at the top level to apply globally, or within the section for a specific command:

commands {
    vep {
        env {
            PERL5LIB="/home/perl/lib"
        }
    }
}
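
The two placements can be combined. In the sketch below the TMPDIR variable and its path are hypothetical, and the top-level env block is assumed to apply to every command the pipeline runs:

```groovy
// bpipe.config (sketch; paths are placeholders)

// Global: assumed to be set for all commands
env {
    TMPDIR = "/scratch/tmp"
}

commands {
    vep {
        // Specific: set only for commands using the vep configuration
        env {
            PERL5LIB = "/home/perl/lib"
        }
    }
}
```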

Per-Environment Settings

Groovy offers a standard way to support multiple environments within a single configuration file. For example, you can keep different settings for development, test and production within one bpipe.config file. To do so, create an environments block in your bpipe.config containing one block per environment, each holding the overrides for that environment. To select an environment, pass the --env flag to Bpipe and provide the environment name.


SERVER='http://dev.server/'

environments {
   prod {
      parameters {
          SERVER='http://prod.server/'
          commands {
              bwa {
                memory="32g"
              }
          }
      }
   }
   test {
      parameters {
          SERVER='http://test.server/'
      }
   }
}
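
To show how a pipeline might consume these settings, here is a hedged sketch of a stage that reads SERVER. The stage name and the curl command are hypothetical, and the invocations in the comments assume that --env takes the environment name as its next argument:

```groovy
// pipeline.groovy (hypothetical stage that reads the SERVER setting)
ping_server = {
    requires SERVER : "Base URL of the server to contact"
    exec "curl -sf $SERVER"
}

run { ping_server }

// Assumed invocations:
//   bpipe run pipeline.groovy              (no environment selected; the default SERVER applies)
//   bpipe run --env prod pipeline.groovy   (selects the overrides in the prod block)
```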

Bootstrap Configuration

The ability to use many different configuration files, together with the ability to use arbitrary classes (including Java libraries) within them, creates some new problems when these files are combined in complex configurations:

  • how can you define common values that are used across the configuration files without duplicating their definition in each file?
  • how can you use your own libraries within the configuration files? For example, you may want to build your configuration through API calls, parsing files in custom formats or through database access. Or you may wish to define a configuration model using type safe classes.

To help solve this type of "meta configuration" problem, Bpipe provides a special configuration file option with unique behaviour: the "bootstrap" configuration, which is named bpipe.bootstrap.config, and must be located in the same directory as your main pipeline file. This file is unique in that:

  • it is loaded before any other configuration files
  • any libs defined in the bootstrap configuration are then available to the other configuration files when they are loading
  • variable definitions defined in the parameters block of the bootstrap configuration are accessible directly as variables within the other configuration files

Using this technique, you can make core common values or libraries that you want to be accessible across all other layers of configuration available - in some ways, you can think of it as "configuration configuration".

Note that values from the bootstrap configuration are also merged into the final configuration used by the pipeline, at lower precedence than the pipeline-level configuration: if the pipeline configuration defines a value, it is used downstream in preference to the bootstrap value.
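
The following sketch illustrates the idea with two files. The jar path, the RESOURCE_DIR name and its value are hypothetical, and giving libs as a single path string is an assumption about the accepted form:

```groovy
// --- bpipe.bootstrap.config (in the same directory as the main pipeline file) ---

// Hypothetical jar whose classes the other configuration files may use
libs = '/some/path/to/config-helpers.jar'

parameters {
    // Hypothetical shared value, defined once
    RESOURCE_DIR = '/some/path/to/resources'
}

// --- bpipe.config (loaded after the bootstrap file) ---
parameters {
    // RESOURCE_DIR is referenced directly because it was defined
    // in the parameters block of the bootstrap configuration
    REF = "$RESOURCE_DIR/file.fasta"
}
```

Keeping shared values in the bootstrap file means they are defined in one place, while any layer that needs a different value can still define its own.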

Loading Configuration Directly

While most configuration can be accomplished with a bpipe.config file, you may prefer to separate configuration variables from the runtime execution configuration. A simple way to do this is to use Bpipe's load statement to load a separate file, commonly called config.groovy, stored in the same directory as your pipeline files.

For example you can have config.groovy:

REF='/some/reference/file.fasta'

And then a pipeline that makes use of the configured values:

load 'config.groovy'

do_something = {
    requires REF : "The reference file to use"
    exec "sometool -R $REF"
}

run {
    do_something
}

The user can still override the configuration in config.groovy by supplying a config.groovy in their own runtime folder, or they can override values individually when launching Bpipe:

bpipe run -p REF=/a/different/file.fasta pipeline.groovy ...