Variables in Bpipe

Bpipe supports variables inside Bpipe scripts. There are two kinds of variables: implicit variables and explicit variables.

Implicit Variables

Implicit variables are special variables that are made available to your Bpipe pipeline stages automatically. The two most important implicit variables are input and output.

The input and output variables are how Bpipe automatically connects tasks together to make a pipeline. The default input to a stage is the output from the previous stage. Wherever possible, use these variables instead of hard-coding file names into your commands. Doing so ensures that your tasks are reusable and can be joined together to form flexible pipelines.
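
As a minimal sketch of how this chaining works, the two stages below never name a file directly. The stage names sort_lines and count_lines and the commands they run are hypothetical, chosen only for illustration:

  sort_lines = {
    // $input is the file given to the pipeline, $output is a name chosen by Bpipe
    exec "sort $input > $output"
  }

  count_lines = {
    // $input here is the $output of sort_lines
    exec "wc -l $input > $output"
  }

  run { sort_lines + count_lines }

Because neither stage hard-codes a file name, the same stages could be reused in another pipeline without modification.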

Extension Syntax for Input and Output Variables

Bpipe provides a special syntax for easily referencing inputs and outputs with specific file extensions. See ExtensionSyntax for more information.
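
For illustration only, and assuming the extension syntax behaves as described in ExtensionSyntax, a stage that expects a gzipped input and produces a text output might be sketched as follows. The stage name decompress and the command are hypothetical:

  decompress = {
    // $input.gz asks for an input ending in .gz,
    // $output.txt asks for an output ending in .txt
    exec "gunzip -c $input.gz > $output.txt"
  }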

Multiple Inputs

Different tasks have different numbers of inputs and outputs, so what happens when a stage with multiple outputs is joined to a stage with only a single input? Bpipe's goal is to make things work no matter what stages you join together. One part of this is the inputs variable, which a stage can loop over when it needs to process every file it receives:


  for(i in inputs) {
    exec "... some command that reads $i ..."
  }
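
The loop above is only a fragment. A complete stage built around it might look like the following sketch, where the stage names index_all and merge_all and the commands are hypothetical. The second stage assumes that $inputs, used inside a command string, expands to all of the input file names separated by spaces:

  index_all = {
    // run one command per input file
    for(i in inputs) {
      exec "samtools index $i"
    }
  }

  merge_all = {
    // pass every input to a single command (assumed expansion of $inputs)
    exec "cat $inputs > $output"
  }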

Explicit Variables

Explicit variables are ones you define yourself. These variables are created inline inside your Bpipe scripts using Groovy (Java-like) syntax. They can be defined inside your tasks, or outside of your tasks to share them between tasks. For example, here two variables are defined and shared between two tasks:


  NUMTHREADS=8
  REFERENCE_GENOME="/data/hg19/hg19.fa"

  align_reads = {
    exec "bwa aln -t $NUMTHREADS $REFERENCE_GENOME $input"
  }

  call_variants = {
    exec "samtools mpileup -uf $REFERENCE_GENOME $input > $output"
  }

NOTE 1: variables defined in this way have global scope and are modifiable. This becomes important if you have parallel stages in your pipeline, because modifications to such variables can result in race conditions, deadlocks and all the usual ills that befall multithreaded programming. For this reason, it is strongly recommended that you treat any such variables as constants which you assign once and then reference as read-only variables in the remainder of your script.

NOTE 2: explicit variables can be assigned inside your own pipeline stages. However, in current Bpipe they are assigned to the global environment, so a variable assigned inside a pipeline stage is not private to that stage. If you want a variable to be private to a pipeline stage, prefix it with 'def'. If you want to share it with other pipeline stages in the same branch (which also confines it to a single thread), assign it with the 'branch.' prefix:


  align_reads = {

    num_threads=8 // bad idea, this is a global variable!

    def thread_num=8 // good idea, this is private

    branch.thread_num = 8 // good idea if you want the variable to be visible 
                          // to other stages in the same branch

    exec "bwa aln -t $thread_num $REFERENCE_GENOME $input" // uses the private variable
  }
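
To illustrate the branch form, here is a sketch in which one stage sets a branch variable and a later stage in the same branch reads it. The stage names, the variable name sample and its value are all hypothetical, and the sketch assumes the variable can be read back through the branch object:

  set_sample = {
    branch.sample = "sampleA"   // visible to later stages in this branch
  }

  report_sample = {
    exec "echo ${branch.sample} > $output"
  }

  run { set_sample + report_sample }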

Variable Evaluation

Most of the time you will use variables by referencing them inside shell commands that you run with the exec statement. Such statements define the command using single quotes, double quotes or triple quotes. Each kind of quote handles variable expansion slightly differently:

Double quotes cause variables to be expanded before they are passed to the shell. So if input is "myfile.txt" the statement:


  exec "echo $input"

will reach the shell as exactly:


  echo myfile.txt

Since the shell sees no quotes around "myfile.txt" this will fail if your file name contains spaces or other characters treated specially by the shell. To handle that, you could embed single quotes around the file name:


  exec "echo '$input'"

Single quotes cause variables to be passed through to the shell without expansion. Thus


  exec 'echo $input'

will reach the shell as exactly:


  echo $input

Triple quotes are useful because they accept embedded newlines. This allows you to format long commands across multiple lines without laborious escaping of newlines. Triple quotes expand variables in the same way as double quotes, but they also allow you to embed double quotes in your commands, which are passed through to the shell. Hence another way to solve the problem of spaces above would be to write the statement as:


  exec """
    echo "$input"
  """

See the exec statement for a longer example of using triple quotes.

Referencing Variables Directly

Inside a task, variables can also be referenced directly using Groovy (Java-like) syntax. In this example Groovy code is used to check whether the input file already exists:


  mytask = {
      // Groovy / Java code!
      if(new File(input).exists()) {
        println("File $input already exists!")
      }
  }

You won't normally need to use this kind of syntax in your Bpipe scripts, but it is available if you need it to handle complicated or advanced scenarios.

Differences from Bash Variable Syntax

Bpipe's variable syntax is mostly compatible with the same syntax in languages like Bash and Perl. This is very convenient because it means that you can copy and paste commands directly from your command line into your Bpipe scripts, even if they use environment variables.

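However there are some small differences between Bpipe variable syntax and Bash variable syntax. In particular, a $ that is not followed by a valid variable name has to be escaped. Consider this statement, which uses shell command substitution: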


  exec "for i in $(ls **.bam); do samtools index $i; done"

In this case the $ followed by the opening parenthesis is illegal because Bpipe will try to interpret it as a variable reference. We can fix this by escaping the $ with a backslash:


  exec "for i in \$(ls **.bam); do samtools index $i; done"
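
The same escaping should apply to other shell constructs in which a $ is not followed by a valid variable name, such as awk field references. In the hypothetical stage below, \$1 is intended to reach awk as $1, while $input and $output are still expanded by Bpipe:

  first_column = {
    // \$1 is escaped so that it is not parsed as a variable reference
    exec "awk '{ print \$1 }' $input > $output"
  }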

Environment Variables

If you reference a variable that has no definition, Bpipe evaluates it to its own name, prefixed by a $ sign. That is, inside an exec command, "$USER" would evaluate to "$USER" if undefined. This has the effect that variables are "passed through" as environment variables in commands. For example:

    exec """
        echo $USER
    """

Even though USER is not defined as a Bpipe variable, this will print the value of the USER environment variable. Note that Bpipe does not substitute the value of such a variable at all. The substitution is done by bash when the command is executed. Hence, if you are running commands using a computational cluster or similar, the value of the environment variable must be defined for jobs that run on the node where the command is executed.
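
If the environment on the execution node cannot be relied upon, one possible workaround is to read the environment variable on the machine where Bpipe itself runs and store it in an explicit variable, so that Bpipe substitutes the value before the command is dispatched. This is a sketch rather than a recommended practice. The variable name SUBMIT_USER and the stage name show_user are hypothetical, and System.getenv is the standard Java call, which is available from Groovy:

    SUBMIT_USER = System.getenv("USER")   // read once, where Bpipe is launched

    show_user = {
        // Bpipe expands $SUBMIT_USER itself, so the node's environment is not consulted
        exec "echo $SUBMIT_USER"
    }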