How Trellis uses dsub

As part of a review of Million Veteran Program data security practices, our team at Stanford wanted to verify that data within Google Cloud Platform was encrypted at all times, meaning both at rest and in transit.

MVP uses Trellis, “a cloud-based data management framework … that uses a graph database to automatically track data and coordinate jobs,” to handle the processing of genomic data after it is deposited in a Google Cloud Storage (GCS) bucket by the sequencing company. Here, “cloud-based” means not only that the application interacts with data in the cloud, but more broadly that it lives in the cloud:

The application follows a microservice architecture where each service is implemented as a serverless function and the state of the system is tracked using metadata stored in a Neo4j property graph database.

Another way of saying this is that Trellis is a distributed application, meaning it does not run as a process on a single server or even a fixed number of servers. Although this complicates the application architecture, it allows Trellis to use “cloud-native technologies optimized for massively parallel operations” and “automatically scale to meet the demands of large-scale analyses.” The combination of a distributed architecture and the use of

As an aside, Trellis is both an application and a model for a type of application. At Stanford, it is implemented on Google Cloud Platform (GCP) services, but it could just as well be built with analogous services from another cloud provider or even in a private on-premises cloud.

In any case, Trellis as implemented at Stanford uses several GCP services:

  • Cloud Functions - to implement application logic
  • Cloud Storage - to store data
  • Container Registry - to store Docker images used by dsub
  • Pub/Sub - to relay messages between stateless cloud functions

In our security review, the first question we asked was whether data was encrypted when being transferred between two cloud services. We received assurance from Google that data exchanged between GCP services was encrypted in transit. However, we knew that within Trellis much of the actual processing of the data is handled by open-source bioinformatics utilities, which are in turn coordinated by an open-source job scheduler, dsub.

dsub is a “tool that makes it easy to submit and run batch scripts in the cloud … with Docker.” We wanted to verify that dsub handled data securely as well. Fortunately, dsub is written to be used with Google Cloud Storage, and all exchanges of data within dsub are handled by gsutil:

With dsub, your input files reside in a Google Cloud Storage bucket and your output files will also be copied out to Cloud Storage. When you submit a job with dsub:

  • Your input files will be automatically copied from bucket paths to local disk.
  • Your code will work on the local file system inside the Docker container.
  • Your output files will be automatically copied from local disk back to bucket paths.
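The copy-in, run, copy-out flow described above can be pictured with a small sketch. It only computes where a bucket object would land on the worker's local disk. The `/mnt/data` mount point and the `input/gs/...` layout reflect dsub's convention as I understand it and should be read as an assumption, and the helper name `local_path_for` is my own.

```python
from urllib.parse import urlparse

# Assumed mount point of the data disk inside the container.
DATA_ROOT = "/mnt/data"


def local_path_for(gcs_url: str, direction: str) -> str:
    """Map a gs:// URL to the local path a task would read or write.

    direction is "input" or "output"; the directory layout is an assumption.
    """
    parsed = urlparse(gcs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URL: {gcs_url}")
    if direction not in ("input", "output"):
        raise ValueError(f"unknown direction: {direction}")
    # parsed.netloc is the bucket, parsed.path is the object path with a leading slash.
    return f"{DATA_ROOT}/{direction}/gs/{parsed.netloc}{parsed.path}"


# Placeholder bucket and object names, for illustration only.
example_input = local_path_for("gs://example-bucket/plate1/sample1.cram", "input")
```

The point for the security review is that the task's code only ever sees local paths; the bucket transfers happen outside the script.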

I was only familiar with Trellis’s architecture at the high level described in the paper introducing it and still had a few questions about how dsub was being used:

  • Which Storage buckets will dsub be directed by Trellis to use?
  • Where are the Docker images dsub uses located?
  • How do the programs dsub is calling get loaded onto the Docker images?

I searched the code repository for dsub and found it was called from a number of cloud functions:

All of these cloud functions follow a very similar pattern, so I will use launch-cnvnator as a representative example for this blog post. Lots of code below, so feel free to skim, but I wanted to include it for documentation & future use. At line 141, there is a function called launch_dsub_task(dsub_args):

def launch_dsub_task(dsub_args):
    try:
        result = dsub.dsub_main('dsub', dsub_args)
    ...
    return(result)

This function is called exactly once, at line 299:

dsub_result = launch_dsub_task(dsub_args)

The parameter dsub_args is a list of command-line flags and their values, defined just previously, at line 253:

dsub_args = [
    "--name", f"{task_name}-{job_dict['inputHash'][0:5]}",
    ...
    "--image", job_dict["image"],
    ...
    "--script", job_dict["script"],
    "--input-recursive", job_dict["inputRecursive"]
]
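Because the list alternates flags and values, it reads like a command line. A hedged sketch of how a job description could be flattened into such a list follows. The loops over inputs and outputs are my guess at what the elided lines do, not a copy of the Trellis source; `--input` and `--output` are dsub flags that take `NAME=URL` values, and every name and URL in the example is a placeholder.

```python
def build_dsub_args(job: dict) -> list:
    """Flatten a job description into a dsub-style argument list."""
    args = [
        "--name", job["name"],
        "--image", job["image"],
        "--script", job["script"],
    ]
    # Each input and output becomes its own flag with a NAME=URL value;
    # dsub exposes NAME to the script as an environment variable.
    for name, url in job.get("inputs", {}).items():
        args.extend(["--input", f"{name}={url}"])
    for name, url in job.get("outputs", {}).items():
        args.extend(["--output", f"{name}={url}"])
    return args


# Hypothetical job description; bucket and image names are placeholders.
example_job = {
    "name": "cnvnator-abc12",
    "image": "gcr.io/example-project/cnvnator:0.4.1",
    "script": "gs://example-trellis/functions/launch-cnvnator/CNVnator.sh",
    "inputs": {"BAM": "gs://example-in/sample.cram"},
    "outputs": {"CALL_VCF": "gs://example-out/sample.vcf"},
}
example_args = build_dsub_args(example_job)
```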

The key-value dictionary job_dict, in turn, is defined at line 213, from environment variables, hardcoded values, and local variables (which are themselves derived from the Pub/Sub event and context passed into the cloud function):

job_dict = {
    ...
    "image": f"gcr.io/{TRELLIS.GOOGLE_CLOUD_PROJECT}/clinicalgenomics/cnvnator:0.4.1",
    ...
    "script": f"gs://{TRELLIS.TRELLIS_BUCKET}/functions/{FUNCTION_NAME}/CNVnator.sh",
    ...
    "inputs": {
        "BAM": f"gs://{cram['bucket']}/{cram['path']}",
        # Trying to resolve an issue using CRAMs(?): https://github.com/DecodeGenetics/graphtyper/issues/57
        "REF_CACHE_SOURCE": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz"
    },
    "inputRecursive": f"DIR=gs://{TRELLIS.GOOGLE_CLOUD_PROJECT}-genomics-public-data/references/GRCh38/unzipped",
    "outputs": {
        "ROOT": f"gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.root",
        "CALL_OUT": f"gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.out",
        "EVAL_OUT": f"gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.txt",
        "CALL_VCF": f"gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.vcf",
        "GENOTYPE_OUT": f"gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}_genotype.out"
    },
    ...
    "sample": sample,
    "plate": plate,
    "name": task_name,
    ...
}
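For a review like this one, a dictionary of this shape can be checked mechanically. The sketch below walks a nested job description, collects the bucket name from every `gs://` URL, and reports any bucket not on an allow-list. The function name, the allow-list, and all bucket names are hypothetical; it is an illustration of the check, not part of Trellis.

```python
from urllib.parse import urlparse


def collect_buckets(value) -> set:
    """Recursively gather bucket names from every gs:// URL in a nested value."""
    buckets = set()
    if isinstance(value, dict):
        for item in value.values():
            buckets |= collect_buckets(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            buckets |= collect_buckets(item)
    elif isinstance(value, str):
        # Values such as "DIR=gs://..." carry a NAME= prefix, so search for the scheme.
        start = value.find("gs://")
        if start != -1:
            buckets.add(urlparse(value[start:]).netloc)
    return buckets


# Placeholder job description modeled on the entries shown above.
job = {
    "script": "gs://example-trellis/functions/launch-cnvnator/CNVnator.sh",
    "inputs": {"BAM": "gs://example-in/plate1/sample1.cram"},
    "inputRecursive": "DIR=gs://example-refs/references/GRCh38/unzipped",
    "outputs": {"ROOT": "gs://example-out/plate1/sample1/sample1.root"},
    "sample": "sample1",
}
allowed = {"example-trellis", "example-in", "example-out"}
unexpected = collect_buckets(job) - allowed
```

Any bucket left in `unexpected` is one the reviewer would want to look at by hand, for example a public reference-data bucket.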

I have pared down the code to the entries we care about for the security review:

  • image: the Docker image dsub is using
  • script: the script it runs within the image (note that some cloud functions use command instead of script, with similar implications)
  • inputs: the input file(s) to the script
  • outputs: the output file(s) from the script

These values are parameterized by the following variables:

  • TRELLIS.GOOGLE_CLOUD_PROJECT
  • TRELLIS.TRELLIS_BUCKET
  • TRELLIS.DSUB_OUT_BUCKET
  • FUNCTION_NAME

The first three are defined in a YAML configuration file stored in the Trellis bucket in Cloud Storage and the last by an environment variable provided by the Cloud Function service. The location of the configuration file, in turn, is specified by the environment variables CREDENTIALS_BUCKET and CREDENTIALS_BLOB, which are set by the developer on the Variables tab of the cloud function itself and are not publicly shared. These variables are set between lines 26 and 42 of our launch-cnvnator function:

ENVIRONMENT = os.environ.get('ENVIRONMENT', '')
...
if ENVIRONMENT == 'google-cloud':
    FUNCTION_NAME = os.environ['FUNCTION_NAME']

    vars_blob = storage.Client() \
                .get_bucket(os.environ['CREDENTIALS_BUCKET']) \
                .get_blob(os.environ['CREDENTIALS_BLOB']) \
                .download_as_string()
    parsed_vars = yaml.load(vars_blob, Loader=yaml.Loader)
    TRELLIS = Struct(**parsed_vars)

    PUBLISHER = pubsub.PublisherClient()
    CLIENT = storage.Client()
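The definition of Struct is not part of the excerpt above. A minimal sketch of what such a class might look like, and of how its attributes feed the f-strings in job_dict, is given below. The class body is an assumption on my part, and the configuration values are placeholders standing in for the parsed YAML.

```python
class Struct:
    """Expose dictionary keys as attributes, e.g. cfg.TRELLIS_BUCKET."""

    def __init__(self, **entries):
        self.__dict__.update(entries)


# Stand-in for the dictionary returned by yaml.load(); values are placeholders.
parsed_vars = {
    "GOOGLE_CLOUD_PROJECT": "example-project",
    "TRELLIS_BUCKET": "example-trellis",
    "DSUB_OUT_BUCKET": "example-dsub-out",
}
TRELLIS = Struct(**parsed_vars)
FUNCTION_NAME = "launch-cnvnator"

# The same interpolation pattern used for the "script" entry of job_dict.
script = f"gs://{TRELLIS.TRELLIS_BUCKET}/functions/{FUNCTION_NAME}/CNVnator.sh"
```

Seen this way, the bucket names never appear in the function's source; they are resolved at startup from the configuration file.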

The other variables job_dict uses are set on lines 168-209, in the function launch_cnvnator, and come from the event and context passed to the function by Trellis through the Pub/Sub (i.e. “publish–subscribe”) service.

def launch_cnvnator(event, context, test=False):
    """When an object node is added to the database, launch any
       jobs corresponding to that node label.

       Args:
            event (dict): Event payload.
            context (google.cloud.functions.Context): Metadata for the event.
    """

    # Parse message
    message = TrellisMessage(event, context)
    cram = message.results['cram']
    ...

    # Optional fields
    study = message.results.get('study')
    hospitalized = message.results.get('hospitalized')
    recvdActureCare = message.results.get('recvdActureCare')
    stayedInIcu = message.results.get('stayedInIcu')

    ...

    # Create unique task ID
    datetime_stamp = get_datetime_stamp()
    task_id, trunc_nodes_hash = make_unique_task_id([cram], datetime_stamp)

    # Database entry variables
    plate = cram['plate']
    sample = cram['sample']
    basename = cram['basename']

    study_metadata_path = f"study{study}/hospitalized{hospitalized}/recvdActureCare{recvdActureCare}/stayedInIcu{stayedInIcu}"

    task_name = 'cnvnator'
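To make the data flow concrete, here is a hedged sketch of decoding a Pub/Sub-triggered event and building the metadata path. Pub/Sub delivers the message body base64-encoded in the event's `data` field. The internals of TrellisMessage are not shown in this post, so `parse_event` and the `results` layout of the payload are assumptions for illustration.

```python
import base64
import json


def parse_event(event: dict) -> dict:
    """Decode the base64-encoded JSON payload of a Pub/Sub event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def study_metadata_path(results: dict) -> str:
    """Build the path segment from optional fields, as the function above does."""
    study = results.get("study")
    hospitalized = results.get("hospitalized")
    recvdActureCare = results.get("recvdActureCare")
    stayedInIcu = results.get("stayedInIcu")
    return (
        f"study{study}/hospitalized{hospitalized}/"
        f"recvdActureCare{recvdActureCare}/stayedInIcu{stayedInIcu}"
    )


# Hypothetical payload: only one optional field is present.
payload = {"results": {"cram": {"bucket": "example-in", "path": "s1.cram"}, "study": "A"}}
event = {"data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")}
results = parse_event(event)["results"]
path = study_metadata_path(results)
```

One consequence worth noting: because the optional fields are read with `.get()`, a missing field is interpolated as the string `None`, so it still appears in the output path.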

The entirety of CNVnator.sh, by the way, is as follows:

#!/bin/bash 
#CNVnator BASH Script

# https://github.com/DecodeGenetics/graphtyper/issues/57
echo "Untarring reference cache source"
tar xzf ${REF_CACHE_SOURCE} -C ${REF_CACHE_SOURCE%/*}

echo "Specifying paths to reference caches"
export REF_PATH="${REF_CACHE_SOURCE%/*}/ref/cache/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
export REF_CACHE="${REF_CACHE_SOURCE%/*}/ref/cache/%2s/%2s/%s"

echo "Running CNVnator"
cnvnator -unique -root ${ROOT} -tree ${BAM} -chrom $(seq -f 'chr%g' 1 22) chrX chrY chrM 
cnvnator -root ${ROOT} -his ${BIN_SIZE} -d ${DIR} 
cnvnator -root ${ROOT} -stat ${BIN_SIZE} 
cnvnator -root ${ROOT} -eval ${BIN_SIZE} > ${EVAL_OUT} 
cnvnator -root ${ROOT} -partition ${BIN_SIZE} 
cnvnator -root ${ROOT} -call ${BIN_SIZE} > ${CALL_OUT} 
perl /app/CNVnator_v0.4.1/src/cnvnator2VCF.pl ${CALL_OUT} ${DIR} > ${CALL_VCF} 
awk '{print $2} END {print "exit"}' ${CALL_OUT} | cnvnator -root ${ROOT} -genotype ${BIN_SIZE} > ${GENOTYPE_OUT}

The environment variables used in that script are set inside the Docker container by dsub, based on the arguments passed to it in dsub_args.
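The script leans on the shell expansion `${REF_CACHE_SOURCE%/*}`, which drops the last path component to get the directory holding the tarball. A small Python equivalent may make that easier to follow. The local path used here is an assumed example; the real value is chosen by dsub at run time.

```python
def strip_last_component(path: str) -> str:
    """Python equivalent of the shell expansion ${VAR%/*}."""
    head, sep, _ = path.rpartition("/")
    # With no slash in the value, the shell leaves it unchanged.
    return head if sep else path


# Assumed local path of the localized tarball, for illustration only.
env = {"REF_CACHE_SOURCE": "/mnt/data/input/gs/example-refs/hg38/v0/ref_cache.tar.gz"}

cache_dir = strip_last_component(env["REF_CACHE_SOURCE"])
# Mirrors the REF_CACHE export in the script; the %2s/%2s/%s pattern is left literal.
ref_cache = f"{cache_dir}/ref/cache/%2s/%2s/%s"
```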

I included what may be a tedious amount of detail above because there are so many steps involved in tracing back the location of the image, script, inputs, and outputs. For the security review, however, I summarized my findings in the following table:

dsub inputs & outputs

| Function | Storage URL | input / output | notes | from trellis.yaml** | from pub/sub msg*** | hardcoded |
|---|---|---|---|---|---|---|
| launch-bam-fastqc | gs://{bucket}/{path} | input | | | bucket, path | |
| launch-bam-fastqc | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{basename}.fastqc.data.txt | output | | OUT_BUCKET | plate, sample, task_id, basename | task_name |
| launch-cnvnator | gs://{cram['bucket']}/{cram['path']} | input | | | cram | |
| launch-cnvnator | gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.ref_cache.tar.gz | input | public data | | | |
| launch-cnvnator | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.root | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id, study_metadata_path | task_name |
| launch-cnvnator | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.out | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id, study_metadata_path | task_name |
| launch-cnvnator | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.txt | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id, study_metadata_path | task_name |
| launch-cnvnator | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}.vcf | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id, study_metadata_path | task_name |
| launch-cnvnator | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{study_metadata_path}/{sample}_genotype.out | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id, study_metadata_path | task_name |
| launch-fastq-to-ubam | gs://{bucket}/{path} | input | | | bucket, path | |
| launch-fastq-to-ubam | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{sample}_{read_group}.ubam | output | | | plate, sample, task_id, read_group | task_name |
| launch-flagstat | gs://{bucket}/{path} | input | | | bucket, path | |
| launch-flagstat | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{basename}.flagstat.data.tsv | output | | OUT_BUCKET | plate, sample, task_id, basename | task_name |
| launch-gatk-5-dollar | gs://{TRELLIS_BUCKET}/{GATK_MVP_DIR}/{GATK_MVP_HASH}/{GATK_GERMLINE_DIR}/google-adc.conf | input | | TRELLIS_BUCKET, GATK_MVP_DIR, GATK_MVP_HASH, GATK_GERMLINE_DIR | | |
| launch-gatk-5-dollar | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/inputs/{sample}.google-papi.options.json | input | | OUT_BUCKET | plate, sample, task_id, sample | task_name |
| launch-gatk-5-dollar | gs://{TRELLIS_BUCKET}/{GATK_MVP_DIR}/{GATK_MVP_HASH}/{GATK_GERMLINE_DIR}/fc_germline_single_sample_workflow.wdl | input | | TRELLIS_BUCKET, GATK_MVP_DIR, GATK_MVP_HASH, GATK_GERMLINE_DIR | | |
| launch-gatk-5-dollar | gs://{TRELLIS_BUCKET}/{GATK_MVP_DIR}/{GATK_MVP_HASH}/{GATK_GERMLINE_DIR}/tasks_pipelines/*.wdl | input | | TRELLIS_BUCKET, GATK_MVP_DIR, GATK_MVP_HASH, GATK_GERMLINE_DIR | | |
| launch-gatk-5-dollar | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/inputs/inputs.json | input | | OUT_BUCKET | plate, sample, task_id, sample | task_name |
| launch-gatk-5-dollar | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output | output | | OUT_BUCKET | plate, sample, task_id, sample | task_name |
| launch-text-to-table | gs://{bucket}/{path} | input | | | bucket, path | |
| launch-text-to-table | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{task_group}/{basename}.csv | output | | OUT_BUCKET | plate, sample, task_id, task_group, basename | |
| launch-vcfstats | gs://{bucket}/{path} | input | | | bucket, path | |
| launch-vcfstats | gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{sample}.rtg.vcfstats.data.txt | output | | OUT_BUCKET | plate, sample, task_id | task_name |
| launch-view-gvcf-snps | gs://{vcf['bucket']}/{vcf['path']} | input | | | vcf['bucket'], vcf['path'] | |
| launch-view-gvcf-snps | gs://{index['bucket']}/{index['path']} | input | | | index['bucket'], index['path'] | |
| launch-view-gvcf-snps | SIGNATURE_SNPS | input | | | | |
| launch-view-gvcf-snps | REF_FASTA | input | public data | | | |
| launch-view-gvcf-snps | REF_FASTA_INDEX | input | public data | | | |
| launch-view-gvcf-snps | gs://{TRELLIS.DSUB_OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{sample}.signatureSNPs.vcf.gz | output | | TRELLIS.DSUB_OUT_BUCKET | plate, sample, task_id | task_name |
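The provenance columns in this table were filled in by hand. A short sketch of how the placeholder names could be pulled out of a URL template automatically is shown below, using Python's `string.Formatter`. The `KNOWN_SOURCES` mapping is a hypothetical stand-in for the table's columns, and treating every unlisted name as coming from the Pub/Sub message is a simplifying assumption.

```python
from string import Formatter


def placeholders(template: str) -> list:
    """Return the distinct variable names used in a template, in order."""
    names = []
    for _, field, _, _ in Formatter().parse(template):
        if field:
            # Keep only the variable name, dropping any ['key'] lookup.
            root = field.split("[")[0]
            if root not in names:
                names.append(root)
    return names


# Hypothetical provenance map modeled on the table's columns.
KNOWN_SOURCES = {"OUT_BUCKET": "trellis.yaml", "task_name": "hardcoded"}


def provenance(template: str) -> dict:
    """Label each placeholder with where its value is assumed to come from."""
    return {
        name: KNOWN_SOURCES.get(name, "pub/sub message")
        for name in placeholders(template)
    }


template = "gs://{OUT_BUCKET}/{plate}/{sample}/{task_name}/{task_id}/output/{sample}.vcf"
sources = provenance(template)
```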

And:

dsub images & scripts

| Function | dsub Docker image | Script / command |
|---|---|---|
| launch-bam-fastqc | gcr.io/XXXXXXX/biocontainers/fastqc:v0.11.5_cv4 | fastqc.sh |
| launch-cnvnator | gcr.io/XXXXXXX/biocontainer/clinicalgenomics/cnvnator:0.4.1 | gs://XXXXXXX/functions/launch-cnvnator/CNVnator.sh |
| launch-fastq-to-ubam | gcr.io/XXXXXXX/biocontainer/broadinstitute/gatk:4.1.0.0 | /gatk/gatk --java-options '-Xmx8G -Djava.io.tmpdir=bla' FastqToSam -F1 ${FASTQ_1} -F2 ${FASTQ_2} -O ${UBAM} -RG ${RG} -SM ${SM} -PL ${PL} |
| launch-flagstat | gcr.io/XXXXXXX/biocontainer/biocontainers/samtools:v1.9-4-deb_cv1 | samtools flagstat ${INPUT} > ${OUTPUT} |
| launch-gatk-5-dollar | gcr.io/XXXXXXX/biocontainer/broadinstitute/cromwell:53 | java -Dconfig.file=${CFG} -Dbackend.providers.${BACKEND_PROVIDER}.config.project=${PROJECT} -Dbackend.providers.${BACKEND_PROVIDER}.config.root=${ROOT} -jar /app/cromwell.jar run ${WDL} --inputs ${INPUT} --options ${OPTION} |
| launch-text-to-table | gcr.io/XXXXXXX/biocontainer/stanfordbioinformatics/text-to-table:0.2.1 | text2table -s ${SCHEMA} -o ${OUTPUT} -v series=${SERIES},sample=${SAMPLE_ID} ${INPUT} |
| launch-vcfstats | gcr.io/XXXXXXX/biocontainer/realtimegenomics/rtg-tools:3.7.1 | rtg vcfstats ${INPUT} > ${OUTPUT} |
| launch-view-gvcf-snps | gcr.io/XXXXXXX/biocontainer/bschiffthaler/bcftools:1.11 | bcftools view ${VCF} -R ${SNP_LIST} -Ou \| bcftools convert --gvcf2vcf --fasta-ref ${REF_FASTA} -Ou \| bcftools view -T ${SNP_LIST} -Oz -o ${OUTPUT} |
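One way to answer the question of where the images live is to split each image reference into registry, project, repository, and tag, then confirm that the registry and project are the expected ones. The sketch below does that for references shaped like the ones in the table. `split_image`, `is_in_project`, and the project name are hypothetical, and the parsing is deliberately simple rather than a full implementation of Docker's reference grammar.

```python
def split_image(image: str) -> dict:
    """Split 'registry/project/repository:tag' into its parts."""
    path, sep, tag = image.rpartition(":")
    # No tag present if there is no colon, or the colon belongs to a registry port.
    if not sep or "/" in tag:
        path, tag = image, ""
    parts = path.split("/")
    if len(parts) < 3:
        raise ValueError(f"expected registry/project/repository, got: {image}")
    return {
        "registry": parts[0],
        "project": parts[1],
        "repository": "/".join(parts[2:]),
        "tag": tag,
    }


def is_in_project(image: str, project: str) -> bool:
    """True if the image is served from gcr.io under the given project."""
    info = split_image(image)
    return info["registry"] == "gcr.io" and info["project"] == project


# Placeholder project name; the real one is redacted in the table above.
ok = is_in_project("gcr.io/example-project/biocontainers/fastqc:v0.11.5_cv4", "example-project")
```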

In the end, we found that Trellis uses only one bucket for input and one for output, and that both are within the MVP Google Cloud project, meaning they are encrypted at rest and in transit between GCP services.

Other concerns

  • Trellis also uses Cromwell, another pipeline manager, “because the $5 GATK pipeline that we used for variant calling was already defined in Cromwell’s native Workflow Definition Language (WDL) and the pipeline had already been optimized to run on Google Cloud using Cromwell.”
  • Trellis also uses Neo4j, a graph database, to track the state of the system; however, Neo4j does not store any genomics data, only metadata relating to the job and data files, so it is not subject to the same scrutiny as the GCP components of Trellis.

References

Ross, P.B., Song, J., Tsao, P.S. et al. Trellis for efficient data and task management in the VA Million Veteran Program. Sci Rep 11, 23229 (2021). https://doi.org/10.1038/s41598-021-02569-5