#35 Annotating MVP Genomes with VEP & LOFTEE
So far in our burden testing, our results have not matched Genebass’s, and it has been difficult to figure out where the differences are coming from: are they due to a) differences in the genomes themselves, b) differences in our algorithm, or c) differences in the variants’ annotations? Taking them one at a time: the individual genomes are obviously different, but we have controlled for covariates like sex, ethnicity, and PCA scores, so this should not be a big factor. Our algorithm is the most basic form of burden testing, just a simple aggregation and linear regression, so we can probably rule it out as well. However, the input to our algorithm – the variants and their annotations – differs from Genebass’s, because we have been annotating our variants with gnomAD, whereas Genebass annotated theirs with VEP (Ensembl’s Variant Effect Predictor) and its LOFTEE plugin (Karczewski’s Loss-of-Function Transcript Effect Estimator).
The variants’ annotations, then, have become the focus of our troubleshooting. For background, annotating each variant with its corresponding gene and the likelihood that the variant causes a loss of function is one of the first steps in a rare variant analysis; we are only interested in variants with a high-confidence loss-of-function effect on the gene. At one point we considered running VEP/LOFTEE ourselves, but chose not to, because it is built on technology that does not fit well with our stack: VEP is written in Perl and meant to run on a single machine, whereas Hail is written in Python and Scala and meant to run on a distributed computing cluster consisting of many individual machines. Scaling VEP would involve manually sharding and distributing a very large VCF file to a number of machines, then manually combining the results. The perceived difficulty of running VEP on our data led us to try other options first, such as gnomAD. (Incidentally, Hail includes a VEP function that operates on a Hail-native table and calls the VEP Perl script under the hood after splitting up the table and distributing the shards to worker nodes in the cluster, but so far it has not worked for us for various technical reasons.)
Recently, however, we took another look at VEP, since our other efforts at troubleshooting had not panned out. This blog post details the process of running VEP on a manually managed “cluster” of virtual machines in Google’s cloud, including the various twists and turns the process took along the way.
Running VEP and LOFTEE on a single Compute VM
This was my first attempt at running VEP on a Compute VM. The steps I took, in order, are as follows:
- Created a new Compute VM instance, dcotter-vep-standalone, with an n1-highmem machine type and a 20 GB boot disk:

  gcloud compute instances create dcotter-vep-standalone \
    --project=xxx \
    --zone=us-west1-b \
    --machine-type=n1-highmem-8 \
    --network-interface=network=default,network-tier=PREMIUM,stack-type=IPV4_ONLY \
    --maintenance-policy=MIGRATE \
    --provisioning-model=STANDARD \
    --service-account=963911152157-compute@developer.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
    --create-disk=auto-delete=yes,boot=yes,device-name=dcotter-vep-standalone,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230912,mode=rw,size=20,type=projects/xxx/zones/us-west1-b/diskTypes/pd-balanced \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --labels=goog-ec-src=vm_add-gcloud \
    --reservation-affinity=any
- Created an additional 500 GB data disk:

  gcloud compute disks create data-disk \
    --size=500GB \
    --type=pd-balanced \
    --region=us-west1 \
    --zone=us-west1-b
- Attached the disk:
gcloud compute instances attach-disk dcotter-vep-standalone --disk=data-disk
- SSH’d into the VM and:
- Formatted the additional disk (sdb) according to these instructions:

  sudo mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
  mkdir ~/data
  sudo mount -o discard,defaults /dev/sdb ~/data
  sudo chmod a+rwx ~/data
  sudo cp /etc/fstab /etc/fstab.backup
  echo UUID=`sudo blkid -s UUID -o value /dev/sdb` /home/dlcott2/data ext4 discard,defaults,noatime,nofail 0 2 | sudo tee -a /etc/fstab
- Installed Docker according to official instructions:
  # Set up Docker's Apt repository
  # - add Docker's official GPG key:
  sudo apt-get update
  sudo apt-get install ca-certificates curl gnupg
  sudo install -m 0755 -d /etc/apt/keyrings
  curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  sudo chmod a+r /etc/apt/keyrings/docker.gpg

  # - add the repository to Apt sources:
  echo \
    "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
    "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
    sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  sudo apt-get update

  # Install the Docker packages.
  sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

  # Verify that the installation is successful by running the hello-world image:
  sudo docker run hello-world
- Added my user to the Docker group according to these instructions:
  sudo groupadd docker
  sudo usermod -aG docker $USER
  newgrp docker
  docker run hello-world
- Downloaded the Docker VEP image:
docker pull ensemblorg/ensembl-vep
- Downloaded the MVP DR2 VCF file and the VEP cache files and cloned the LOFTEE repository onto the additional data disk:

  # Enter shared directory (which will be mapped to /opt/vep/.vep in the Docker container)
  cd /home/dlcott2/data

  # Download MVP DR2 VCF file (VERY large, 237 GB)
  gsutil cp gs://xxx/gvcf_aggregation_100k/release_20230505/rel2_100k-var.vcf.bgz .

  # Download the VEP cache files and unpack them (also large, 20 GB)
  curl --remote-name https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
  tar xzf homo_sapiens_vep_110_GRCh38.tar.gz

  # Clone LOFTEE plugin repo
  git clone https://github.com/konradjk/loftee.git
- Started the image interactively and mapped a local directory to a directory in the container for file sharing:

  # To log in as root, add "--user root" to the following:
  docker run --tty --interactive --volume /home/dlcott2/data:/opt/vep/.vep ensemblorg/ensembl-vep
- Ran VEP:
  vep --cache \
    --verbose \
    --input_file rel2_100k-var.vcf.bgz \
    --output_file rel2_100k-var-vep.tsv \
    --host useastdb.ensembl.org \
    --plugin LoF,loftee_path=/opt/vep/.vep/loftee \
    --dir_plugins /opt/vep/.vep/loftee
The output from VEP:
vep@46a76261b877:~/.vep$ vep --cache \
> --verbose \
> --input_file rel2_100k-var.vcf.bgz \
> --output_file rel2_100k-var-vep.tsv \
> --host useastdb.ensembl.org \
> --plugin LoF,loftee_path=/opt/vep/.vep/loftee/ \
> --dir_plugins /opt/vep/.vep/loftee/
Smartmatch is experimental at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.
2023-09-22 18:02:31 - Ignored unsupported option 'no_plugins=1' from environment variable VEP_NO_PLUGINS
2023-09-22 18:02:31 - Ignored unsupported option 'pluginsdir=/plugins' from environment variable VEP_PLUGINSDIR
2023-09-22 18:02:31 - Ignored unsupported option 'no_update=1' from environment variable VEP_NO_UPDATE
2023-09-22 18:02:31 - Set 'dir_plugins=/plugins' from environment variable VEP_DIR_PLUGINS
2023-09-22 18:02:31 - Ignored unsupported option 'no_htslib=1' from environment variable VEP_NO_HTSLIB
2023-09-22 18:02:31 - Read configuration from environment variables
Will only load v110 databases
Species 'homo_sapiens' loaded from database 'homo_sapiens_core_110_38'
Species 'homo_sapiens' loaded from database 'homo_sapiens_cdna_110_38'
Species 'homo_sapiens' loaded from database 'homo_sapiens_otherfeatures_110_38'
Species 'homo_sapiens' loaded from database 'homo_sapiens_rnaseq_110_38'
homo_sapiens_variation_110_38 loaded
homo_sapiens_funcgen_110_38 loaded
No ancestral database found
No ontology database found
Use of uninitialized value $taxonomy_db in concatenation (.) or string at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/Registry.pm line 2355.
ensembl_taxonomy API not found - ignoring
Use of uninitialized value $ensembl_metadata_db in concatenation (.) or string at /opt/vep/src/ensembl-vep/Bio/EnsEMBL/Registry.pm line 2393.
ensembl_metadata API not found - ignoring
No production database or adaptor found
Smartmatch is experimental at /opt/vep/.vep/loftee/de_novo_donor.pl line 175.
Smartmatch is experimental at /opt/vep/.vep/loftee/de_novo_donor.pl line 214.
Smartmatch is experimental at /opt/vep/.vep/loftee/splice_site_scan.pl line 191.
Smartmatch is experimental at /opt/vep/.vep/loftee/splice_site_scan.pl line 194.
Smartmatch is experimental at /opt/vep/.vep/loftee/splice_site_scan.pl line 238.
Smartmatch is experimental at /opt/vep/.vep/loftee/splice_site_scan.pl line 241.
Use of uninitialized value in string eq at /opt/vep/.vep/loftee/LoF.pm line 348.
WARNING: Failed to instantiate plugin LoF: Can't open /vep/loftee/splice_data/donor_motifs/ese.txt: No such file or directory at /opt/vep/.vep/loftee/loftee_splice_utils.pl line 212.
2023-09-22 18:02:33 - No input file format specified - detected vcf format
gzip: stdout: Broken pipe
There are a few warning signs here, but the one that concerned me the most was VEP saying it couldn’t start LOFTEE:
WARNING: Failed to instantiate plugin LoF: Can't open /vep/loftee/splice_data/donor_motifs/ese.txt: No such file or directory at /opt/vep/.vep/loftee/loftee_splice_utils.pl line 212.
The program hadn’t returned, either in success or failure, so I just let it run for a while. There were no progress indicators, so to get an idea of what was going on, I ran ps and wc -l rel2_100k-var-vep.tsv (the specified output file). At some point, my VPN connection timed out and the SSH session froze, so I didn’t know whether the Docker container was still running in the abandoned session or not. Clearly, I needed a better way of actually running the job.
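In retrospect, a lightweight way around this would have been to run the container detached rather than interactively, so a dropped VPN or SSH session couldn’t take the job with it. A rough sketch, assuming the vep executable is on the image’s PATH (as the interactive session above suggests):

```bash
# Same VEP invocation as above (plugin flags elided), but detached from the terminal:
# the job runs under the Docker daemon, so a dropped SSH session can't kill it.
docker run --detach --name vep-run \
  --volume /home/dlcott2/data:/opt/vep/.vep \
  ensemblorg/ensembl-vep \
  vep --cache \
      --input_file /opt/vep/.vep/rel2_100k-var.vcf.bgz \
      --output_file /opt/vep/.vep/rel2_100k-var-vep.tsv

# Check on it whenever you log back in:
docker logs -f vep-run
wc -l /home/dlcott2/data/rel2_100k-var-vep.tsv   # rough progress indicator
```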
Logging back in to check on things, I saw the final output of wc was 1.2M lines and 174 MB:
1200745 rel2_100k-var-vep.tsv
The first hundred lines of that file are:
## ENSEMBL VARIANT EFFECT PREDICTOR v110.1
## Output produced at 2023-09-22 18:02:33
## Connected to homo_sapiens_core_110_38 on useastdb.ensembl.org
## Using cache in /opt/vep/.vep/homo_sapiens/110_GRCh38
## Using API version 110, DB version 110
## ensembl-funcgen version 110.24e6da6
## ensembl-io version 110.b1a0d57
## ensembl version 110.584a8f3
## ensembl-variation version 110.d34d25e
## regbuild version 1.0
## sift version 6.2.1
## gnomADg version v3.1.2
## HGMD-PUBLIC version 20204
## 1000genomes version phase3
## polyphen version 2.2.3
## gencode version GENCODE 44
## genebuild version 2014-07
## gnomADe version r2.1.1
## assembly version GRCh38.p14
## COSMIC version 97
## dbSNP version 154
## ClinVar version 202301
## Column descriptions:
## Uploaded_variation : Identifier of uploaded variant
## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)
## Allele : The variant allele used to calculate the consequence
## Gene : Stable ID of affected gene
## Feature : Stable ID of feature
## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature
## Consequence : Consequence type
## cDNA_position : Relative position of base pair in cDNA sequence
## CDS_position : Relative position of base pair in coding sequence
## Protein_position : Relative position of amino acid in protein
## Amino_acids : Reference and variant amino acids
## Codons : Reference and variant codon sequence
## Existing_variation : Identifier(s) of co-located known variants
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## VEP command-line: vep --cache --database 0 --dir_plugins [PATH]/ --input_file rel2_100k-var.vcf.bgz --output_file rel2_100k-var-vep.tsv --plugin [PATH]/ --verbose
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
chr1_13047_C/A chr1:13047 A ENSG00000223972 ENST00000450305 Transcript non_coding_transcript_exon_variant 255 - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13047_C/A chr1:13047 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13047_C/A chr1:13047 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1357;STRAND=-1
chr1_13047_C/A chr1:13047 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4322;STRAND=-1
chr1_13052_G/C chr1:13052 C ENSG00000223972 ENST00000450305 Transcript splice_region_variant,non_coding_transcript_exon_variant 260 - - - - - IMPACT=LOW;STRAND=1
chr1_13052_G/C chr1:13052 C ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13052_G/C chr1:13052 C ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1352;STRAND=-1
chr1_13052_G/C chr1:13052 C ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4317;STRAND=-1
chr1_13053_G/A chr1:13053 A ENSG00000223972 ENST00000450305 Transcript splice_donor_variant,non_coding_transcript_variant - - - - - - IMPACT=HIGH;STRAND=1
chr1_13053_G/A chr1:13053 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13053_G/A chr1:13053 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1351;STRAND=-1
chr1_13053_G/A chr1:13053 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4316;STRAND=-1
chr1_13053_G/C chr1:13053 C ENSG00000223972 ENST00000450305 Transcript splice_donor_variant,non_coding_transcript_variant - - - - - - IMPACT=HIGH;STRAND=1
chr1_13053_G/C chr1:13053 C ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13053_G/C chr1:13053 C ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1351;STRAND=-1
chr1_13053_G/C chr1:13053 C ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4316;STRAND=-1
chr1_13054_C/A chr1:13054 A ENSG00000223972 ENST00000450305 Transcript splice_donor_variant,non_coding_transcript_variant - - - - - - IMPACT=HIGH;STRAND=1
chr1_13054_C/A chr1:13054 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13054_C/A chr1:13054 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1350;STRAND=-1
chr1_13054_C/A chr1:13054 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4315;STRAND=-1
chr1_13058_C/G chr1:13058 G ENSG00000223972 ENST00000450305 Transcript splice_donor_region_variant,intron_variant,non_coding_transcript_variant - - - - - - IMPACT=LOW;STRAND=1
chr1_13058_C/G chr1:13058 G ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13058_C/G chr1:13058 G ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1346;STRAND=-1
chr1_13058_C/G chr1:13058 G ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4311;STRAND=-1
chr1_13073_C/A chr1:13073 A ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13073_C/A chr1:13073 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13073_C/A chr1:13073 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1331;STRAND=-1
chr1_13073_C/A chr1:13073 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4296;STRAND=-1
chr1_13074_T/C chr1:13074 C ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13074_T/C chr1:13074 C ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13074_T/C chr1:13074 C ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1330;STRAND=-1
chr1_13074_T/C chr1:13074 C ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4295;STRAND=-1
chr1_13076_G/A chr1:13076 A ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13076_G/A chr1:13076 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13076_G/A chr1:13076 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1328;STRAND=-1
chr1_13076_G/A chr1:13076 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4293;STRAND=-1
chr1_13080_G/A chr1:13080 A ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13080_G/A chr1:13080 A ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13080_G/A chr1:13080 A ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1324;STRAND=-1
chr1_13080_G/A chr1:13080 A ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4289;STRAND=-1
chr1_13084_G/C chr1:13084 C ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13084_G/C chr1:13084 C ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13084_G/C chr1:13084 C ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1320;STRAND=-1
chr1_13084_G/C chr1:13084 C ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4285;STRAND=-1
chr1_13087_AG/- chr1:13087-13088 - ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13087_AG/- chr1:13087-13088 - ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13087_AG/- chr1:13087-13088 - ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - IMPACT=MODIFIER;DISTANCE=1316;STRAND=-1
chr1_13087_AG/- chr1:13087-13088 - ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - IMPACT=MODIFIER;DISTANCE=4281;STRAND=-1
chr1_13087_A/G chr1:13087 G ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13087_A/G chr1:13087 G ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13087_A/G chr1:13087 G ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1317;STRAND=-1
chr1_13087_A/G chr1:13087 G ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4282;STRAND=-1
chr1_13088_G/- chr1:13088 - ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13088_G/- chr1:13088 - ENSG00000290825 ENST00000456328 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_13088_G/- chr1:13088 - ENSG00000227232 ENST00000488147 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=1316;STRAND=-1
chr1_13088_G/- chr1:13088 - ENSG00000278267 ENST00000619216 Transcript downstream_gene_variant - - IMPACT=MODIFIER;DISTANCE=4281;STRAND=-1
chr1_13088_G/C chr1:13088 C ENSG00000223972 ENST00000450305 Transcript intron_variant,non_coding_transcript_variant - - - - - - IMPACT=MODIFIER;STRAND=1
This looks like roughly what I was expecting as output from VEP, but I still have two problems to deal with:
- The VPN timeout is killing my SSH session with the Compute VM
- The LOFTEE plugin is not being loaded when VEP runs.
Regarding the first, Paul suggested a terminal multiplexer like screen or tmux, which keeps the session alive even after the SSH connection drops – that did the trick. Regarding the second, I found a solution, but I’m not totally sure why it works: instead of pointing VEP to the LOFTEE plugin code in the cloned repository, I copied all the code from the repo to the root-level plugins directory and ran VEP without specifying the LOFTEE plugin location.
NOT WORKING:
vep@46a76261b877:~/.vep$ vep --cache \
> --verbose \
> --input_file rel2_100k-var.vcf.bgz \
> --output_file rel2_100k-var-vep.tsv \
> --host useastdb.ensembl.org \
> --plugin LoF,loftee_path=/opt/vep/.vep/loftee/ \
> --dir_plugins /opt/vep/.vep/loftee/
WORKING:
cp -r /opt/vep/.vep/loftee/* /plugins
vep --cache \
--verbose \
--input_file /opt/vep/.vep/rel2_100k-var-chr1.vcf.bgz \
--output_file /opt/vep/.vep/rel2_100k-var-chr1-vep.tsv \
--host useastdb.ensembl.org \
--plugin LoF,loftee_path:/opt/vep/.vep/loftee/
I would normally hesitate to copy plugin code to a global directory and risk overwriting existing files, especially when they use common directory names like src, but since this is running in a transient Docker container with no other users and a very temporary lifespan, I can’t see the harm.
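One sanity check that the plugin really loaded this time (a sketch, not something I ran verbatim): LOFTEE writes its calls into the Extra column as LoF=HC or LoF=LC key-value pairs, so the working run’s output should contain them, whereas the broken run’s output should not.

```bash
# Count rows carrying a LOFTEE call in the output TSV (skipping the "##" header lines)
grep -v "^##" /opt/vep/.vep/rel2_100k-var-chr1-vep.tsv | grep -c "LoF="
```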
Dealing with (very) big data
Joe sent the Storage location for the Data Release 2 VCF (gs://xxx/gvcf_aggregation_100k/release_20230505/rel2_100k-var.vcf.bgz) on 9/19 – a 236 GB compressed file. I was warned not to decompress it; after all, if it is 236 GB compressed, how many times larger would it be at full size? Better not to find out, so I restricted myself to tools that can manipulate the file in its compressed form (typically, these utilities decompress small parts of the file in memory, do their work, and write the results to disk, avoiding ever writing the decompressed contents out to disk).
The next problem was to figure out how to shard the file into one file per chromosome so I could run VEP on smaller files one at a time. This is the header of the file:
$ zcat rel2_100k-var.vcf.bgz | grep -v "^##" | head -n 1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SHIP5297471 SHIP5508159...
The list of header columns goes on for many screens, and I suspected there were as many columns as samples (104,923 in DR2). To test, I counted the number of words in the header:
$ zcat rel2_100k-var.vcf.bgz | grep -v "^##" | head -n 1 | wc -w
104932
This amounts to 9 fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, and FORMAT) plus 104,923 samples, so I was right. Then I wrote a quick Bash script to split the master file by contig:
#!/bin/bash
for contig in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
zgrep -v "^##" rel2_100k-var.vcf.bgz | head -n 1 | gzip > chr$contig.gz
zgrep "^chr$contig" rel2_100k-var.vcf.bgz | gzip >> chr$contig.gz
done
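One thing worth noting about this script: it decompresses the full 236 GB file twice per contig, or fifty times in all. A single-pass alternative along these lines – just a sketch, assuming every record’s chromosome name appears verbatim in column 1 – would have been considerably gentler:

```bash
#!/bin/bash
# Stream the compressed file once, routing each record to a per-contig gzip stream
# keyed on column 1. awk keeps one pipe open per distinct output command.
zcat rel2_100k-var.vcf.bgz | awk -F'\t' '
  /^##/    { next }                       # drop meta-header lines
  /^#CHROM/{ header = $0; next }          # remember the column header
  {
    cmd = "gzip >> " $1 ".vcf.gz"
    if (!($1 in seen)) { seen[$1] = 1; print header | cmd }
    print | cmd
  }'
```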
Later, I asked Paul whether grep was sufficient for splitting the file by chromosome – I was worried there might be exceptions to the simple regex I was using, based on some anomalous chromosome names I’d run across – and he said bcftools might be a better option. bcftools, however, requires an index file for random access to the VCF, so the first step is to index the VCF file. Here is my revised process:
- Index the VCF file:
bcftools index rel2_100k-var.vcf.bgz
- Filter by region for each chromosome:
bcftools filter --regions chr1 rel2_100k-var.vcf.bgz > rel2_100k-var-chr1.vcf
Step 1 took 8 days to complete (from 9/27 to 10/4) and produced a 2.3 MB index file with a .csi extension. In the meantime, I happened to notice an index file named rel2_100k-var.vcf.bgz.tbi in the same Storage directory Joe gave me earlier (gs://xxx/gvcf_aggregation_100k/release_20230505). So recreating the index was probably unnecessary; I could have just used the one that was already there, which I hadn’t spotted when I first looked at the directory (to be fair, that was before I knew I would need an index file or that it would take so long to generate one). By the time the index had been created, we had realized there was a better way to do what I was trying to do (see update below), so I nixed step 2.
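For the record, that pre-existing .tbi would have let me skip the eight-day indexing step entirely – bcftools picks up a .tbi or .csi sitting next to the VCF. A sketch of what that would have looked like:

```bash
# Download the existing index alongside the VCF, then extract a region directly
gsutil cp gs://xxx/gvcf_aggregation_100k/release_20230505/rel2_100k-var.vcf.bgz.tbi .
bcftools view --regions chr1 -Oz -o rel2_100k-var-chr1.vcf.gz rel2_100k-var.vcf.bgz
```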
Around the same time, we realized that the 236 GB file Joe uploaded included all 105k sample columns, with a blank genotype entry for each sample at each variant, which added a lot of unneeded bulk to the table. Apparently, the commands he used in Hail rendered the genotype entries blank but didn’t drop the sample columns, so each (blank) entry was represented by a single dot. Later that day, he recreated the variants-only dataset in Hail without the sample columns or genotype entries and split it out by chromosome. He posted the results, 24 separate .bgz files, to gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs, and they are much smaller (12 MB to 180 MB each).
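For what it’s worth, the blank genotype entries in the original export are easy to see directly – a quick spot check (sketch):

```bash
# Peek at the first data row's first three sample columns (fields 10-12 of a VCF);
# in the original export each one is just a placeholder dot.
zcat rel2_100k-var.vcf.bgz | grep -v "^#" | head -n 1 | cut -f 10-12
```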
Distributing a large homo sapiens variations file to the worker nodes
My plan was to create one Compute VM per chromosome and run VEP on each of them, then stitch the results into a single output file – a sort of manually orchestrated compute cluster. The first problem I ran into was that VEP requires a homo sapiens variations file, which can either be downloaded on the fly from Ensembl’s FTP site or provided locally in a cache directory (they recommend the latter). I needed to get this file onto all 24 machines, and given its size (20 GB), this was a non-trivial problem.
I had already downloaded the file to dcotter-vep-standalone, the Compute VM I used when trying to run VEP on the 236 GB master VCF file, so I had a copy of it. My first thought was to upload that copy to Storage from the VM and then download it onto the other 24 VMs. However, when I tried this, I got 403: Access Denied errors, which persisted even after I changed the Storage scope on that VM to ReadWrite.
Then I tried downloading it to my laptop so that I could upload it to Storage myself, since I knew I could upload files. But the download from their FTP site was extremely slow (300 KB/s), which I thought might be related to my being on Stanford’s VPN at the time, so I disconnected and tried again. Still slow, but marginally faster.
Then I tried using Globus Connect Personal to download it to my laptop, but I kept getting errors within the Globus Connect Personal app. Example:
2023-10-04 15:09:35 Connecting to server: relay.globusonline.org:2223
2023-10-04 15:09:37 [ssh stderr] Allocated port 36925 for remote forward to 127.0.0.1:52659
2023-10-04 15:09:37 [ssh stderr] rpc succeeded after 1 attempts(s)
2023-10-04 15:09:37 [ssh stderr] Relay Service Error: The endpoint was deleted on 2022-03-02 23:05:49.753800 UTC
2023-10-04 15:09:37 [ssh stderr]
I knew from past experience that Globus’s settings are frequently complicated and require a sysadmin with Globus expertise to debug, and since our Globus expert is filling in for our departing SCG sysadmin at the moment and very busy, I dismissed the idea of putting a ticket in with him for the Globus issue.
However, I tried downloading the file from Globus through the web app (i.e., not using Globus Connect Personal), and the average download speed was around 1 MB/s, so a 20 GB file would take roughly 20,000 s, or ~5.5 hours. Tolerable, but still slow, and if I then had to upload it to Storage and download it onto 24 VMs, it would still delay me by days. I kept the download going while I thought about other options.
Finally, I hit on the idea that would drastically speed up the whole operation: Clone the disk through the Compute API. I ran the following:
gcloud compute disks create dcotter-vep-chry \
--project=xxx \
--type=pd-balanced \
--description=Includes\ 20G\ homo\ sapiens\ cache\ file\ required\ by\ VEP \
--size=500GB \
--zone=us-west1-b \
  --source-disk=projects/xxx/zones/us-west1-b/disks/data-disk
It took less than a minute to create the disk, and after attaching the cloned disk to a new Compute VM and mounting the device, I could see the 20 GB file there, so I’m going to call it a success.
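For completeness, the attach-and-mount step is only a couple of commands – a sketch, with an illustrative instance name and assuming the clone shows up as /dev/sdb as before:

```bash
# Attach the cloned disk to the target VM (instance name is illustrative)
gcloud compute instances attach-disk dcotter-vep-standalone-chry \
  --disk=dcotter-vep-chry --zone=us-west1-b

# Then, on the VM (the clone already has a filesystem, so no mkfs this time):
sudo mkdir -p /home/dlcott2/data
sudo mount -o discard,defaults /dev/sdb /home/dlcott2/data
```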
Creating a machine image to use on Compute VM
Now I’m wondering: can I create an image that includes VEP, LOFTEE, the 20 GB cache file, and any other requirements for running VEP, and then use that image for each of the 24 VEP VMs? I don’t want to waste precious time on technical adventurism; on the other hand, it might pay off in a reusable image and some useful new knowledge. The GCP console notes that:
A machine image contains a VM’s properties, metadata, permissions, and data from all its attached disks. You can use a machine image to create, backup, or restore a VM.
The docs say an image is a good option for instance cloning, so I don’t think I’m far afield. Let’s give it a try:
gcloud beta compute machine-images create dcotter-vep-standalone-image \
--project=xxx \
--description=The\ image\ includes:$'\n'\ \ -\ VEP\ Docker\ image$'\n'\ \ -\ LOFTEE\ Git\ repo$'\n'\ \ -\ homo\ sapiens\ cache\ file\ \(20Gb\)\ and\ extracted\ directory\ tree \
--source-instance=dcotter-vep-standalone \
--source-instance-zone=us-west1-b \
--storage-location=us-west1
And now I’m creating an instance based on the image (this will turn out to work so well I end up using it on all 24 chromosome-specific VMs):
gcloud compute instances create dcotter-vep-standalone-chr01 \
--project=xxx \
--zone=us-west1-b \
--machine-type=n1-highmem-32 \
--network-interface=network=default,network-tier=PREMIUM,stack-type=IPV4_ONLY \
--maintenance-policy=MIGRATE \
--provisioning-model=STANDARD \
--service-account=963911152157-compute@developer.gserviceaccount.com \
--scopes=https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/trace.append,https://www.googleapis.com/auth/devstorage.read_write \
--min-cpu-platform=Automatic \
--no-shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=goog-ec-src=vm_add-gcloud \
--reservation-affinity=any \
--source-machine-image=dcotter-vep-standalone-image
After creating the VM based on the image:
- SSH into the instance:
gcloud compute ssh dcotter-vep-standalone-chr01
- Download the appropriate chromosome VCF:
gsutil cp gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs/rel2_100k-var-chr01.vcf.bgz /home/dlcott2/data/
- Start a new screen session:

  screen
- Start the Docker container and run VEP:
  # Start the container interactively
  docker run --tty --interactive --volume /home/dlcott2/data:/opt/vep/.vep ensemblorg/ensembl-vep

  # From the Docker container:
  # - copy the LOFTEE plugin into /plugins
  cp -r /opt/vep/.vep/loftee/* /plugins

  # - run VEP
  vep --cache \
    --verbose \
    --input_file /opt/vep/.vep/rel2_100k-var-chr01.vcf.bgz \
    --output_file /opt/vep/.vep/rel2_100k-var-chr01-vep.tsv \
    --host useastdb.ensembl.org \
    --plugin LoF,loftee_path:/opt/vep/.vep/loftee/
- Detach the screen (Ctrl+A, d) and log out of the SSH session (Ctrl+d)
- Tar & zip the results and upload the results file to Storage once VEP finishes. From the VM:

  cd /home/dlcott2/data/
  tar czf rel2_100k-var-chr01-vep.tar.gz rel2_100k-var-chr01-vep*
  gsutil cp rel2_100k-var-chr01-vep.tar.gz gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs/
I ended up running all 24 Compute VMs in this manner and uploaded the results to gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs.
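Spelled out, the per-chromosome VM creation amounts to a loop like the following (a sketch; the instance names follow the chr01, chr02, … pattern above, and in practice the machine-image rate limit described below gets in the way):

```bash
for chrom in 01 02 03 04 05 06 07 08 09 10 11 12 \
             13 14 15 16 17 18 19 20 21 22 X Y; do
  gcloud compute instances create dcotter-vep-standalone-chr${chrom} \
    --project=xxx \
    --zone=us-west1-b \
    --machine-type=n1-highmem-32 \
    --source-machine-image=dcotter-vep-standalone-image
done
```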
Aside: How much does an image cost?
$ gcloud compute machine-images describe dcotter-vep-standalone-image
creationTimestamp: '2023-10-05T13:46:39.427-07:00'
description: |-
Includes:
- VEP Docker image
- LOFTEE Git repo
- homo sapiens cache file (20Gb) and extracted directory tree
id: '2888490450307685954'
...
name: dcotter-vep-standalone-image
...
totalStorageBytes: '44591179008'
44,591,179,008 bytes ≈ 45 GB
Machine image storage (per GB per month): $0.05 (source)
45 GB × $0.05 / month ≈ $2.25 / month
Trying again with the proper LOFTEE release
When I looked at the results of the VEP runs on the 24 machines, they were not what I expected. The log files contained lots of warnings and errors, and the results files ended well before their respective chromosomes did (chromosome 15, for instance, stopped around ~40M bp, and the chromosome itself is ~102M bp long, so something clearly went wrong). It turns out I should have been using the LOFTEE GRCh38 release rather than the latest version of the LOFTEE plugin repository: the master branch of the LOFTEE code is “very broken for 38” (according to Karczewski, the author of the plugin). I agree with the user who said, “I think master being broken for GRCh38 needs to be added to the README on the front page. GRCh38 is the latest human reference and should be considered the default.”
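Swapping in the right release is a one-line change on each machine – a sketch, assuming (as the LOFTEE issues suggest) that the GRCh38 work lives on the repository’s grch38 branch:

```bash
# Replace the master-branch checkout with the GRCh38 one on the data disk
cd /home/dlcott2/data
rm -rf loftee
git clone --branch grch38 https://github.com/konradjk/loftee.git
```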
In any case, I recreated the machine for chromosome 15 with the GRCh38 release of LOFTEE (chromosome 15 because that is where ACAN, the top hit for our height phenotype, lives, around ~88M bp). I checked on it after a few hours, and it looked much better: VEP got to the end of the chromosome (~102M base pairs) and flagged several variants within ACAN as high-confidence loss-of-function variants. Following on this success, I recreated the other 23 virtual machines with an updated image that used the GRCh38 release of LOFTEE. I also enabled the Ops Agent, a Google Cloud Platform service that monitors the memory usage of virtual machines (another thing that seems like it should be on by default). The code to create these machines was almost identical to the code already cited, so I won’t repeat it, but the code to install the Ops Agent was as follows:
:> agents_to_install.csv && \
echo '"projects/xxx/zones/us-west1-b/instances/dcotter-vep-standalone-chr15"' >> agents_to_install.csv && \
curl -sSO https://dl.google.com/cloudagents/mass-provision-google-cloud-ops-agents.py && \
python3 mass-provision-google-cloud-ops-agents.py --file agents_to_install.csv
The result: VEP ran without significant errors or warnings on all 24 chromosomes, and I uploaded the outputs to gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs, alongside the input files. The next step is to download all these files to a single machine, filter them down to just the high-confidence loss-of-function variants, combine them into a master file, and import them into Hail to use as annotations to the MVP genomes. Hopefully, the resulting burden test will match Genebass’s fairly closely.
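As a sketch of that next step (assuming the per-chromosome archives follow the naming used above), the filter-and-combine part is mostly grep:

```bash
# Pull down the 24 archives and unpack them
gsutil -m cp gs://xxx/gvcf_aggregation_100k/release_20230505/var_vcfs/rel2_100k-var-chr*-vep.tar.gz .
for f in rel2_100k-var-chr*-vep.tar.gz; do tar xzf "$f"; done

# Keep the column header once, then stack every row flagged as high-confidence LoF
grep -v "^##" rel2_100k-var-chr01-vep.tsv | head -n 1 > rel2_100k-var-hc-lof.tsv
grep -h "LoF=HC" rel2_100k-var-chr*-vep.tsv >> rel2_100k-var-hc-lof.tsv
```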
Minor problem: Rate-limiting
One problem yet to be solved: when creating the machines from the standalone-VEP image, I get the error “Operation rate exceeded for resource ‘projects/xxx/global/machineImages/dcotter-vep-standalone-image’. Too frequent operations from the source resource.” According to the GCP docs, the rate limit appears to be fixed – not something we can simply pay more to raise.
In the moment, I just re-ran the command every few minutes until all the instances were created, but Paul and I got together to talk about longer-term solutions. Paul had the idea of using a shared read-only disk; I liked it and thought I could put LOFTEE, the homo sapiens cache file, and the Docker image on one read-only disk, then use a second, read-write disk to launch the Docker container, write out the VEP results, and so on. It’s a more complicated configuration, but it would get around the rate-limiting problem.
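A sketch of what that would look like (instance and disk names are illustrative): a persistent disk can be attached read-only to many VMs at once, so the bulky, static inputs live in one place and only the small scratch disks are per-machine.

```bash
# One disk holding the static inputs (VEP cache, LOFTEE repo), attached read-only
# to every worker VM; GCE allows multiple read-only attachments of the same disk.
for chrom in 01 02 03; do   # ...and so on for the remaining chromosomes
  gcloud compute instances attach-disk dcotter-vep-standalone-chr${chrom} \
    --disk=data-disk --mode=ro --zone=us-west1-b
done
```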
Discuss! https://github.com/orgs/va-big-data-genomics/discussions/38