Where are we?

Current data release progress can be tracked on our board. We have a final matrix table in hand, consisting of 104,923 samples and 663,351,127 variants. All accompanying files have been generated and stored in cloud storage, except for the VCF index file. Statistical analysis is also complete, and the methods were updated to include the Asian population for this release. Figures and results have been updated in the slide deck (see below). I am actively working on three prerequisites to complete the release, the VCF index among them. All three are discussed below.

VCF index

When Hail exports a VCF file, it can optionally write a tabix index as well. While the VCF itself took 18 hours to write, the index was taking multiple days. The lack of a multithreading option in tabix may be a major cause of this. Because of the cost of running long jobs on our gcloud clusters, we stopped the indexing job in the hope that the index could be generated on the VA cluster instead. Unfortunately, limitations on the VA cluster made this difficult, and it was decided that the Palo Alto team would try again.

We found that Hail does not have a standalone tabix function. The only way to generate an index from Hail is to re-create the VCF file with the index option set to true. Re-writing a file that we already have, and that takes a day to write, is not an efficient solution. Instead, I have created a gcloud virtual machine (VM), installed tabix, copied the VCF to the VM, and begun indexing. The indexing job is currently running, and the index will be transferred to cloud storage when it completes.
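As a rough illustration of the indexing step on the VM, the sketch below wraps the tabix command line in Python. The file name is a placeholder rather than the actual release path, and the helper names are made up for this example. It assumes the VCF is bgzip-compressed and that tabix is on the PATH.

```python
import shutil
import subprocess


def build_tabix_command(vcf_path: str) -> list[str]:
    """Return the tabix invocation that indexes a bgzip-compressed VCF."""
    # "-p vcf" selects tabix's VCF preset; the index is written next to the
    # input file as <vcf_path>.tbi.
    return ["tabix", "-p", "vcf", vcf_path]


def index_vcf(vcf_path: str) -> str:
    """Run tabix on vcf_path and return the expected index path."""
    if shutil.which("tabix") is None:
        raise RuntimeError("tabix is not installed or not on PATH")
    # check=True raises CalledProcessError if tabix exits with an error.
    subprocess.run(build_tabix_command(vcf_path), check=True)
    return vcf_path + ".tbi"


if __name__ == "__main__":
    # Placeholder file name (an assumption); substitute the real VCF on the VM.
    print(build_tabix_command("release2.vcf.bgz"))
```

Once the job finishes, the resulting .tbi file is the only thing that needs to be copied back to cloud storage, since the VCF is already there.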

GWAS

Covariate data

The Jupyter notebook that Jina used to run the GWAS for Data Release 1 is available, and I have begun editing it for our second release. Most of the edits involve removing Covid data and reformatting how covariate data is input. Most of the covariate data we need is in a table that the Boston team gave us earlier in the year, titled “wgs_shipping_id_ancestry_height.zip”. This table includes metadata such as hare, sex, height, and PCA scores by ancestry group (e.g., EUR_PCA_1, EUR_PCA_2…). We are, however, still missing some of the covariate data needed to complete the GWAS analysis:

  1. age
  2. age_sq
  3. BMI
  4. PCAs 1-10 for all samples

All of this information is included in a file in our shared space named “phenotype_covid_wgs_covar.20210603.2.tsv”. However, this file only covers about 35k samples.
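To measure how large the gap is, a small check like the one below could list which samples have complete covariates. This is only a sketch: the column names (sample_id, age, BMI) and the "NA" missing-value marker are assumptions, not the actual headers of the covariate file, and age_sq is assumed to be age squared.

```python
import csv
import io

# Hypothetical column names; the real covariate file may use different headers.
REQUIRED = ["age", "BMI"]


def load_covariates(handle, id_column="sample_id"):
    """Read a tab-separated covariate table into {sample_id: row_dict}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}


def coverage_report(sample_ids, covariates, required=REQUIRED):
    """Split sample IDs into those with and without all required covariates."""
    complete, incomplete = [], []
    for sid in sample_ids:
        row = covariates.get(sid)
        has_all = row is not None and all(
            row.get(col, "") not in ("", "NA") for col in required
        )
        (complete if has_all else incomplete).append(sid)
    return complete, incomplete


def add_age_sq(row):
    """Return a copy of row with age_sq added (assumed to be age squared)."""
    out = dict(row)
    out["age_sq"] = float(row["age"]) ** 2
    return out


if __name__ == "__main__":
    # Tiny made-up table standing in for the real covariate file.
    demo = "sample_id\tage\tBMI\nS1\t50\t27.5\nS2\t61\tNA\n"
    covariates = load_covariates(io.StringIO(demo))
    complete, incomplete = coverage_report(["S1", "S2", "S3"], covariates)
    print(len(complete), "complete;", len(incomplete), "missing covariates")
```

Running the same check against the full sample list from the matrix table would give an exact count of samples that still need age, age_sq, and BMI.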

As an alternative for item 4, Jina showed her progress on generating PCA values with Hail in one of her last presentations before leaving. She showed that PCA values generated in Hail give results comparable to the PCA scores generated by the Boston team.

This raises the question of whether we should generate PCA values in Hail for this release. The main advantage is that we would no longer rely on Boston for this data. The disadvantages are the learning curve, the time needed, and having to generate PCA scores for the full sample set in addition to each ancestry group.
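For a sense of what the Hail route might involve, here is a minimal sketch built on Hail's hwe_normalized_pca. It is not Jina's method. It assumes the matrix table has a GT entry field, and the function and column names are made up for this example. Any variant filtering, LD pruning, or removal of related samples would need to happen before this step.

```python
def pc_column_names(k: int = 10) -> list[str]:
    """Column names for the first k principal components (PC1..PCk)."""
    return [f"PC{i}" for i in range(1, k + 1)]


def hail_pca_scores(mt, k: int = 10):
    """Return a Hail Table of per-sample scores with one column per PC.

    Assumes mt is a Hail MatrixTable with a GT call field (an assumption).
    """
    import hail as hl  # imported lazily so pc_column_names works without Hail

    # hwe_normalized_pca returns (eigenvalues, scores table, loadings table);
    # loadings is None unless compute_loadings=True is passed.
    _eigenvalues, scores, _loadings = hl.hwe_normalized_pca(mt.GT, k=k)

    # The scores table stores all PCs in one array field named "scores";
    # spread it into separate columns so it can be used as GWAS covariates.
    names = pc_column_names(k)
    return scores.annotate(
        **{name: scores.scores[i] for i, name in enumerate(names)}
    ).drop("scores")


if __name__ == "__main__":
    print(pc_column_names(10))
```

Per-ancestry scores would mean running the same function on a matrix table filtered to each group, which is where most of the extra compute time would go.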

Genotype GWAS data

Jina’s GWAS notebook imports two files that are noted as “GWAS results from Bryan”. These files are named EUR_GWAS.height.glm.linear.tsv.bgz and height_gwas.imputed_snps.WGS_samples.sumstats.tsv.gz. The latter appears to be an updated version of the former, and both contain GWAS results from the genotype array data, which Jina compared against the WGS GWAS. A question for the Boston team: do updated versions of these files exist for the expanded dataset?

Data formatting for slide deck

I am currently gathering all the data we have generated into a slide deck, much like the one Jina presented for Data Release 1. The current slides can be found here.

Discuss

Join the discussion on our GitHub!