Telomere length estimation in MVP WGS data

Leukocyte telomere length estimation for 103,763 MVP WGS genomes has been completed. The analysis notebook will soon be uploaded to the mvp-telomere-analysis git repo. Regarding the number of genomes:

  • 8562 genomes from LTL Pilot study are not a part of WGS DR2.
  • 808 genomes that were a part of WGS DR2 were not a part of the LTL study.

Sample count:

   
Number of genomes in WGS DR2 104,923
Number of genomes that have demographic information 104,901
Number of genomes in LTL Pilot study 84,677
Number of genomes in LTL 2nd run 31,873
Number of missing ID’s between WGS DR2 and LTL 808
Number of genomes in TelSeq 3rd run 788 (808-20)

Note: 20 genomes from WGS DR2 do not have cram locations available.

1. Filteration:

  • In total, Telomere length estimation has been completed for 117,338 genomes.
  • After removing the 8562 genomes (samples in Telseq Pilot but NOT a part of WGS DR2), the number drops down to 108,776.
  • From the 108776, the number of genomes common between WGS DR2 and TelSeq = 104,903.

Hare distribution within the 104,903 genomes:

hare count
EUR 72,916
AFR 24,612
ASN 5,548
HIS 687
NA 1118
  • NA (1118) samples have been removed for all subsequent analysis, dropping the number to 103,763.
  • 22 genomes do not have any demographic information.

2. Distribution of ‘Age’

2.1. Across 103,763 MVP WGS genomes

Distribution of 'Age' across 103,763 MVP WGS genomes

  Pilot study Final study
Mean age 65 years 65 years
Standard deviation 14 years 13 years
Minimum age 20 years 14 years
Maximum age 101 years 99 years

3. Distribution of estimated LTL across 103,763 MVP WGS genomes

Distribution of estimated LTL across 103,763 MVP WGS genomes

Mean LTL for the Pilot study was 2.02kb.

4. Correlation between ‘Age’ and ‘LTL’

4.1. Across all 103,763 genomes

Correlation between 'Age' and LTL across all genomes

Correlation co-efficient between Age and LTL in the Pilot study = -0.26 (p value < 1e-16)

4.2. Stratified by Sex

Correlation between 'Age' and LTL - stratified by Sex

Correlation co-efficient between Age and LTL in the Pilot study for each sex:

  • Male = -0.24 (p value < 1e-16)
  • Female = -0.24 (p value = 4.8252e-68)

4.3. Stratified by HARE

Correlation between 'Age' and LTL - stratified by HARE

Correlation co-efficient between Age and LTL in the Pilot study for each HARE group:

  • AFR = -0.23 (p value = 1.36e-225)
  • EUR = -0.27 (p value <1e-16)
  • HIS = -0.22 (p value <1e-16)
  • ASN = -0.21 (p value <1e-16)

5. Comparing estimated LTL across 103,763 genomes

5.1. Stratified by Sex

Comparing estimated LTL - stratified by Sex

Mean estimated LTL in Pilot study, by sex:

  • Male = 2.01kb
  • Female = 2.16kb

5.2. Stratified by HARE

Comparing estimated LTL - stratified by HARE

Mean estimated LTL in Pilot study, per HARE group:

  • AFR = 2.07kb
  • EUR = 2.01kb
  • HIS = 2.03kb
  • ASN = 2.0kb

6. Variation in correlation between Age and estimated LTL, every 20 years

Understanding how correlation between Age and LTL varies, every 20 years

Next steps:

  • Push completed notebook and figures to the GitHub repository
  • Share telomere results with DACS team

CHIP calling in MVP WGS data

GATK Mutect2 pipeline has been successfully completed for 100 genomes. The pipeline was executed using the docker image broadinstitute/gatk:latest. The docker image is bulky as it contains all the tools that are a part of the GATK suite, and hence requires a ‘boot-disk-space’ allocation of 20Gb. Along with this, the tool also requires a reference dictionary file used to describe a reference genome assembly. The other depencies include the input CRAM, index CRAI file, reference fasta file, list of CHIP driver genes and the germline resource file which is a population VCF of germline sequencing containing allele fractions.

Mutect2 script

#!/bin/bash 

#Extract file name
export CRAM_NAME="$(basename $CRAM .cram).vcf"

#Run Mutect2
gatk Mutect2 \
    -I $CRAM \ 
    -L $CHIP_GENES \ 
    -R $REF \ 
    --germline-resource $GERM_RES \
    --genotype-germline-sites \ 
    -O "$(dirname "${OUTPUT}")/${CRAM_NAME}"

dsub job script

dsub \
--provider google-v2 \
--project PROJECT_ID \
--regions us-west1 \
--machine-type n1-standard-1 \
--logging gs://PATH/TO/LOG/ \
--name chip_mvp \
--image broadinstitute/gatk:latest \
--boot-disk-size 20 \
--task "gs://PATH/TO/task.tsv" \
--script "gs://PATH/TO/mutect2_script.sh"

Tab-seperated task file

The TSV file utilized in the batch dsub script serves as a structured data input file, containing tab-separated values organized in rows and columns. Each row represents a distinct task or job to be executed by the dsub command, with each column corresponding to specific parameters or attributes associated with the tasks. For ease of reading the file has been transformed here. Ideally, column1 (Parameter) woud be row1 and column2 (File path) would be row2.

Parameter File path
–input CRAM gs://PATH/TO/.cram
–input CRAI gs://PATH/TO/.cram.crai
–input CHIP_GENES gs://PATH/TO/CHIP_NEJM2017_74genes.bed
–input GERM_RES gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
–input GERM_RES_INDEX gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi
–input REF gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
–input REF_FAI gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai
–input REF_DICT gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict
–output OUTPUT gs://PATH/TO//Output/*.vcf

VCF files for the initial 100 samples have been generated and are on GCS. The next steps would be:

  • Extract Mutect2 compute time for 100 genomes from the log files.
  • Cost estimation for 100 samples.
  • Update CHIP calling GitHub repository with scripts.
  • Share results for pilot with Alex and team to discuss next steps.

Join the discussion here!