Overview

  • Perform telomere estimation analysis on the MVP WGS data
  • We will be using Telseq to perform the estimation
  • All of the updates for the Telseq project can be seen here

Project Plan

The Telseq implementation on GCP has five main steps:

  • Read length (r) estimation, followed by setting the telomeric repeat threshold (k) for the MVP samples. In the TOPMed study, k was set to 12 for a read length of 150 bp (a rough read-length check is sketched after this list).

  • Telseq script creation : this script takes the converted .bam and passes it to Telseq, which estimates the telomere length and writes it to an output text file (a sketch of the script follows this list).

  • Dockerize Telseq : use the Docker image provided by the authors or create a new one. The main changes in the new Dockerfile are updated versions of samtools and Telseq, and the new image has been pushed to Docker Hub (one possible Dockerfile is sketched after this list).
    Note - The current Telseq run was performed using the Docker image generated by the authors

  • Dsub script to run Telseq on one .cram. This script specifies the input and output file locations, the Telseq script location, the machine type to be used, and other dependencies (a dsub sketch follows this list).

  • Run the dsub job : bash telseq_dsub.sh
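
As a rough illustration of the read-length (r) estimation step, here is a minimal sketch, assuming samtools is on the PATH and a reference is available for CRAM decoding; the file names are placeholders, not the actual MVP paths.

    # Estimate the mean read length from the first 10,000 reads of a CRAM.
    # sample.cram and reference.fasta are placeholder names.
    samtools view -T reference.fasta sample.cram \
      | head -n 10000 \
      | awk '{ sum += length($10) } END { print "mean read length:", sum / NR }'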
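
The Telseq script itself can be as small as the sketch below: convert the .cram to a .bam, then run Telseq on it. This is a hedged reconstruction, not the exact script used in the run; in the dsub job the placeholder variables would instead come from the --input/--output environment variables.

    #!/bin/bash
    set -euo pipefail

    # Placeholder paths; in the dsub run these are supplied as environment variables.
    CRAM=sample.cram
    REF=reference.fasta
    BAM=sample.bam
    OUT=sample_telseq.txt

    # Step 1: convert the .cram to an actual .bam (the reference is needed for CRAM decoding).
    samtools view -b -T "$REF" -o "$BAM" "$CRAM"

    # Step 2: run Telseq on the converted .bam and write the estimates to a text file.
    # -r sets the read length and -k the telomeric-repeat threshold (12 for 150 bp reads).
    telseq -r 150 -k 12 -o "$OUT" "$BAM"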
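
If a new image is needed, one possible Dockerfile is sketched below. This is an assumption on my part (a Bioconda-based install rather than the authors' recipe), intended only to show where the samtools and Telseq versions would be pinned.

    # Hypothetical alternative to the authors' image: install samtools and Telseq from Bioconda.
    FROM continuumio/miniconda3
    RUN conda install -y -c conda-forge -c bioconda samtools telseq \
        && conda clean -afy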
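
For the dsub step, a sketch of what telseq_dsub.sh might contain is below, assuming the google-cls-v2 provider; the project, bucket, and image names are placeholders, not the actual MVP values.

    #!/bin/bash
    # Hypothetical dsub invocation for a single .cram; all names and paths are placeholders.
    dsub \
      --provider google-cls-v2 \
      --project my-gcp-project \
      --regions us-west1 \
      --machine-type n1-standard-8 \
      --image mydockerhubuser/telseq:latest \
      --logging gs://my-bucket/telseq/logs/ \
      --input CRAM=gs://my-bucket/crams/sample.cram \
      --input REF_FASTA=gs://my-bucket/reference/reference.fasta \
      --input REF_FAI=gs://my-bucket/reference/reference.fasta.fai \
      --output OUT=gs://my-bucket/telseq/output/sample_telseq.txt \
      --script telseq_script.sh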

Approach

  • The output files generated by Telseq were populated with only 0’s; none of the telomere lengths were estimated. We investigated the issue with the following steps.

  • First, we checked that all necessary tools, such as samtools, bamtools, and Telseq, are available in the Docker image.

  • Second, we checked the samtools view command to make sure the .bam files were correct. Instead of running Telseq as a single piped command, we split it into two steps: first produce an actual .bam file, then feed the converted .bam to Telseq (the two-step flow sketched in the script above). This way we could make sure that the size and contents of the .bam file were as expected.

  • Once the .bam files were created, we used ValidateSamFile, part of the Picard toolset, to check the quality of the .bam file (an example check follows this list). The .bam file was empty/not fully converted. We fixed this issue by providing a reference .fasta and .fai file to the dsub script. For now I have specified separate inputs for the .fasta and .fai files, but for the next runs I plan to specify this using the INPUT_DIR tag, which reads the references recursively.
    • Region : us-west1
    • Runtime per cram : 1 hour 50 minutes
    • Machine type : n1-standard-8
  • Although the Dockerfile for TelSeq is provided, we can also extract the Dockerfile of an existing image using dfimage. For TelSeq, the provided Dockerfile does not show how samtools was installed, so decoding the image is helpful (an example follows this list).
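
For the ValidateSamFile check, a minimal example is below, assuming Picard is available as picard.jar; the file names are placeholders. It also shows the planned recursive-reference change as a single dsub flag.

    # Check the converted .bam; an empty or truncated file shows up as errors here.
    java -jar picard.jar ValidateSamFile I=sample.bam MODE=SUMMARY

    # Planned change to the dsub script: pull the whole reference directory recursively
    # instead of passing separate .fasta and .fai inputs (bucket path is a placeholder).
    #   --input-recursive INPUT_DIR=gs://my-bucket/reference/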
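
To recover the Dockerfile from an already-built image, one commonly used packaging of dfimage is the alpine/dfimage image; the example below is a hedged sketch, and the target image name is a placeholder.

    # Reconstruct the Dockerfile steps of a local image via the Docker socket.
    docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
      alpine/dfimage -sV=1.36 myuser/telseq:latest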

Next steps

  • Run Telseq for 10 CRAMs (ongoing)
  • Results will be compared with the standalone study performed by Kruthika
  • Perform quality checks
  • Scale up for production

Discuss this update on our GitHub!