#5 Telomere length estimation for MVP samples

Overview

The TelSeq implementation on GCP has 4 main steps:

Read length (r) estimation, followed by setting the telomere repeat length (k) for MVP samples. Following the TOPMed study, for a read length of 150 the telomere repeat length was set to 12.
TelSeq script creation: this script takes the converted .bam and passes it to TelSeq, which estimates the telomere length and stores it in the output text file.
Dockerize TelSeq: use the Docker image provided by the authors or create a new image. The main changes in the new Dockerfile are the versions of samtools and TelSeq; the new image has been pushed to Docker Hub.
Note: the current TelSeq run was performed using the Docker image generated by the authors.
Dsub script to run TelSeq on one .cram: this script specifies the locations of the input and output files, along with the TelSeq script location, the machine type to be used, and other dependencies.
Run the dsub job: bash telseq_dsub.sh
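
As a rough illustration of what such a dsub script could contain, here is a minimal sketch. The project, bucket paths, image name, log location, and the script name telseq.sh are hypothetical placeholders, and the provider value is an assumption (the post does not say which one was used). The region and machine type are the ones reported in this update, and the two reference inputs reflect the fix described further down.

```shell
# Sketch only: every default below (project, buckets, image, telseq.sh)
# is a hypothetical placeholder, not the value used for the MVP run.
submit_telseq_job() {
  cram_uri="$1"   # e.g. gs://my-bucket/crams/sample.cram
  out_uri="$2"    # e.g. gs://my-bucket/telseq/sample.txt
  # Each --input is copied into the VM and exposed to the script as an
  # environment variable holding the local path; --output is copied back.
  dsub \
    --provider "${DSUB_PROVIDER:-google-cls-v2}" \
    --project "${GCP_PROJECT:-my-project}" \
    --regions us-west1 \
    --machine-type n1-standard-8 \
    --logging "${LOG_URI:-gs://my-bucket/logs}" \
    --image "${TELSEQ_IMAGE:-my-registry/telseq:latest}" \
    --input CRAM="$cram_uri" \
    --input REF_FASTA="${REF_FASTA_URI:-gs://my-bucket/ref/ref.fasta}" \
    --input REF_FAI="${REF_FAI_URI:-gs://my-bucket/ref/ref.fasta.fai}" \
    --output OUT_TXT="$out_uri" \
    --script telseq.sh \
    --wait
}
```

Calling `submit_telseq_job gs://my-bucket/crams/sample.cram gs://my-bucket/telseq/sample.txt` would submit one job for one .cram and block until it finishes.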

The output files generated by TelSeq were populated with only 0s; none of the telomere lengths were estimated. We investigated the issue with the following steps.
First, we checked that all necessary tools, such as samtools, bamtools and TelSeq, are available in the Docker image.
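
One way to run this kind of check from a shell inside the container is sketched below. The helper name missing_tools is made up for illustration; it only relies on the standard `command -v` lookup.

```shell
# Print the names of the given tools that are NOT found on PATH
# (prints an empty line when everything is available).
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  printf '%s\n' "$missing"
}
```

Inside the image, `missing_tools samtools bamtools telseq` would print nothing if all three tools are installed, and otherwise list the ones that are absent.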
Second, we checked the samtools view command to make sure the .bam files were correct. Instead of running TelSeq as a single-line command, we split the command into two: the first produces an actual .bam file, and the second passes the converted .bam to TelSeq. This way we could make sure that the size and contents of the .bam file were as expected.
Once the .bam files were created, we used ValidateSamFile, which is part of the Picard toolset, to check the quality of the .bam file. The .bam file was empty or not fully converted. We fixed this by providing a reference .fasta and .fai file to the dsub script, since samtools needs the reference sequence to decode a .cram. For now I have specified separate inputs for the .fasta and .fai files, but for the next runs I plan to specify them using the INPUT_DIR tag, which will read the references recursively.
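
The two-step version of the TelSeq script, with the reference supplied, might look roughly like the sketch below. The function name and argument order are hypothetical, and the exact flags used in the real script are not shown in this post; `-r 150` and `-k 12` are the values given above. The size check is only a cheap guard against an empty .bam and does not replace ValidateSamFile.

```shell
# Sketch: convert one .cram to .bam with an explicit reference, then run TelSeq.
# Arguments: <input.cram> <reference.fasta> <output.txt>
cram_to_telseq() {
  cram="$1"
  ref_fasta="$2"
  out_txt="$3"
  bam="${cram%.cram}.bam"
  # Step 1: write a real BAM file. -T names the reference FASTA;
  # samtools looks for the matching .fai index next to it.
  samtools view -b -T "$ref_fasta" -o "$bam" "$cram" || return 1
  # Stop early if the conversion produced an empty file.
  [ -s "$bam" ] || { echo "empty BAM: $bam" >&2; return 1; }
  # Step 2: estimate telomere length from the converted BAM.
  telseq -r 150 -k 12 -o "$out_txt" "$bam"
}
```

Keeping the intermediate .bam on disk is what makes it possible to inspect its size or validate it before TelSeq runs.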
- Region: us-west1
- Runtime per cram: 1 hour 50 minutes
- Machine type: n1-standard-8
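
To confirm that a rerun no longer produces the all-zero output described earlier, a small check along these lines could be applied to each result file. It assumes the TelSeq output is tab-separated with a header row containing a LENGTH_ESTIMATE column; that column name is an assumption here and should be verified against an actual output file.

```shell
# Count data rows whose LENGTH_ESTIMATE is zero or non-numeric.
# Assumes a tab-separated file with a header row (see note above).
zero_estimate_rows() {
  awk -F '\t' '
    NR == 1 {
      # Locate the estimate column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "LENGTH_ESTIMATE") col = i
      next
    }
    col && ($col + 0) == 0 { zeros++ }
    END { print zeros + 0 }
  ' "$1"
}
```

A result of 0 from `zero_estimate_rows sample.txt` would mean every row has a non-zero estimate; note that it also prints 0 if the column is not found, so the header should be checked first.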
Although the Dockerfile for TelSeq is provided, we can also extract the Dockerfiles of Docker images using dfimages. For TelSeq, the Dockerfile does not show how samtools was installed, and hence decoding the image is helpful.
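
As an alternative that needs only the Docker CLI (a different technique from dfimages, not what was used here), the build command recorded for each layer can be listed with `docker history` and filtered for a keyword. The helper name below is hypothetical, and the image must already be pulled locally.

```shell
# Print the recorded build command of every layer that mentions a keyword.
# Arguments: <image> <keyword>
layers_mentioning() {
  image="$1"
  keyword="$2"
  # --no-trunc keeps the full command; the format prints one layer per line.
  docker history --no-trunc --format '{{.CreatedBy}}' "$image" | grep -i -- "$keyword"
}
```

For example, `layers_mentioning <telseq-image> samtools` would show any layer whose build command mentions samtools. Layers created by copying files in will not reveal how a tool was built, so this is only a first look.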

Discuss this update on our GitHub!