#5 Telomere length estimation for MVP samples
Overview
- Perform telomere estimation analysis on the MVP WGS data
- We will be using Telseq to perform the estimation
- All of the updates for the Telseq project can be seen here
Project Plan
Telseq implementation on GCP has 4 main steps
- Read length (r) estimation, followed by setting the telomere repeat threshold (k) for MVP samples, i.e. the number of TTAGGG repeats a read must contain to be counted as telomeric. Following the TopMed study, k has been set to 12 for a read length of 150.
- Telseq script creation: This script takes the converted `.bam` and passes it to Telseq, which estimates the telomere length and stores it in the output text file.
- Dockerize Telseq: Use the Docker image provided by the authors or create a new image. The main changes in the new dockerfile are the versions of samtools and Telseq; the new image has been pushed to Docker Hub.
  Note - The current Telseq run was performed using the Docker image generated by the authors.
- Dsub script to run Telseq on one `.cram`: This script specifies the locations of the input and output files, along with the Telseq script location, the machine type to be used, and other dependencies.
  - Run the dsub job: `bash telseq_dsub.sh`
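As a rough illustration of steps 1 and 4, the sketch below first estimates the read length from SAM records and then assembles a dsub call for a single CRAM. The provider, project, bucket paths, image name, environment variable names, and the `telseq.sh` script name are placeholders assumed for the example, not the values used for the MVP run. The region, machine type, read length and k come from this update.

```shell
#!/usr/bin/env bash
# Sketch only: every path, image name and variable below is a placeholder.

# Step 1: print the most common read length among SAM records on stdin.
# Usage idea: samtools view -T ref.fasta sample.cram | head -n 100000 | modal_read_length
modal_read_length() {
  awk -F'\t' '
    /^@/ || $10 == "*" { next }      # skip header lines and records without a sequence
    { count[length($10)]++ }         # column 10 of a SAM record is the read sequence
    END {
      best_len = 0; best_n = 0
      for (len in count)
        if (count[len] > best_n) { best_n = count[len]; best_len = len }
      print best_len
    }'
}

# Step 4: assemble the dsub call for one CRAM. With DRY_RUN=1 the command is
# printed instead of executed, so it can be reviewed before submitting.
submit_telseq_job() {
  local cram_uri="$1" out_uri="$2"
  set -- \
    --provider "${DSUB_PROVIDER:-google-cls-v2}" \
    --project "${GCP_PROJECT:-my-project}" \
    --regions us-west1 \
    --machine-type n1-standard-8 \
    --image "${TELSEQ_IMAGE:-example/telseq:latest}" \
    --logging "${LOG_URI:-gs://my-bucket/telseq/logs}" \
    --input "CRAM=${cram_uri}" \
    --input "REF_FASTA=${REF_FASTA_URI:-gs://my-bucket/ref/ref.fasta}" \
    --input "REF_FAI=${REF_FAI_URI:-gs://my-bucket/ref/ref.fasta.fai}" \
    --output "TELSEQ_OUT=${out_uri}" \
    --env READ_LENGTH=150 \
    --env REPEAT_K=12 \
    --script telseq.sh \
    --wait
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "dsub $*"
  else
    dsub "$@"
  fi
}
```

Calling `DRY_RUN=1 submit_telseq_job gs://my-bucket/crams/sample.cram gs://my-bucket/telseq/out/sample.txt` prints the full command without contacting GCP, which makes it easy to check the inputs before a real submission.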
Approach
- The output files generated by Telseq were populated with only 0s. None of the telomere lengths were estimated. We investigated the issue with the following steps.
- First, we checked that all necessary tools, such as samtools, bamtools and Telseq, are available on the Docker image.
- Second, we checked the `samtools view` command to make sure the `.bam` files were correct. To do this, instead of running Telseq as a single-line command, we split the command into two lines: the first produces an actual `.bam` file and the second passes the converted `.bam` to Telseq. This way we could make sure that the size and contents of the `.bam` file were as expected.
- Once the bam files were created, we used ValidateSamFile, which is part of the Picard toolset, to check the quality of the `.bam` file. The bam file was empty or not fully converted. We fixed this issue by providing a reference `.fasta` and `.fai` file to the dsub script. For now I have specified separate inputs for the `.fasta` and `.fai` files, but for the next runs I plan to specify them using the INPUT_DIR tag, which will read the references recursively.
- Region: us-west1
- Runtime per cram: 1 hour 50 minutes
- Machine type: n1-standard-8
- Although the dockerfile for TelSeq is provided, we can also extract the dockerfiles for images on Docker using dfimages. For Telseq, the provided dockerfile does not show how `samtools` was installed, so decoding the image is helpful.
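The two-step run described above could look roughly like the sketch below. File names and the `READ_LENGTH` and `REPEAT_K` variables are hypothetical. `samtools quickcheck` is used here as a lighter stand-in for the Picard ValidateSamFile check that was actually run, so it is not the same validation.

```shell
#!/usr/bin/env bash
# Sketch only: paths and variable names are placeholders.

# Step 1 of 2: convert the CRAM to a real BAM file and sanity-check it.
cram_to_bam() {
  local cram="$1" ref_fasta="$2" bam="$3"
  # CRAM decoding needs the reference; -T points samtools at the FASTA
  # (ideally with its .fai index next to it).
  samtools view -b -T "$ref_fasta" -o "$bam" "$cram" || return 1
  # An empty BAM is the failure mode seen above, so stop early.
  if [ ! -s "$bam" ]; then
    echo "error: $bam is empty; check the reference inputs" >&2
    return 1
  fi
  # quickcheck looks for a valid header and an intact end-of-file marker.
  samtools quickcheck "$bam" || return 1
}

# Step 2 of 2: hand the converted BAM to Telseq.
run_telseq() {
  local bam="$1" out="$2"
  # -r: read length, -k: repeat threshold, -o: output file
  telseq -r "${READ_LENGTH:-150}" -k "${REPEAT_K:-12}" -o "$out" "$bam"
}
```

A run would then be `cram_to_bam sample.cram ref.fasta sample.bam && run_telseq sample.bam sample.telseq.txt`, so that Telseq is never started on an empty or truncated BAM.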
Next steps
- Run Telseq for 10 CRAMs (ongoing)
- Results will be compared with the standalone study performed by Kruthika
- Perform quality checks
- Scale up for production
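For the 10-CRAM run, one option would be to keep the single-CRAM dsub script and drive it from a dsub tasks file, which is a TSV with one row per job. The sketch below only builds that file. The column names `CRAM` and `TELSEQ_OUT`, the output prefix, and the `.telseq.txt` suffix are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: column names and output naming are placeholders.

# Read CRAM URIs (one per line) on stdin and print a dsub tasks TSV.
make_tasks_tsv() {
  local out_prefix="$1" cram_uri sample
  # Header row: each column names a dsub flag and the variable it sets.
  printf '%s\t%s\n' '--input CRAM' '--output TELSEQ_OUT'
  while IFS= read -r cram_uri || [ -n "$cram_uri" ]; do
    [ -n "$cram_uri" ] || continue            # skip blank lines
    sample=$(basename "$cram_uri" .cram)      # e.g. s1 from gs://bucket/s1.cram
    printf '%s\t%s/%s.telseq.txt\n' "$cram_uri" "$out_prefix" "$sample"
  done
}

# Usage idea:
#   make_tasks_tsv gs://my-bucket/telseq/out < crams.txt > tasks.tsv
#   dsub ... --tasks tasks.tsv
```

Each row becomes one dsub task, so the per-CRAM settings (region, machine type, image) stay in a single command while the inputs and outputs vary per row.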
Discuss this update on our GitHub!