#28 MVP Release 2 variants #1: Analyzing variant info on BigQuery
Summary of findings:
- 663,351,127 total variants.
- 318,687,479 (48%) variants are singletons.
- 627,219,244 (94.5%) variants are rare (MVP allele frequency < 0.1%).
- 14,516,901 (2%) variants are uncommon (MVP allele frequency between 0.1% and 1%).
- 13,178,510 (1.9%) variants are common (MVP allele frequency > 1%).
Introduction
Hail variant information
As we approach releasing our second set of whole genome sequencing variant data, researchers will be interested in understanding which variants are included in our dataset. The rows table of our Hail matrix table includes metadata and summary statistics for each variant, generated using the Hail `variant_qc()` method. While the entire matrix table is almost 200 terabytes, the rows table is only a fraction of that size at about 200 gigabytes. Sharing this information with current or prospective users allows them to better understand the dataset before applying it to their research questions.
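For context, the rows table can be produced with a few lines of Hail. This is a minimal sketch rather than our exact pipeline; the matrix table and output paths are placeholders.

```python
import hail as hl

hl.init()

# Read the release matrix table (path is a placeholder).
mt = hl.read_matrix_table("gs://{wgs-data-release-2-bucket}/release.mt")

# variant_qc() annotates each variant with summary statistics
# (AC, AF, call rate, homozygote counts, etc.) in a `variant_qc` row field.
mt = hl.variant_qc(mt)

# The rows table carries only per-variant fields -- no genotypes --
# which is why it is ~200 GB rather than ~200 TB.
rows = mt.rows()

# flatten() promotes nested struct fields to top-level columns with
# dot-separated names (e.g., variant_qc.AC); export writes a
# block-gzipped TSV.
rows.flatten().export("gs://{wgs-data-release-2-bucket}/variant_info.tsv.bgz")
```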
Using BigQuery as a variant store
One option for sharing variant information is BigQuery, a high-performance columnar database available on Google Cloud Platform.
Advantages
- Fast queries. BigQuery scales compute resources to match the size of our data and runs queries quickly.
- Serverless. The infrastructure for running BigQuery is maintained by Google, so you don’t have to set up, maintain, or configure any compute resources. And it’s always available.
- Shareable. BigQuery allows you to share tables publicly, and to control which columns within a table are made public. In addition to sharing your own datasets, you can join your tables with existing public datasets, including gnomAD, dbSNP, and ClinVar.
- Machine learning. The new BigQuery ML library promises to run machine learning algorithms through SQL queries, though I’ve never used it.
Obstacles
- Cost. Every query costs money, and users need to be associated with a Google Cloud billing account to get started.
- Access. Users access BigQuery through the Google Cloud Platform web console and must have a Google account associated with a billing account. For Million Veteran Program researchers, this could be a challenge.
- SQL. BigQuery uses the standard SQL query language but has some BigQuery-specific quirks.
For our data, I think the ability to share variant data publicly is particularly valuable. It would offer potential collaborators, or researchers not yet part of the Million Veteran Program, a relatively low-cost option (in both money and barrier to entry) to explore the dataset. It’s also a solution already used by another large-scale genomic sequencing program, gnomAD.
Importing variant data into BigQuery
Get the variant information table
Joe had already exported the variant rows table from the Data Release 2 matrix table and stored it on cloud storage; the path to the file is stored as `FINAL_VAR_INFO_PATH` in the `file_key.yaml` object. I created a virtual machine (e2-medium) with 2 TB of disk to do the transformation. The compressed rows object is only ~30 GB, but uncompressed it is 200 GB. I copied the compressed object to my virtual machine and then decompressed it for import into BigQuery.
NOTE: The decompression turned out to be time consuming and unnecessary. I initially thought I would import the table directly into BigQuery, but instead opted to convert it to JSONL format first. In the future, I would use the `gzip` library to read the compressed table, perform the transformations, and write JSONL directly.
Transform TSV formatted table to JSONL
The rows table is natively stored as a tab-separated values file. I initially tried importing a subset of the table directly into BigQuery but ran into problems, particularly with array fields. Instead, I converted the TSV rows table to JSON Lines (JSONL) format, which produces larger input files but proved easier and more reliable to import because each record carries explicit structure.
In addition to converting from TSV to JSONL, I also performed the following data transformations (a sketch of the full conversion follows the list):
- Replace the ‘.’ character in column names/keys with ‘_’ to conform to BigQuery naming conventions.
- Split the `locus` field into separate `contig` and `position` columns. An example locus value looks like this: `chr1:13047`. I imagine users will want to query and join this table by chromosome and base pair position, and keeping the two concatenated makes that difficult.
- Convert array values from strings to arrays/repeated values. This should make it easier to parse array data and match values across columns (e.g., the allele count for the first alternate allele).
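Below is a minimal sketch of the conversion, written the way the note above suggests: streaming the compressed TSV through the `gzip` library rather than decompressing it on disk first. The field names and the exact set of array columns are assumptions for illustration.

```python
import gzip
import json

# Fields exported by Hail as serialized arrays, e.g. "[0,1]" or '["A","T"]'.
# The exact field names here are assumptions based on the renamed columns.
ARRAY_FIELDS = {"alleles", "variant_qc_AC", "variant_qc_AF",
                "variant_qc_homozygote_count"}

def transform(row: dict) -> dict:
    out = {}
    for key, value in row.items():
        key = key.replace(".", "_")  # conform to BigQuery naming rules
        # Parse stringified arrays into real lists (BigQuery repeated values).
        out[key] = json.loads(value) if key in ARRAY_FIELDS else value
    # Split the locus (e.g. "chr1:13047") into contig and position columns.
    contig, position = out.pop("locus").split(":")
    out["contig"] = contig
    out["position"] = int(position)
    return out

# Stream the compressed TSV directly; no need to decompress on disk first.
with gzip.open("variant_info.tsv.bgz", "rt") as tsv, \
        open("variant_info.jsonl", "w") as jsonl:
    header = tsv.readline().rstrip("\n").split("\t")
    for line in tsv:
        row = dict(zip(header, line.rstrip("\n").split("\t")))
        jsonl.write(json.dumps(transform(row)) + "\n")
```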
I used a Python script to convert the entire 200 GB TSV into a 600 GB JSONL file on a single processor, which took several days. To speed this up, my first thought was to shard the TSV and convert the shards in parallel, but as mentioned above I would rather avoid decompressing the TSV in the first place, so I’m not sure of the best approach yet. In the end I just let it run over the weekend, so it wasn’t that big a deal.
After generating a single very large JSONL file, I used another Python script to split it into 600 shards of roughly 1 GB each. My thinking was that BigQuery could parallelize the load if the data were split across objects; I wasn’t sure it would do so with all the data in a single file (and I’m still not).
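Here is a sketch of that splitting step, rotating to a new shard file at roughly 1 GB boundaries so no JSON record is split across shards (file names are illustrative):

```python
SHARD_BYTES = 1 << 30  # target shard size, ~1 GB

shard_index = 0
bytes_written = 0
shard = open(f"shard_{shard_index:04d}.jsonl", "w")

with open("variant_info.jsonl") as jsonl:
    for line in jsonl:
        # Rotate on line boundaries so each record stays intact.
        if bytes_written >= SHARD_BYTES:
            shard.close()
            shard_index += 1
            bytes_written = 0
            shard = open(f"shard_{shard_index:04d}.jsonl", "w")
        shard.write(line)
        bytes_written += len(line)

shard.close()
```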
Import JSONL shards into BigQuery
Once I had the JSONL shards on my virtual machine, I transferred them to cloud storage and then used the `bq` command line tool to import them into a BigQuery dataset. Copying the data to cloud storage took less than an hour, and importing the entire dataset into BigQuery took about a minute.
Another convenience of the JSONL format is that I didn’t have to define the schema; BigQuery detected it automatically. Still, it probably wouldn’t hurt to define the schema explicitly to ensure consistency with future releases (see the sketch after the import command below).
BigQuery import command:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON wgs_data_release_2.hail_variant_info gs://{wgs-data-release-2-bucket}/gvcf_aggregation_100k/release_20230505/rel2_100k_variant_jsonl_shards/shard_*.jsonl
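If we do want to pin down the schema, the BigQuery Python client accepts an explicit field list at load time. This is a sketch with a few illustrative fields; the real table has many more `variant_qc` columns, and the types shown are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client(project="{mvp-project}")

# Illustrative subset of the schema; the real table has many more columns.
schema = [
    bigquery.SchemaField("contig", "STRING"),
    bigquery.SchemaField("position", "INTEGER"),
    bigquery.SchemaField("alleles", "STRING", mode="REPEATED"),
    bigquery.SchemaField("variant_qc_AC", "INTEGER", mode="REPEATED"),
    bigquery.SchemaField("variant_qc_AF", "FLOAT", mode="REPEATED"),
    bigquery.SchemaField("variant_qc_call_rate", "FLOAT"),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

job = client.load_table_from_uri(
    "gs://{wgs-data-release-2-bucket}/gvcf_aggregation_100k/release_20230505/"
    "rel2_100k_variant_jsonl_shards/shard_*.jsonl",
    "wgs_data_release_2.hail_variant_info",
    job_config=job_config,
)
job.result()  # block until the load job completes
```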
Analyzing variant data in BigQuery
With the Data Release 2 variant data in BigQuery, I ran some SQL queries to explore the data and to try replicating results Joe had generated with Hail. For team members who would like to explore the data, the table is located in the production project as `wgs_data_release_2.hail_variant_info`.
How many variants are in the dataset?
SELECT COUNT(1) AS variant_count
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
Result
Row | variant_count |
---|---|
1 | 663351127 |
663,351,127 variants.
How many variants are singletons?
SELECT COUNT(1) AS singleton_count
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_AC[1] = 1
Result
Row | singleton_count |
---|---|
1 | 318687479 |
318,687,479 of 663,351,127 (48%) variants are singletons.
How many variants have no alternate alleles in the dataset?
SELECT COUNT(1) AS no_alt_alleles
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_AC[1] = 0
Result
Row | no_alt_alleles |
---|---|
1 | 8436397 |
There are 8,436,397 (1.3%) variants with no alternate alleles present.
How many rare variants are present?
SELECT COUNT(1) AS rare_variants
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_AF[1] < 0.001
AND variant_qc_AF[1] > 0
Result
Row | rare_variants |
---|---|
1 | 627219244 |
627,219,244 of 663,351,127 (94.5%) variants are rare (MVP allele frequency <0.1%).
How many variants are uncommon?
SELECT COUNT(1) AS uncommon_variants
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_AF[1] < .01
AND variant_qc_AF[1] > 0.001
Result
Row | uncommon_variants |
---|---|
1 | 14516901 |
14,516,901 of 663,351,127 (2%) variants are uncommon (MVP allele frequency between 0.1% and 1%).
How many variants are common?
SELECT COUNT(1) AS common_variants
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_AF[1] > .01
Result
Row | common_variants |
---|---|
1 | 13178510 |
13,178,510 of 663,351,127 (1.9%) variants are common (MVP allele frequency > 1%).
Get variant counts by chromosome
SELECT contig, COUNT(1) AS variant_count
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
GROUP BY contig
Results
Row | contig | variant_count
---|---|---
1 | chr19 | 13562486
2 | chr17 | 17934232
3 | chr14 | 20338856
4 | chr11 | 30797767
5 | chr10 | 30079604
6 | chr2 | 55235295
7 | chr1 | 50354078
8 | chr20 | 14268920
9 | chr4 | 44120441
10 | chr7 | 35944807
11 | chr5 | 40802899
12 | chr3 | 45357155
13 | chr18 | 17247139
14 | chr6 | 38513297
15 | chr22 | 8765950
16 | chrX | 33199253
17 | chr8 | 35346863
18 | chrY | 3602940
19 | chr12 | 29719936
20 | chr15 | 18658893
21 | chr16 | 20654966
22 | chr21 | 8443887
23 | chr9 | 28315503
24 | chr13 | 22085960
These results match Joe’s calculations performed in Hail, and they were much easier to generate.
Get average variant call rate by chromosome
SELECT contig, AVG(variant_qc_call_rate) AS avg_variant_call_rate
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
GROUP BY contig
Results
Row | contig | avg_variant_call_rate
---|---|---
1 | chr10 | 0.99567887972660751
2 | chr19 | 0.99286330931143363
3 | chr20 | 0.99528284648452714
4 | chr14 | 0.99533632015242224
5 | chr12 | 0.99585806523809539
6 | chr15 | 0.99457693806272363
7 | chr11 | 0.99599855061829679
8 | chr1 | 0.99510411608390437
9 | chr16 | 0.994414110412962
10 | chr17 | 0.99427322989186062
11 | chr22 | 0.992654612574794
12 | chr13 | 0.99596520686626244
13 | chr6 | 0.9961985031042655
14 | chr4 | 0.99621042682936789
15 | chr7 | 0.99549647752316439
16 | chr3 | 0.99633056191751879
17 | chr21 | 0.99372139594596853
18 | chr8 | 0.99598523462520538
19 | chr5 | 0.99630591361731413
20 | chrY | 0.8645258836283678
21 | chr2 | 0.995999184525582
22 | chr9 | 0.99443882679993179
23 | chr18 | 0.9963189734546698
24 | chrX | 0.954729023651225
Count variants where the alternate allele is only present in a single individual
SELECT COUNT(1) AS alt_allele_in_single_person
FROM `{mvp-project}.wgs_data_release_2.hail_variant_info`
WHERE variant_qc_homozygote_count[1] = 1
OR variant_qc_AC[1] = 1
Result
Row | alt_allele_in_single_person |
---|---|
1 | 335112487 |
335,112,487 of 663,351,127 (50%) variants have their alternate allele carried by a single individual (one heterozygote or one homozygote). Note that the homozygote condition can also match variants with additional heterozygous carriers, so this count may be a slight overestimate.
Future directions
- Join our variant table with functional information from other public BigQuery tables to functionally characterize Data Release 2 variants.
- Explore options for importing additional annotation resources into BigQuery.
- Consider whether it is worth making this information public, given the high probability of being able to reidentify individuals who have been sequenced.
Discuss
Join the discussion of this post on GitHub!