Secret Manager

During the development of our last data release, Paul Billing-Ross introduced us to Secret Manager, and we had some success incorporating it into some of his notebooks. Secret Manager allows users to store hardcoded paths, defined in a YAML file, as a secret that can then be accessed from within the Python environment.

While processing our current data release, I generated a master list of paths for all files used and produced during the release. This file is in YAML format and is stored in the root directory of our release bucket. I began trying to incorporate this YAML into Secret Manager, but had limited success; many of the issues I faced involved getting Secret Manager to work on Dataproc. As deadlines for the release loomed, I set a goal of getting this working after the release had been completed. To prepare for this, I stripped any hardcoded paths from my Python scripts and notebooks before pushing to Git, replacing them with their corresponding variable names from the YAML file.

With FedRamp “ramping up,” data privacy is more important to us than ever before. As we test the new system, beginning to incorporate Secret Manager fits well with the theme of data security that FedRamp encompasses.

Sample Dashboard

One task I recently received at our in-person MVP meeting involved the generation of a sample dashboard. Rather than the kind of dashboard you may be familiar with (i.e., our cloud dashboard GUI), this dashboard is a simple text file that keeps track of received and processed WGS samples. It incorporates data from seven different sources:

  1. A file containing a list of all received samples from Personalis. This table was generated in Neo4j with the following query:
MATCH (n:PersonalisSequencing)
RETURN DISTINCT n.sample AS sample
  2. A file containing a list of samples that have been processed in our variant calling ($5 GATK) pipeline. This table was generated in Neo4j with the following query:
MATCH (n:PersonalisSequencing)<-[:WAS_USED_BY]-(:Sample)<-[:GENERATED]-(:Person)-[:HAS_BIOLOGICAL_OME]->(:Genome)-[:HAS_VARIANT_CALLS]->(v:Merged:Vcf)
RETURN DISTINCT n.sample AS sample
  3. A table containing a list of samples that have undergone QC. An additional column specifying whether a sample has passed, failed, or is missing QC info (NA) is also included. This file was generated with a Python script (see below).

  4. A table of aggregated gVCF files. This table is stored as a Variant Dataset (VDS).

  5. A table of samples that have passed population QC thresholds. This table is in a Hail Matrix Table (MT) format.

  6. A final table of samples from Stanford before sending off to DACS. Samples such as technical duplicates are removed prior to table generation. This table is in MT format.

  7. A file of all samples released by DACS. This table accounts for additional removed samples prior to release, such as patient withdrawal. This file was sent to us by DACS.

These files are merged to keep track of where each sample is at every step, and the result will later be joined with another table (currently being generated by DACS) containing information on samples sent to Personalis.
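
As a rough illustration of that merge step (a minimal sketch with hypothetical file and column names, not the actual dashboard script), each per-source sample list can be flagged and joined on the sample ID:

import pandas as pd

# Hypothetical per-source sample lists; the real paths come from the release YAML
sources = {
    'received': 'received_samples.csv',
    'variant_called': 'variant_called_samples.csv',
    'qc': 'qc_samples.csv',
}

dashboard = None
for step, path in sources.items():
    df = pd.read_csv(path)[['sample']].drop_duplicates()
    df[step] = True
    # Outer merge keeps samples that are missing from later steps
    dashboard = df if dashboard is None else dashboard.merge(df, on='sample', how='outer')

# Samples absent from a source get False for that step
dashboard = dashboard.fillna(False)
dashboard.to_csv('sample_dashboard.tsv', sep='\t', index=False)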

I have generated the Stanford portion of the dashboard using two Python scripts. The first creates the QC info table (Table 3) and the second outputs the sample dashboard of all samples in the Stanford Google Cloud environment. Both were originally written without Secret Manager and validated for functionality; I have since incorporated Secret Manager into both scripts before pushing to Git.

Configuring Secret Manager

Before beginning, please make sure you have the Secret Manager client library installed on the machine that will be running your scripts. If you are running on a Dataproc cluster, you may need to install it on both the cluster and the machine you are submitting the job from.

pip install google-cloud-secret-manager

The next step is to generate a YAML file listing the files stored on Google Cloud. Each entry should contain an environment variable (a unique name) and the path to the file.

# YAML Header
env_variables:
  FILE: 'gs://path/file.txt'
  FILE2: 'gs://path/file2.txt'

After you have created your YAML file, you can store it as a secret and then set the secret name and your working project in your environment.

gcloud secrets versions add my-yaml-config --data-file=/PATH/TO/YAML/file_key.yaml
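
Note that gcloud secrets versions add assumes the secret my-yaml-config already exists; if this is the first time storing the config, the secret can be created with an initial version in a single step:

gcloud secrets create my-yaml-config --data-file=/PATH/TO/YAML/file_key.yaml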

export SECRET_NAME="my-yaml-config"
export GOOGLE_CLOUD_PROJECT="your_project_name"

As an alternative, you can also pass these variables directly in your Dataproc job submission script.

gcloud dataproc jobs submit pyspark --cluster=cluster_name \
    --region us-west1 \
    --properties="spark.executorEnv.GOOGLE_CLOUD_PROJECT=your_project_name,spark.executorEnv.SECRET_NAME=my-yaml-config" \
    test_secret.py
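
One thing to keep in mind is that spark.executorEnv.* properties only set environment variables on the executors; if the secret lookup happens in the driver process, the variables may not be visible there. As a workaround (a sketch assuming the property names shown above), the values can be read back from the Spark configuration instead:

from pyspark import SparkContext

# Fall back to Spark properties when the environment variables are not set on the driver
sc = SparkContext.getOrCreate()
project_id = sc.getConf().get('spark.executorEnv.GOOGLE_CLOUD_PROJECT')
secret_name = sc.getConf().get('spark.executorEnv.SECRET_NAME')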

To load Secret Manager and work with it in your Python scripts, you can add a block of code similar to the following:

# Load secret YAML

import yaml
from google.cloud import secretmanager
import os

def get_yaml_config_from_secret():
    project_id = os.getenv('GOOGLE_CLOUD_PROJECT')
    secret_name = os.getenv('SECRET_NAME')

    if not project_id or not secret_name:
        raise ValueError("Project ID or Secret Name is not set in environment variables.")

    client = secretmanager.SecretManagerServiceClient()
    # Resource name for the latest version of the secret holding the YAML config
    secret_version_path = f"projects/{project_id}/secrets/{secret_name}/versions/latest"

    # Fetch and decode the secret payload (the YAML text itself)
    response = client.access_secret_version(name=secret_version_path)
    secret_content = response.payload.data.decode("UTF-8")

    return yaml.safe_load(secret_content)

config = get_yaml_config_from_secret()

# Define variables
FILE = config['env_variables']['FILE']
FILE2 = config['env_variables']['FILE2']
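
Once the config is loaded, these variables can be used anywhere a hardcoded path previously appeared, for example (a hypothetical usage in a Hail-based workflow):

import hail as hl

# Hypothetical downstream use of a secret-managed path
hl.init()
qc_table = hl.import_table(FILE, impute=True)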

As mentioned above, both the script that generates the QC stats table and the script that generates the sample dashboard use these steps to allow paths to be specified with Secret Manager. Both files are currently available to review in our Git repo at create_aggregation_qc_dashboard_section and create_sample_dashboard.

Conclusion

While I have successfully incorporated Secret Manager into these scripts, I have yet to get them to run on Dataproc. The main reason for this is the complex nature of installing packages on the Dataproc image. I am currently working with ScalSec to make sure the package is added to the FedRamp Dataproc image we are currently testing, so this feature will be available in our new environment.
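
In the meantime, one possible interim approach (a sketch we have not validated on the FedRamp image) is to install the client library at cluster-creation time using Google's pip-install initialization action:

gcloud dataproc clusters create cluster_name \
    --region us-west1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-us-west1/python/pip-install.sh \
    --metadata PIP_PACKAGES="google-cloud-secret-manager"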

I have also begun to incorporate this methodology into additional scripts. Recently, I updated our PCA and GWAS methods to incorporate new phenotypes and updated sample mapping. Previously, we would simply strip our code of any project and bucket information. For this pull request, I have not only removed sensitive information but also included the ability to load Secret Manager and specify files with their secret environment variables. The result is scripts that not only have their hardcoded paths stripped, but that can also be run directly from our Git repo without adding any path information. The long-term goal is to have this feature incorporated into all of our scripts when we begin the next data release freeze.

Join the discussion here!