#46 Polygenic Risk Scores & AI: Configuring pgsc_calc with ChatGPT
Polygenic Risk Scores & AI: Configuring pgsc_calc with ChatGPT
Recently, Rodrigo introduced our group to a software for calculating polygenic risk scores, pgsc_calc. I developed an interest in learning more about this software and running it on our Google Cloud architecture. My goal is to make this feature available to Rodrigo and the rest of our group for use with our data.
ChatGTP is an AI designed to respond to user input, typically in a chat box. Our internal group at Stanford has shown increasing interest in the applications of ChatGTP in our field. Owing to the challenges associated with installing bioinformatics software, I wanted to leverage ChatGTP to get pgsc_calc operational, while also getting a feel for the functionality of ChatGTP. The following text details the results of my discussions with ChatGTP.
Installation and Test Guide for PGSC_Calc with Nextflow on MacOS
Install Java on MacOS
First, check if Java is already installed:
java -version
If Java is already installed, this command will show the current Java version.
If Java is not installed:
- Visit Oracle’s Java SE Downloads Page
- Download the package relevant to MacOS
- After downloading, install Java by clicking on the downloaded file and following the instructions
Install Nextflow
Nextflow requires Unix-like operating system (Linux, Solaris, OS X, etc) and bash shell.
Confirm that Java v8 or later is installed on your computer:
java -version
Enter the following command to download and install Nextflow:
curl -s https://get.nextflow.io | bash |
This will create the Nextflow main executable file in the current directory. Optionally, move the Nextflow binary to a folder that’s included in your $PATH
:
echo $PATH
If /usr/local/bin/
is included in the output, move Nextflow to /usr/local/bin/
:
sudo mv nextflow /usr/local/bin/
Verify the installation using:
nextflow run hello.
Install Docker on MacOS
Docker can be installed on MacOS using Docker Desktop.
- Visit Docker’s Desktop for Mac installation page
- Download the Docker Desktop installer
- Open the downloaded Docker.dmg file and drag the Docker app to the Applications folder. Double click Docker.app to start Docker Desktop
- Confirm Docker Desktop is running properly by clicking on the Docker icon in your menu bar. It should show “Docker Desktop is running”
Running PGSC_Calc Pipeline Locally
Once Nextflow and Docker are successfully installed on your machine, you can run the PGSC_Calc pipeline locally using Docker.
To simply test the installation using the PGSC_Calc’s provided test profile, use the following Nextflow command:
nextflow run pgscatalog/pgsc_calc -profile test,docker
Please replace or adjust command parameters based on your specific data and requirements.
This guide should suffice to get you started running pgsc_calc
with Nextflow and Docker on a MacOS. Remember to handle your Docker credentials securely, if you’re working with private Docker images, and always manage your data securely.
Comments on running pgsc_calc locally
In the text above I set up my workstation to use a public Docker image of pgsc_calc. The steps provided included installing all necessary dependencies. This includes nextflow, which provides a simple way to run dockerized containers in environments such as Google Cloud. The git repo containing the public docker image also includes the two files necessary to run Nextflow - main.nf and nextflow.config. These two files represent the script to run the pipeline and the configuration file containing information on how to run it, respectively. In running this, I also added a local nextflow.config file to allow output data to be exported to my local machine. When running nextflow, it will prioritize the local configuration file, bypassing the one provided on the git repo. The configuration file contains one specification that provides the output path on my local machine. The file I created looks like this:
Replace Google Cloud Storage bucket with local directory path
workDir = '/Users/jsarro/pgsc-calc-pipeline/pgsc-calc-pipeline/local_output
The test run was successful using the provided test input. I next wanted to see if I could get pgsc_calc to run on Google Cloud through my workstation commands. I found there are three options for this; run on a virtual machine, run using Kubernetes, and run using Google Life Sciences. Because we are moving away from virtual machines due to FedRAMP restrictions, Kubernetes would be the best option moving forward. While Daniel works on getting Kubernetes configured, I wanted to see if I could run the test data using Google Life Sciences and then run a test using our own data. The following are the two options I worked out with ChatGPT for using either a public docker image (like above) or creating your own docker image.
Configuring and Running PGSC-CALC in Nextflow Using Google Life Sciences
The container image in the Nextflow configuration specifies the Docker image that will be used to run the processes in your workflow. This image should include all the necessary tools and dependencies required by the PGSC-CALC pipeline.
For PGSC-CALC, you can either use a publicly available image that contains the necessary tools or build your own Docker image with the required software. Here is how you can proceed with each option:
Option 1: Use a Publicly Available Image
If there is a publicly available Docker image that fits your needs, you can specify it directly in your nextflow.config
. For example:
process {
executor = 'google-lifesciences'
container = 'broadinstitute/gatk:4.1.8.1' // Replace with an appropriate image
}
google {
project = 'YOUR_PROJECT'
region = 'YOUR_REGION'
}
workDir = 'gs://YOUR_BUCKET_NAME/work'
Option 2: Build Your Own Docker Image
If you need a custom Docker image with specific tools and dependencies, you can create a Dockerfile and build your own image. Here’s an example Dockerfile for PGSC-CALC:
## Use a base image with Ubuntu
FROM ubuntu:20.04
## Install required packages
RUN apt-get update && apt-get install -y \
openjdk-11-jdk \
python3 \
python3-pip \
git \
curl \
&& apt-get clean
## Install Nextflow
RUN curl -s https://get.nextflow.io | bash \
&& mv nextflow /usr/local/bin/
## Install other dependencies for PGSC-CALC
## Add specific installation commands for the tools required by PGSC-CALC
## Set the default command
CMD ["nextflow", "-version"]
Build and push your Docker image to Google Container Registry:
-
Build the Docker Image:
docker build -t gcr.io/YOUR_PROJECT_ID/pgsc-calc:latest .
-
Push the Docker Image:
docker push gcr.io/YOUR_PROJECT_ID/pgsc-calc:latest
Update Your Nextflow Configuration
After building or choosing the Docker image, update your nextflow.config
to use this image:
process {
executor = 'google-lifesciences'
container = 'gcr.io/YOUR_PROJECT_ID/pgsc-calc:latest'
}
google {
project = 'YOUR_PROJECT'
region = 'YOUR_REGION'
}
workDir = 'gs://YOUR_BUCKET_NAME/'
Example of Running PGSC-CALC
After setting up the container image, you can run the PGSC-CALC workflow with Nextflow:
nextflow run pgsc-calc.nf -params-file params.input
Summary
- Container Image: This is a Docker image that contains all the tools and dependencies required to run your workflow.
- Public Image: Use a pre-built public image if it meets your needs.
- Custom Image: Build your own Docker image if specific tools or versions are required.
- Nextflow Configuration: Update
nextflow.config
to specify the container image and Google Cloud settings.
By following these steps, you should be able to configure and run Nextflow with PGSC-CALC using Google Life Sciences.
Comments on Running pgsc_calc in the Cloud
I tried using both of the above methods. I became interested in generating a custom Docker image for two reasons: First, in case we needed to modify any configurations, and second, to mitigate potential issues with FedRAMP when using a public Docker image. I discovered that some of the configurations required updates for my local system. Below are some of the changes I made with ChatGPT.
Clone the pgs_calc Repository
git clone https://github.com/PGScatalog/pgsc_calc.git
cd pgsc_calc
Create a Dockerfile
Create a Dockerfile in the pgsc_calc
directory with the following content:
## Use the official Ubuntu base image
FROM ubuntu:latest
## Install required system dependencies
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
python3-venv \
git \
&& apt-get clean
## Set the working directory
WORKDIR /app
## Clone the PGSC-Calc repository
RUN git clone https://github.com/PGScatalog/pgsc_calc.git /app/pgsc_calc
## Create a virtual environment and activate it
RUN python3 -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
## Copy requirements.txt from the cloned repository into the Docker image
COPY pgsc_calc/requirements.txt /app/pgsc_calc/requirements.txt
## Install Python packages specified in requirements.txt within the virtual environment
RUN pip3 install --no-cache-dir -r /app/pgsc_calc/requirements.txt
## Make port 80 available to the world outside this container
EXPOSE 80
## Command to run the application
CMD ["/bin/bash"]
Build the Docker Image
docker build -t my_nextflow_image .
Set Up Google Cloud SDK and Authenticate
- Install Google Cloud SDK: Follow the instructions on Google Cloud SDK installation.
- Authenticate with Google Cloud:
gcloud auth login
gcloud config set project YOUR_PROJECT
Modify Nextflow Configuration to Use Google Cloud Storage
Update your nextflow.config file with the following content:
process {
executor = 'google-lifesciences'
container = 'my_nextflow_image' // Use your Docker image
}
google {
project = 'YOUR_PROJECT'
region = 'YOUR_REGION'
}
workDir = 'gs://YOUR_DIRECTORY'
Create the Nextflow Workflow Script
Create a main.nf
file in your project directory with the following content:
#!/usr/bin/env nextflow
params.test_file = '/app/test_file'
params.output_dir = 'gs://YOUR_BUCKET'
process pgs_calc {
container 'my_nextflow_image'
input:
path test_file from params.test_file
output:
path '**' into results
script:
"""
cd /app/pgsc_calc
python3 conftest.py ${test_file}
"""
}
results.view { it }
Run Nextflow
Execute the Nextflow workflow:
nextflow run main.nf -c nextflow.config
This configuration will allow you to run the pgs_calc
application using Nextflow, with outputs written to your specified Google Cloud Storage bucket.
Final comments and future work
I was not able to run pgsc_calc successfully on the Cloud using Google Life Sciences with either the public docker image or a custom-made one. I determined the issue lay in passing my Docker login credentials to Google Life Sciences. I tried storing them in secrets, but my permissions did not allow me to store them for the Google Life Sciences service account. The option of storing them somewhere else remained, but it introduced too many security risks.
Ancestry prediction using the PGScatalog
One of my main interests in using pgsc_calc was testing its ability to predict ancestry in our WGS dataset. I found that this could be achieved without running the pgsc_calc program itself. pgsc_calc uses an ancestry prediction script within PGScatalog-utils. The documentation for this tool is limited, but I have been able to install it locally and have begun looking at how it works. Alternatively, the same group provides another ancestry prediction method in its PCA analysis tool, Fraposa. This tool seems like it would suit our needs better, as it performs both PCA and ancestry prediction, both of which are values our group requires. This ancestry prediction tool could also be a better option, as it is designed to handle genotype data from large genomics datasets. In the near future, I would like to test both of these methods alongside Hail’s methodology on a portion of our dataset.
Join the discussion here!