#51 Upgrading Trellis Cloud Functions
What are Cloud Functions?
Cloud Functions are part of Google Cloud Platform’s suite of services. The generic term for these is ‘serverless functions,’ and all major cloud providers offer similar services: on Amazon Web Services (AWS), they are called ‘Lambda Functions’; on Azure, ‘Azure Functions’; and so on.
Serverless functions differ from traditional functions in several important ways. Traditionally, functions existed within a larger body of code, such as a script, library, or class. They referenced state (variables) within the context of that larger body of code. These functions were called by other functions or directly within the script.
Serverless functions, by contrast, exist independently, outside any larger body of code. They are not part of a class or library and do not reference state within such a context. Nor are they called by another function, class, or script. Instead, they operate as standalone functions in practice—for example, responding to external triggers such as HTTP requests.
Serverless functions are typically triggered by events. When defining a function on a cloud platform, it is associated with a specific event. When this event occurs, the function is executed, with details about the triggering event passed as input. Examples of triggering events include HTTP requests, files written to Cloud Storage, or messages added to a Publish/Subscribe queue or topic. While serverless functions are not stateless, any state they use is derived from the triggering event.
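To make this concrete, here is a minimal sketch (a generic illustration, not Trellis code) of a 1st-gen Python Cloud Function triggered by a Cloud Storage event. The platform calls the entry point with the event payload and its metadata, and everything the function knows comes from those two arguments:
def handle_new_object(event, context):
    # `event` is a dict describing the object that triggered the function;
    # `context` carries metadata such as the event ID, type, and timestamp.
    bucket = event["bucket"]
    name = event["name"]
    print(f"Received {context.event_type} for gs://{bucket}/{name}")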
The term ‘serverless’ is somewhat misleading; it means the functions are not tied to a specific server. Instead, a temporary, ephemeral server is created to host the function whenever it is executed.
The difference between traditional and serverless functions parallels the difference between monolithic and microservices architectures. In a monolithic architecture, all code resides in one or several executables hosted on a single server or a small cluster. Scaling is achieved vertically, by upgrading machines, or horizontally, by adding servers, but neither process is easily automated. In contrast, a microservices architecture allows automatic scaling of individual services or functions in response to demand, leveraging the cloud’s extensive commodity hardware.
What do Cloud Functions do in Trellis?
The Cloud Functions are the core of Trellis’s business logic. As the Trellis paper puts it, “The application follows a microservice architecture where each service is implemented as a serverless function and the state of the system is tracked using metadata stored in a Neo4j property graph database.”
They coordinate interactions between the major subsystems of Trellis: the file system (Cloud Storage), the database (Neo4j), and the bioinformatics pipelines, orchestrated by dsub and, in the case of the GATK $5 pipeline, by Cromwell.
What needs to be updated?
The Python runtime
When you look at the Cloud Functions in the Cloud Console, you will notice a yellow warning flag by the Runtime column for those still using Python 3.7. Hovering over the flag, a notice appears:
Python 3.7 will be deprecated on Jan 30, 2024. Please update this function using the latest runtime version available.
What happens if we don’t upgrade?
Given the looming deprecation when I began working on upgrading Trellis in early January 2024, I asked what exactly would happen once the runtime was deprecated: Would the functions stop working? Or would the runtime simply stop receiving security patches? According to our technical account manager at Google:
After the deprecation date, any functions you have running Python 3.7 will continue to exist and function as intended. However, deprecation means that the Python 3.7 runtime will no longer be actively maintained, and so it will not receive any important security or maintenance updates after the deprecation date. While the effects of this will probably not be immediate, it is highly advised you begin upgrading your functions to a supported runtime as soon as you are able to.
In any case, the exact effects of the deprecation are moot: FedRAMP authorization requires Trellis to use a non-deprecated runtime precisely so that it keeps receiving important security updates.
Which runtime should we upgrade to?
The latest version of Python was v3.13, but I had reservations about being on the bleeding edge, so I checked which runtimes Google Cloud Platform would support:
% gcloud functions runtimes list --region us-west1
NAME       STAGE       ENVIRONMENTS
...
python37   DEPRECATED  1st gen
python38   DEPRECATED  1st gen, 2nd gen
python39   GA          1st gen, 2nd gen
python310  GA          1st gen, 2nd gen
python311  GA          1st gen, 2nd gen
python312  GA          1st gen, 2nd gen
...
So the latest runtime wasn’t even supported yet by GCP, which definitely ruled it out. The second-to-latest runtime, v3.12, was supported; however, I chose the one before that, v3.11, based on ChatGPT’s advice when I ran into an error importing the Python libraries six and pkg_resources (more on that later):
If you’re using Python 3.13, note that many libraries might not yet fully support this version. Downgrade to Python 3.10 or 3.11 for better compatibility.
Why use Python 3.11 instead of 3.12?
The recommendation to use Python 3.10 or 3.11 instead of 3.12 (or 3.13 in your case) is due to compatibility concerns with third-party libraries. Here’s why:
Why Not Python 3.12?
- New Release Cycle:
  - Python 3.12 was officially released recently (October 2023), and many libraries might not yet be fully tested or updated to support it.
  - Some dependencies, particularly older or less actively maintained libraries, might still specify compatibility with versions up to 3.11.
- Library Ecosystem Lag:
  - Popular libraries like google-cloud-* and others often take a few months to update their packages to ensure compatibility with new Python versions. The error involving pkg_resources and six suggests the library ecosystem for Google Cloud libraries isn’t fully ready for Python 3.12+.
- Stability:
  - If you’re working on a project where reliability is critical, sticking to an older, well-supported version of Python (such as 3.10 or 3.11) ensures stability.
  - These versions are in their “long tail” phase, meaning they receive only critical bug and security fixes, making them highly stable.
Why Use Python 3.10 or 3.11?
- Wide Compatibility:
  - Almost all actively maintained libraries support Python 3.10 and 3.11, making them safe choices.
  - Python 3.10 introduced structural pattern matching and other useful features, while 3.11 brought performance improvements, so you’re not missing out on modern capabilities.
- Google Cloud Support:
  - Google Cloud libraries officially support Python 3.10 and 3.11. While they may work on 3.12, it’s less certain and could lead to subtle issues.
- Tested Environment:
  - These versions have been rigorously tested in production environments, ensuring your cloud functions work seamlessly.
That seemed like sensible advice, so I stuck with 3.11.
Imported libraries
Trellis’s Cloud Functions utilize a variety of Python libraries to perform their tasks. The specific version of each library is defined in a file named requirements.txt, which Python’s package manager, pip, uses as a list of dependencies to install.
Why is it necessary to upgrade libraries in Trellis’s Cloud Functions if the functions’ code hasn’t changed? If the libraries worked when the functions were last updated (circa 2021), why might they no longer work today? After all, we are not utilizing any new features introduced in later versions of these libraries.
The answer lies in the dual purpose of library upgrades: they are both forward- and backward-looking. Upgrades introduce new features while also refining or fixing existing code. For Trellis, particularly in the context of FedRAMP compliance, the latter is our primary concern. Library upgrades are often published in response to security vulnerabilities and bugs, to ensure compatibility with other libraries, and to address other issues. In some cases, a library may no longer be maintained and must be replaced by a different one. For Trellis, upgrading to newer library versions is primarily about securing the system’s code and maintaining its integrity.
The upgrade
Upgrading the Cloud Functions’ runtimes
Trellis uses a Google Cloud Platform service called Cloud Build to automatically build and deploy Cloud Functions when the code for the function is updated in the GitHub repository.
In Trellis’s Cloud Functions repository, each function lives in an eponymous subdirectory under /functions/, along with a file named cloudbuild.yaml specifying the steps required to build the function, including the Python runtime to use.
As an example, here is the cloudbuild.yaml file for create-blob-node:
steps:
- name: 'ubuntu'
args: ['cp', '-r', 'config/${_DATA_GROUP}', 'functions/create-blob-node/']
- name: 'gcr.io/cloud-builders/gcloud'
args: [
'beta',
'functions',
'deploy',
'trellis-create-blob-node-${_BUCKET_SHORT_NAME}-${_OPERATION_SHORT_NAME}',
'--project=${PROJECT_ID}',
'--source=functions/create-blob-node',
'--memory=128Mi',
'--max-instances=100',
'--entry-point=create_node_query',
'--runtime=python311',
'--trigger-resource=${_TRIGGER_RESOURCE}',
'--trigger-event=google.storage.object.${_TRIGGER_OPERATION}',
'--trigger-location=us-west1',
'--update-env-vars=CREDENTIALS_BUCKET=${_CREDENTIALS_BUCKET}',
'--update-env-vars=CREDENTIALS_BLOB=${_CREDENTIALS_BLOB}',
'--update-env-vars=ENVIRONMENT=${_ENVIRONMENT}',
'--update-env-vars=TRIGGER_OPERATION=${_TRIGGER_OPERATION}',
'--update-env-vars=GIT_COMMIT_HASH=${SHORT_SHA}',
'--update-env-vars=GIT_VERSION_TAG=${TAG_NAME}',
'--update-labels=trigger-operation=${_OPERATION_SHORT_NAME}',
'--update-labels=trigger-resource=${_TRIGGER_RESOURCE}',
# Fix for logging issue: https://issuetracker.google.com/issues/155215191#comment112
'--update-env-vars=USE_WORKER_V2=true',
'--update-labels=user=trellis',
]
options:
logging: CLOUD_LOGGING_ONLY
This means that changing the Python runtime is as simple as replacing --runtime=python37 in cloudbuild.yaml with --runtime=python311, then committing and pushing the change to the origin repo.
Troubleshooting
I started troubleshooting one function at a time, even though I suspected many of the functions would suffer from the same problems, since they repeated a lot of the same code. I figured once I had fixed the problem in one function, I could apply the fix in bulk to all the functions that suffered from the same problem.
Error #1: KeyError: ‘FUNCTION_NAME’
This error occurred in the following section of code in main.py of the function create-blob-node. Identical code appeared in many of the functions:
if ENVIRONMENT == 'google-cloud':
FUNCTION_NAME = os.environ['FUNCTION_NAME']
The code above tries to retrieve a value from a key-value dictionary, indexed by the key. KeyError means the key provided was not found in the dictionary. Normally, this would indicate a programming error: the key was misspelled, the key-value pair hadn’t been set yet or had been deleted from the dictionary, or something along those lines.
However, several things weighed against that conclusion. First, the function had been building and running successfully before the runtime change. Second, os.environ is part of Python’s os (operating system) module, and the environment in this case was the GCP runtime environment, meaning this was not a key-value pair defined by the user.
There are two ways environment variables can be defined in the GCP runtime environment: by the user and by GCP. In the Cloud Function definition, it is possible to set environment variables (in fact, the cloudbuild.yaml code above does exactly this from the command line). But FUNCTION_NAME was not, and had not previously been, set by the user, so it must at one time have been set by GCP itself. Common sense also suggests that setting the function name is something the platform itself could easily and automatically do, rather than making the user do extra work by creating an environment variable in each function’s definition just to specify the function’s name.
Following that line of thinking, I started looking into how these values were set by GCP, and this was harder than I expected. I expected to find a section of the documentation listing which environment variables were set by GCP, and I did. However, it took a bit more digging to find an archived version of the Cloud Functions documentation listing which environment variables used to be set by GCP but no longer were, and what had replaced them. It turns out the environment variable FUNCTION_NAME was provided by Google in the Python 3.7 runtime, but not in subsequent runtimes. According to current documentation, its replacement in later runtimes is FUNCTION_TARGET (strictly speaking, FUNCTION_TARGET holds the entry point’s name and K_SERVICE holds the function’s name, but either serves to identify the function).
Once I replaced FUNCTION_NAME with FUNCTION_TARGET, the error stopped. The same error recurred with the key GCP_PROJECT, which had been replaced by GOOGLE_CLOUD_PROJECT, but this time the solution came much faster. Since this problem affected many of the functions, I scripted a bulk replacement of the legacy keys:
# From the functions/ directory:
find . -mindepth 2 -maxdepth 2 -type f -name 'main.py' |\
xargs sed -i -e 's/FUNCTION_NAME/FUNCTION_TARGET/g' -e 's/GCP_PROJECT/GOOGLE_CLOUD_PROJECT/g'
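An alternative to a global search-and-replace (a sketch of an option I did not take, not what Trellis does) would be a small compatibility shim that prefers the newer variables and falls back to the legacy ones, so the same code runs on either runtime:
import os

# Prefer the variables set by newer runtimes; fall back to the Python 3.7 names.
FUNCTION_NAME = os.environ.get("FUNCTION_TARGET") or os.environ.get("FUNCTION_NAME", "")
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT") or os.environ.get("GCP_PROJECT", "")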
Error #2: ModuleNotFoundError: No module named ‘yaml’
The offending line of code was at the very beginning of the function:
import yaml
The library was used a bit farther down, in loading the trellis.yaml file containing the database credentials:
vars_blob = storage.Client() \
.get_bucket(os.environ['CREDENTIALS_BUCKET']) \
.get_blob(os.environ['CREDENTIALS_BLOB']) \
.download_as_string()
parsed_vars = yaml.load(vars_blob, Loader=yaml.Loader)
I first checked the requirements.txt file to see which YAML library was specified for Cloud Build to install, but none was. However, the code had worked previously, which led me to believe that, like the KeyError above, this error was caused by something implicitly provided by GCP in Python 3.7 that was no longer being provided.
Then the question became which YAML library the code had been using. Since the library wasn’t named in the code, all I had to go on were the fields and methods actually called in the code. I had a hunch it was PyYAML, since it is one of the most popular YAML libraries, so I opened a Python shell and checked for a few of the attributes I had seen in the Cloud Functions’ code:
% python3.11 -m pip install pyyaml
Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/site-packages (6.0.2)
% python3.11
Python 3.11.11 (main, Dec 3 2024, 17:20:40) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> yaml.Loader
<class 'yaml.loader.Loader'>
>>> yaml.load
<function load at 0x105164220>
That was enough to settle my doubts, so I added pyyaml to requirements.txt as an explicit requirement.
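One aside: yaml.load with Loader=yaml.Loader can construct arbitrary Python objects from tags embedded in the document, which is why PyYAML’s documentation recommends yaml.safe_load for data that is merely configuration. Since trellis.yaml comes from a bucket Trellis controls, the exposure is limited, but the safer call would be a drop-in replacement here:
parsed_vars = yaml.safe_load(vars_blob)  # parses plain YAML without arbitrary object construction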
Error #3: Invalid value specified for container memory
This error was caused by the parameter '--memory=128MB' in the original cloudbuild.yaml:
ERROR: (gcloud.beta.functions.deploy) ResponseError: status=[400], code=[Ok],
message=[Could not create Cloud Run service
trellis-create-blob-node-from-personalis-final.
spec.template.spec.containers[0].resources.limits.memory: Invalid value
specified for container memory. For 0.083 CPU, memory must be between 128Mi
and 512Mi inclusive.
The difference between 128 MB and 128 Mi may not be obvious at first, even to people in technology. Megabytes (MB) and mebibytes (MiB, abbreviated Mi in Cloud Run resource limits) are similar quantities, but they are defined slightly differently:
1 MB  = 10^6 bytes = 1,000,000 bytes
1 MiB = 2^20 bytes = 1,048,576 bytes
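A quick check in a Python shell confirms that 128 MB falls short of 128 Mi:
>>> 128 * 10**6                # 128 MB in bytes
128000000
>>> 128 * 2**20                # 128 Mi in bytes
134217728
>>> 128 * 10**6 < 128 * 2**20  # so 128 MB is below the 128 Mi minimum
True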
Megabytes are by far the more commonly used unit and have the nice property of being easily remembered round numbers, but mebibytes are powers of 2 and therefore more natural representations of memory and storage sizes. In the case of our Cloud Functions, 128 MB is less than 128 Mi, so the requested memory was not just in the wrong units but below the acceptable range of 128 Mi to 512 Mi. I changed 128MB to 128Mi, and the error cleared.
Upgrading the required libraries
Compiling a list of specific library versions that are mutually compatible is a non-trivial task. Fortunately, there are specialized tools designed to do it quickly and accurately. The challenge, broadly speaking, lies in the fact that the libraries required by the code at hand often depend not only on other libraries but on specific versions of those libraries. Sometimes, these version requirements are mutually incompatible or only compatible within certain version ranges.
For example, library A might require version 2.0 or higher of library B, while library C requires version 1.9 or lower of library B. In this case, the requirements of library A and library C are mutually incompatible. More commonly—and less problematically—library A may require library B version 2.0 or higher, while library C requires version 2.2 or higher. Here, the resolved requirement would be the range of versions acceptable to both, which is version 2.2 or higher.
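The packaging library, which pip itself builds on, makes this resolution easy to demonstrate. A small sketch using the hypothetical libraries A, B, and C from the compatible case above:
from packaging.specifiers import SpecifierSet

a_needs_b = SpecifierSet(">=2.0")  # library A's requirement on library B
c_needs_b = SpecifierSet(">=2.2")  # library C's requirement on library B

combined = a_needs_b & c_needs_b   # intersection of the two version ranges
print("2.1" in combined)           # False: satisfies A but not C
print("2.2" in combined)           # True: satisfies both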
You can imagine how tedious and error-prone it would be to work this out manually, especially since each library often depends on lower-level libraries, which in turn depend on even lower-level libraries, cascading all the way down to the standard library included with the Python runtime. That’s where a tool like pip-compile comes in. It resolves these dependencies for you and produces a mutually compatible list of libraries with specified (or “pinned”) versions.
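(One practical note: pip-compile is not part of pip itself; it ships with the pip-tools package, so a pip install pip-tools inside your virtual environment has to come first.)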
This is, incidentally, why developing in a virtual environment is a good idea. A virtual environment allows you to fulfill the requirements of a specific project in isolation, preventing your local system from becoming cluttered with different library versions that are likely to clash and cause broken code and confusion—commonly referred to as “dependency hell.”
Here’s how to create a virtual environment for testing library installations:
% python3.11 -m venv venv
% source venv/bin/activate
% pip install -r requirements.txt
...
% deactivate
This creates a new virtual environment using the Python 3.11 runtime in a directory named venv within the project folder. The virtual environment includes only the libraries specified in requirements.txt, along with their dependencies. Virtual environments are isolated, per-project environments, separate from those of other projects and from your computer’s system-wide Python environment. Activating the virtual environment temporarily “enters” it, while deactivating it “exits” it.
I started out with create-blob-node, as before. Its requirements.txt was originally as follows:
pytz==2018.7
iso8601==0.1.12
google-cloud-storage==1.15.0
google-cloud-pubsub==0.40.0
I removed the version numbers, which causes pip to default to the latest available version, and then ran pip-compile:
(create-blob-node) % pip-compile --output-file requirements.harmonized.txt requirements.txt
#
# This file is autogenerated by pip-compile with Python 3.11
# by the following command:
#
# pip-compile --output-file=requirements.harmonized.txt requirements.txt
#
cachetools==5.5.0
# via google-auth
certifi==2024.8.30
# via requests
charset-normalizer==3.4.0
# via requests
deprecated==1.2.15
# via
# opentelemetry-api
# opentelemetry-semantic-conventions
google-api-core[grpc]==2.23.0
# via
# google-cloud-core
# google-cloud-pubsub
# google-cloud-storage
google-auth==2.36.0
# via
# google-api-core
# google-cloud-core
# google-cloud-pubsub
# google-cloud-storage
google-cloud-core==2.4.1
# via google-cloud-storage
google-cloud-pubsub==2.27.1
# via -r requirements.txt
google-cloud-storage==2.19.0
# via -r requirements.txt
google-crc32c==1.6.0
# via
# google-cloud-storage
# google-resumable-media
google-resumable-media==2.7.2
# via google-cloud-storage
googleapis-common-protos[grpc]==1.66.0
# via
# google-api-core
# grpc-google-iam-v1
# grpcio-status
grpc-google-iam-v1==0.13.1
# via google-cloud-pubsub
grpcio==1.68.1
# via
# google-api-core
# google-cloud-pubsub
# googleapis-common-protos
# grpc-google-iam-v1
# grpcio-status
grpcio-status==1.68.1
# via
# google-api-core
# google-cloud-pubsub
idna==3.10
# via requests
importlib-metadata==8.5.0
# via opentelemetry-api
iso8601==2.1.0
# via -r requirements.txt
opentelemetry-api==1.28.2
# via
# google-cloud-pubsub
# opentelemetry-sdk
# opentelemetry-semantic-conventions
opentelemetry-sdk==1.28.2
# via google-cloud-pubsub
opentelemetry-semantic-conventions==0.49b2
# via opentelemetry-sdk
proto-plus==1.25.0
# via
# google-api-core
# google-cloud-pubsub
protobuf==5.29.1
# via
# google-api-core
# google-cloud-pubsub
# googleapis-common-protos
# grpc-google-iam-v1
# grpcio-status
# proto-plus
pyasn1==0.6.1
# via
# pyasn1-modules
# rsa
pyasn1-modules==0.4.1
# via google-auth
pytz==2024.2
# via -r requirements.txt
pyyaml==6.0.2
# via -r requirements.txt
requests==2.32.3
# via
# google-api-core
# google-cloud-storage
rsa==4.9
# via google-auth
typing-extensions==4.12.2
# via opentelemetry-sdk
urllib3==2.2.3
# via requests
wrapt==1.17.0
# via deprecated
zipp==3.21.0
# via importlib-metadata
As you can see, a handful of top-level requirements cascades into quite a number of lower-level requirements (the entries annotated # via -r requirements.txt are the top-level requirements; the others are libraries they themselves require). Installing the required packages and their dependencies looked like this:
(create-blob-node) % pip install -r requirements.harmonized.txt
...
Installing collected packages: pytz, zipp, wrapt, urllib3, typing-extensions, pyyaml, pyasn1, protobuf, iso8601, idna, grpcio, google-crc32c, charset-normalizer, certifi, cachetools, rsa, requests, pyasn1-modules, proto-plus, importlib-metadata, googleapis-common-protos, google-resumable-media, deprecated, opentelemetry-api, grpcio-status, google-auth, opentelemetry-semantic-conventions, grpc-google-iam-v1, google-api-core, opentelemetry-sdk, google-cloud-core, google-cloud-storage, google-cloud-pubsub
Successfully installed cachetools-5.5.0 certifi-2024.8.30 charset-normalizer-3.4.0 deprecated-1.2.15 google-api-core-2.23.0 google-auth-2.36.0 google-cloud-core-2.4.1 google-cloud-pubsub-2.27.1 google-cloud-storage-2.19.0 google-crc32c-1.6.0 google-resumable-media-2.7.2 googleapis-common-protos-1.66.0 grpc-google-iam-v1-0.13.1 grpcio-1.68.1 grpcio-status-1.68.1 idna-3.10 importlib-metadata-8.5.0 iso8601-2.1.0 opentelemetry-api-1.28.2 opentelemetry-sdk-1.28.2 opentelemetry-semantic-conventions-0.49b2 proto-plus-1.25.0 protobuf-5.29.1 pyasn1-0.6.1 pyasn1-modules-0.4.1 pytz-2024.2 pyyaml-6.0.2 requests-2.32.3 rsa-4.9 typing-extensions-4.12.2 urllib3-2.2.3 wrapt-1.17.0 zipp-3.21.0
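As a final sanity check inside the virtual environment, running pip check verifies that every installed package’s declared dependencies are satisfied; it reports no problems when the set is consistent.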
Applying the upgrades in bulk
Having figured out, using create-blob-node, how to make the changes common to all the Cloud Functions, I could apply those same changes to all the functions at once, in a mass fix:
- In cloudbuild.yaml, replace Python 3.7 with 3.11, get rid of --no-gen2, and change 128MB to 128Mi:

  find . -mindepth 2 -maxdepth 2 -type f -name 'cloudbuild.yaml' |\
  xargs sed -i '.bk' -e 's/python37/python311/' -e '/--no-gen2/d' -e 's/128MB/128Mi/'
- In requirements.txt, replace pinned versions with unpinned ones and append pyyaml:

  find . -mindepth 2 -maxdepth 2 -type f -name 'requirements.txt' |\
  xargs sed -i '.bk' -E -e 's/[<>=]+.*//'
  find . -type f -name requirements.txt -print0 |\
  xargs -0 -I {} sh -c 'echo "pyyaml" >> "{}"'
- Harmonize the libraries’ version requirements using pip-compile:

  % for dir in *(/); do
      rm $dir/*bk
      mv $dir/requirements.txt $dir/requirements.txt.bk
      pip-compile $dir/requirements.txt.bk --output-file $dir/requirements.txt
  done
After committing the changes to GitHub and letting Cloud Build build and deploy the functions, most of them succeeded; at this point, only 5 of the 27 functions remain to be upgraded.