GCS bucket with JupyterHub on GKE

I started working with JupyterHub on Google Kubernetes Engine (GKE) using zero-to-jupyterhub a couple of months ago. During this time, I have not only resolved multiple issues but also documented various troubleshooting techniques. JupyterHub has many features, and zero-to-jupyterhub provides extensive configuration capabilities to use them. One of the most basic features required for most implementations is shared storage, so that multiple users can effectively collaborate on shared data. The traditional way to create shared storage on GKE is to create persistent storage and configure it for the environment. However, this approach has several limitations. I wanted to overcome them and decided to use a better alternative: a Google Cloud Storage (GCS) bucket. This post describes the limitations of the traditional approach and the steps to use a GCS bucket instead with JupyterHub on GKE.

Prerequisites

  • Basic understanding of Jupyter notebook / JupyterLab
  • Understanding of containerized application architecture
  • Basic understanding of Kubernetes and Helm charts
  • Experience with GCP and GCP services such as GKE, GCS, etc.
  • Experience with JupyterHub setup on GKE using the zero-to-jupyterhub Helm chart

Traditional approach and its limitations

The most common way to configure a shared data store for a JupyterHub instance is to create persistent storage. However, this approach has limitations:

  1. Fixed storage size: The storage size is fixed and must be specified at creation time. If the shared storage ever runs out of space, you have to recreate it with a larger size and migrate the existing data.
  2. Restricted scaling: The approach ties user workloads to the zone of the persistent storage disk. This restricts overall scaling, especially with a regional GKE Autopilot cluster.
  3. Static provisioning: The approach only supports static provisioning. You have to statically provision the persistent storage with a fixed size up front (see the sketch after this list).
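
For contrast, here is a minimal sketch of what the traditional, statically provisioned configuration might look like. The resource names, zone, and size (my-shared-disk, us-central1-a, 100Gi) are purely illustrative, assuming a pre-created zonal persistent disk exposed through a PersistentVolume and PersistentVolumeClaim that is then mounted into user pods.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-data-pv
spec:
  capacity:
    storage: 100Gi                  # size is fixed when the disk is created
  accessModes:
    - ReadWriteOnce                 # the underlying disk is zonal and single-writer
  storageClassName: ""
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/my-gcp-project/zones/us-central1-a/disks/my-shared-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-pvc
  namespace: jupyterhubnamespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ""
  volumeName: shared-data-pv        # bind statically to the PV above
  resources:
    requests:
      storage: 100Gi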

Google Cloud Storage bucket as a shared storage

Using a GCS bucket as shared storage overcomes all the limitations listed above, which is a huge benefit for a JupyterHub setup that needs to support a large user base. GCS FUSE is the most common way to mount a GCS bucket as a drive. The driver has been around for some time, but the biggest recent change is that Google now officially supports it: Google introduced the GCS FUSE CSI driver for GKE, and the driver is enabled by default on new versions of GKE Autopilot clusters. The steps below describe the changes required to use a GCS bucket with JupyterHub on GKE Autopilot using the zero-to-jupyterhub Helm chart.

1: Create a GKE Autopilot cluster with the latest version

gcloud container clusters create-auto my-autopilot-cluster \
    --location=us-central1 \
    --project=my-gcp-project
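
Since this setup relies on the GCS FUSE CSI driver being enabled by default on recent Autopilot versions, you can optionally confirm it on the new cluster. The command below is a sketch; the exact field path may differ depending on your gcloud and GKE API versions.

gcloud container clusters describe my-autopilot-cluster \
    --location=us-central1 \
    --project=my-gcp-project \
    --format="value(addonsConfig.gcsFuseCsiDriverConfig.enabled)"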

2: Create a GCS bucket to use as a shared drive

gcloud storage buckets create gs://my-jupyterhub-shared-data \
    --project=my-gcp-project \
    --location=us-central1

3: Create a workload identity for your JupyterHub release

Create a service account under your GCP project

gcloud iam service-accounts create my-jupyterhub-release \
    --project=my-gcp-project

Assign the Storage Object Admin role to the service account so that it can read from and write to the storage bucket created in step 2

gcloud projects add-iam-policy-binding my-gcp-project \
    --member "serviceAccount:my-jupyterhub-release@my-gcp-project.iam.gserviceaccount.com" \
    --role "roles/storage.objectAdmin"Code language: Shell Session (shell)

Link the IAM service account with the hub Kubernetes service account that will be created by the zero-to-jupyterhub Helm release

gcloud iam service-accounts add-iam-policy-binding my-jupyterhub-release@my-gcp-project.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:my-gcp-project.svc.id.goog[jupyterhubnamespace/hub]"Code language: CSS (css)

4: Prepare config.yaml file for the zero-to-jupyterhub helm release

hub:
  serviceAccount:
    annotations:
      iam.gke.io/gcp-service-account: my-jupyterhub-release@my-gcp-project.iam.gserviceaccount.com
singleuser:
  extraAnnotations:
    gke-gcsfuse/volumes: "true"
  serviceAccountName: hub
  cloudMetadata:
    blockWithIptables: false
  networkPolicy:
    egressAllowRules:
      cloudMetadataServer: true
  storage:
    extraVolumes:
      - name: shareddata
        csi:
          driver: gcsfuse.csi.storage.gke.io
          readOnly: false
          volumeAttributes:
            bucketName: my-jupyterhub-shared-data
            mountOptions: "implicit-dirs,uid=1000,gid=100"
    extraVolumeMounts:
      - name: shareddata
        mountPath: /home/jovyan/shared

Ensure that the workload identity service account from step 3 is set correctly in the hub service account annotation. Also, mountOptions must include uid and gid so that write operations on the shared drive work from user sessions.

5: Install the JupyterHub release on the GKE cluster my-autopilot-cluster using the config.yaml. Use the latest version of the zero-to-jupyterhub Helm chart and ensure that the namespace parameter matches the one used in the workload identity binding, jupyterhubnamespace in this example
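
For reference, an installation along these lines could look like the commands below. The release name my-jupyterhub-release is an arbitrary choice for this example, and you may want to pin a specific chart version with --version instead of taking the latest.

helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
helm repo update
helm upgrade --cleanup-on-fail --install my-jupyterhub-release jupyterhub/jupyterhub \
    --namespace jupyterhubnamespace \
    --create-namespace \
    --values config.yaml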

6: After the successful installation of JupyterHub on your GKE Autopilot cluster, launch a user session. You should see a shared folder under the user's home folder
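
A quick way to verify write access is to open a terminal in the user session and create a test file (the file name here is arbitrary) in the shared folder, then confirm it shows up in the bucket:

# inside the user session terminal
touch /home/jovyan/shared/hello-from-jupyterhub.txt

# from your workstation or Cloud Shell
gcloud storage ls gs://my-jupyterhub-shared-data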

The steps here show the configuration to dynamically mount a GCS bucket as a shared drive for each user session. You can also statically mount a GCS bucket as a shared drive by creating a static PersistentVolume and PersistentVolumeClaim pointing to the bucket, as sketched below.
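
Here is a minimal sketch of that static alternative, based on the static provisioning support of the GCS FUSE CSI driver; the resource names, storage class name, and storage size are placeholders, since a GCS bucket has no real capacity limit. The resulting claim can then be referenced from singleuser.storage.extraVolumes as a persistentVolumeClaim volume instead of the inline csi volume shown above.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: gcs-fuse-shared-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 5Gi                    # placeholder; not enforced for a GCS bucket
  storageClassName: example-storage-class
  mountOptions:
    - implicit-dirs
    - uid=1000
    - gid=100
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-jupyterhub-shared-data   # the bucket name from step 2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-fuse-shared-pvc
  namespace: jupyterhubnamespace
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: gcs-fuse-shared-pv    # bind statically to the PV above
  storageClassName: example-storage-class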