# Simplify Generative AI Model Development on Kubernetes with Datashim
> [!NOTE]
> This tutorial accompanies an article we have published on Medium.
> (Credit to @zioproto, whose YAML for deploying TGI on Kubernetes provided the basis for the TGI deployment shown in this example.)
## Prerequisites
Please read the Medium article we have written to understand the context of this tutorial.
Other than that, there are no prerequisites for following this tutorial: it provides instructions to provision a local S3 endpoint and to store two models in it. If you already have an S3 endpoint and the models, feel free to skip the optional steps, but make sure to update the values in the YAMLs, as they all reference the setup we provide.
## (OPTIONAL) Creating a local object storage endpoint
The YAML we provide provisions a local MinIO instance using hardcoded credentials.
> [!CAUTION]
> Do not use this for any real production workloads!
From this folder, simply run:

```bash
kubectl create namespace minio
kubectl apply -f minio.yaml
kubectl wait pod --for=condition=Ready -n minio --timeout=-1s minio
```
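For reference, here is a minimal sketch of what such a manifest can look like. The pod name matches the `kubectl wait` above, but the image, credentials, and port are illustrative; the provided minio.yaml is the source of truth:

```yaml
# Illustrative sketch -- minio.yaml in this folder is the actual manifest.
apiVersion: v1
kind: Pod
metadata:
  name: minio
  namespace: minio
  labels:
    app: minio
spec:
  containers:
  - name: minio
    image: quay.io/minio/minio
    args: ["server", "/data"]
    env:
    - name: MINIO_ROOT_USER        # hardcoded credentials: acceptable for a demo,
      value: "minio"               # never for production
    - name: MINIO_ROOT_PASSWORD
      value: "minio123"
    ports:
    - containerPort: 9000          # MinIO's S3 API port
```

The real manifest also exposes MinIO through a Service, so that the Datasets we create later can reach it at a stable in-cluster endpoint.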
## Creating the staging and production namespaces
Let us start by creating the staging and production namespaces:
```bash
kubectl create namespace production
kubectl create namespace staging
```
To use Datashim's functionalities, we must also label both namespaces with `monitor-pods-datasets=enabled` so that Datashim can mount volumes in their pods:

```bash
kubectl label namespace production monitor-pods-datasets=enabled
kubectl label namespace staging monitor-pods-datasets=enabled
```
## Creating the Datasets
To access our data, we must first create a `Secret` containing the credentials to access the bucket that holds our data, and then a `Dataset` object that links configuration information to the access credentials.
Run:

```bash
kubectl apply -f s3-secret-prod.yaml
kubectl apply -f dataset-prod.yaml
kubectl apply -f s3-secret-staging.yaml
kubectl apply -f dataset-staging.yaml
```

to create the Secrets holding the access information for our local S3 endpoint, together with the related Datasets described in the "A use case: model development on Kubernetes" section of the article.
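To give an idea of what is being applied, here is a minimal sketch of a Secret/Dataset pair following Datashim's `Dataset` CRD. The names, bucket, and endpoint below are illustrative; the actual values are in the provided YAMLs:

```yaml
# Illustrative sketch -- see s3-secret-prod.yaml and dataset-prod.yaml for the real objects.
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials            # illustrative name
  namespace: production
stringData:
  accessKeyID: "minio"            # matches the demo MinIO credentials
  secretAccessKey: "minio123"
---
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: model-weights             # the name pods use to mount the data
  namespace: production
spec:
  local:
    type: "COS"                   # S3-compatible object storage
    secret-name: "s3-credentials"
    endpoint: "http://minio.minio.svc.cluster.local:9000"  # illustrative endpoint
    bucket: "production"          # illustrative bucket name
    readonly: "false"
```

Because the Dataset in each namespace carries the same name while pointing at a different bucket, pods in production and staging can reference their data identically.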
## (OPTIONAL) Adding models in the object storage
In this tutorial we simulate a development team working with two different models: FLAN-T5-Small will be our "production" model, while the larger, more capable FLAN-T5-Base will be our "staging" model. To load them into our MinIO instance we can run:

```bash
kubectl apply -f download-flan-t5-small-to-minio-prod.yaml
kubectl wait -n production --for=condition=complete job/download-flan --timeout=-1s
kubectl apply -f download-flan-t5-base-to-minio-staging.yaml
kubectl wait -n staging --for=condition=complete job/download-flan --timeout=-1s
```
This creates two Jobs, one per namespace, that download the appropriate model, and waits for their completion. This may take several minutes.
> [!NOTE]
> Using git to clone directly into `/mnt/datasets/model-weights/my-model/` would fail on OpenShift due to the default security policies. Errors such as `cp: can't preserve permissions` that you might see in the pod logs can be safely ignored.
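As a rough sketch of how such a download Job can be structured, assuming the Dataset is named `model-weights` (the image, model path, and LFS handling for the large weight files are glossed over here; the provided YAMLs are authoritative):

```yaml
# Illustrative sketch -- see download-flan-t5-small-to-minio-prod.yaml for the real Job.
apiVersion: batch/v1
kind: Job
metadata:
  name: download-flan
  namespace: production
spec:
  template:
    metadata:
      labels:
        dataset.0.id: "model-weights"   # Datashim mounts this Dataset...
        dataset.0.useas: "mount"        # ...at /mnt/datasets/model-weights
    spec:
      restartPolicy: Never
      containers:
      - name: downloader
        image: alpine/git               # assumed image with git available
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Clone to a scratch dir first, then copy: cloning straight into the
          # mount would hit the OpenShift permission issue noted above.
          git clone --depth 1 https://huggingface.co/google/flan-t5-small /tmp/model
          mkdir -p /mnt/datasets/model-weights/my-model
          cp -r /tmp/model/. /mnt/datasets/model-weights/my-model/
```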
## Creating the TGI deployments
As we mention in the article, we can now use the same Deployment file to serve the model in both namespaces. Run:

```bash
kubectl apply -n production -f deployment.yaml
kubectl apply -n staging -f deployment.yaml
```

to create the TGI deployments. We can wait for TGI to be ready using:

```bash
kubectl wait pod -n production -l run=text-generation-inference --for=condition=Ready --timeout=-1s
kubectl wait pod -n staging -l run=text-generation-inference --for=condition=Ready --timeout=-1s
```
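For context, here is a minimal sketch of what such a deployment.yaml can look like. The image tag, port, Dataset name, and model path are assumptions; the provided file is the actual manifest:

```yaml
# Illustrative sketch -- deployment.yaml in this folder is the actual manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-generation-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      run: text-generation-inference
  template:
    metadata:
      labels:
        run: text-generation-inference
        dataset.0.id: "model-weights"   # same Dataset name in both namespaces,
        dataset.0.useas: "mount"        # so one file serves staging and production
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:latest
        args:
        - "--model-id"
        - "/mnt/datasets/model-weights/my-model"  # assumed model path
        - "--port"
        - "8080"
        ports:
        - containerPort: 8080
```

Because the Dataset carries all the namespace-specific configuration, the Deployment itself stays identical across environments.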
## Creating the TGI service
We can now create a Service in both namespaces as follows:

```bash
kubectl apply -n production -f service.yaml
kubectl apply -n staging -f service.yaml
```
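A sketch of the corresponding service.yaml, with the port numbers assumed from the port-forward commands below:

```yaml
# Illustrative sketch -- service.yaml in this folder is the actual manifest.
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  selector:
    run: text-generation-inference   # matches the Deployment's pod label
  ports:
  - port: 8080          # port targeted by the kubectl port-forward commands below
    targetPort: 8080    # assumed TGI container port
```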
## Validating the deployment
We can now forward the Services exposing TGI as follows:

```bash
kubectl port-forward -n production --address localhost svc/text-generation-inference 8888:8080 &
kubectl port-forward -n staging --address localhost svc/text-generation-inference 8889:8080 &
```
and run an inference request against each with:

```bash
curl -s http://localhost:8888/generate -X POST -d '{"inputs":"The square root of x is the cube root of y. What is y to the power of 2, if x = 4?", "parameters":{"max_new_tokens":1000}}' -H 'Content-Type: application/json' | jq -r .generated_text
curl -s http://localhost:8889/generate -X POST -d '{"inputs":"The square root of x is the cube root of y. What is y to the power of 2, if x = 4?", "parameters":{"max_new_tokens":1000}}' -H 'Content-Type: application/json' | jq -r .generated_text
```
flan-t5-small should reply almost immediately with:

```
0
```

flan-t5-base will instead take a while to reply with:

```
x = 4 * 2 = 8 x = 16 y = 16 to the power of 2
```
## Next Steps

- **Get started with Datashim**: check out our User Guide and get up and running in minutes.
- **Any questions?** Find answers to frequently asked questions in our FAQ.