Deploy the BERT model for the text classification task with Hugging Face LLM Serving Runtime
In this example, we demonstrate how to deploy a distilBERT model from Hugging Face for the sequence classification (a.k.a. text classification) task by deploying an InferenceService with the Hugging Face Serving runtime.
Serve the Hugging Face LLM model using the V1 Protocol
First, we will deploy the distilBERT model using the Hugging Face backend with the V1 protocol.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-distilbert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=distilbert
        - --model_id=distilbert/distilbert-base-uncased-finetuned-sst-2-english
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
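Optionally, you can block until the InferenceService reports readiness before moving on. The command below is a minimal sketch; it assumes the service was created in the default namespace and relies on kubectl waiting on the resource's Ready condition.
# Wait for the InferenceService to report Ready (assumes the default namespace)
kubectl wait --for=condition=Ready inferenceservice/huggingface-distilbert --timeout=600s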
Check InferenceService status.
kubectl get inferenceservices huggingface-distilbert
Expected Output
NAME                     URL                                                  READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                               AGE
huggingface-distilbert   http://huggingface-distilbert.default.example.com   True           100                              huggingface-distilbert-predictor-default-47q2g   7d23h
Perform Model Inference
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
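How you obtain these values depends on your ingress setup. The snippet below is a sketch that assumes KServe is exposed through the Istio ingress gateway service istio-ingressgateway in the istio-system namespace with an external LoadBalancer IP; adjust the service name, namespace, and port name for your environment.
# Example (assumes an Istio ingress gateway with an external LoadBalancer IP)
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')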
MODEL_NAME=distilbert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-distilbert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["Hello, my dog is cute", "I am feeling sad"]}'
Expected Output
{"predictions":[1,0]}
Serve the Hugging Face LLM model using the Open Inference Protocol (V2 Protocol)
First, we will deploy the distilBERT model using the Hugging Face backend with the Open Inference Protocol (V2 protocol). For this, we need to set the protocolVersion field to v2.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-distilbert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=distilbert
        - --model_id=distilbert/distilbert-base-uncased-finetuned-sst-2-english
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
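Note that this manifest reuses the name huggingface-distilbert, so applying it updates the InferenceService from the V1 example in place rather than creating a second service; pick a different metadata.name if you want to keep both deployments side by side.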
Check InferenceService status.
kubectl get inferenceservices huggingface-distilbert
Expected Output
NAME                     URL                                                  READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                               AGE
huggingface-distilbert   http://huggingface-distilbert.default.example.com   True           100                              huggingface-distilbert-predictor-default-47q2g   7d23h
Perform Model Inference
As in the V1 example, first determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=distilbert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-distilbert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["Hello, my dog is cute", "I am feeling sad"]}]}'
Expected Output
{
  "model_name": "distilbert",
  "model_version": null,
  "id": "e4bcfc28-e9f2-4c2a-b61f-c491e7346528",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2],
      "datatype": "INT64",
      "parameters": null,
      "data": [1, 0]
    }
  ]
}
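As in the V1 response, the data field of the output tensor contains the predicted class indices, with 1 indicating POSITIVE and 0 indicating NEGATIVE sentiment.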