Deploy the BERT model for the text classification task with Hugging Face LLM Serving Runtime
In this example, we demonstrate how to deploy a distilBERT model from Hugging Face for the sequence classification (a.k.a. text classification) task by deploying an InferenceService with the Hugging Face Serving runtime.
Serve the Hugging Face LLM model using the V1 Protocol
First, we will deploy the distilBERT model using the Hugging Face backend with the V1 protocol.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-distilbert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=distilbert
        - --model_id=distilbert/distilbert-base-uncased-finetuned-sst-2-english
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
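Optionally, you can block until the InferenceService reports readiness before moving on. The command below is a minimal sketch; it assumes the service was created in the default namespace and relies on kubectl waiting on the resource's Ready condition.
# Wait for the InferenceService to report Ready (assumes the default namespace)
kubectl wait --for=condition=Ready inferenceservice/huggingface-distilbert --timeout=600s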
Check InferenceService status.
kubectl get inferenceservices huggingface-distilbert
Expected Output
NAME                     URL                                                  READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                               AGE
huggingface-distilbert   http://huggingface-distilbert.default.example.com   True           100                              huggingface-distilbert-predictor-default-47q2g   7d23h
Perform Model Inference
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
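How you obtain these values depends on your ingress setup. The snippet below is a sketch that assumes KServe is exposed through the Istio ingress gateway service istio-ingressgateway in the istio-system namespace with an external LoadBalancer IP; adjust the service name, namespace, and port name for your environment.
# Example (assumes an Istio ingress gateway with an external LoadBalancer IP)
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')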
MODEL_NAME=distilbert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-distilbert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["Hello, my dog is cute", "I am feeling sad"]}'
Expected Output
{"predictions":[1,0]}
Serve the Hugging Face LLM model using the Open Inference Protocol (V2 Protocol)
First, we will deploy the distilBERT model using the Hugging Face backend with the Open Inference Protocol (V2 protocol). For this, we need to set the protocolVersion field to v2.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-distilbert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=distilbert
        - --model_id=distilbert/distilbert-base-uncased-finetuned-sst-2-english
      resources:
        limits:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
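Note that this manifest reuses the name huggingface-distilbert, so applying it updates the InferenceService from the V1 example in place rather than creating a second service; pick a different metadata.name if you want to keep both deployments side by side.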
Check InferenceService status.
kubectl get inferenceservices huggingface-distilbert
Expected Output
NAME                     URL                                                  READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                               AGE
huggingface-distilbert   http://huggingface-distilbert.default.example.com   True           100                              huggingface-distilbert-predictor-default-47q2g   7d23h
Perform Model Inference
As in the V1 example, first determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=distilbert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-distilbert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["Hello, my dog is cute", "I am feeling sad"]}]}'
Expected Output
{
  "model_name": "distilbert",
  "model_version": null,
  "id": "e4bcfc28-e9f2-4c2a-b61f-c491e7346528",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2],
      "datatype": "INT64",
      "parameters": null,
      "data": [1, 0]
    }
  ]
}
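As in the V1 response, the data field of the output tensor contains the predicted class indices, with 1 indicating POSITIVE and 0 indicating NEGATIVE sentiment.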