Deploy the BERT model for token classification with the Hugging Face LLM Serving Runtime
In this example, we demonstrate how to deploy a BERT model from Hugging Face for the token classification task by creating an InferenceService with the Hugging Face serving runtime.
Serve the Hugging Face LLM model using the V1 Protocol
First, we deploy the BERT model using the Hugging Face backend with the V1 Protocol.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=bert
        - --model_id=dslim/bert-base-NER
        - --disable_lower_case
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
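Optionally, you can watch the predictor pod come up before checking the InferenceService status. This is a minimal sketch; it assumes the serving.kserve.io/inferenceservice label that KServe applies to predictor pods:
kubectl get pods -l serving.kserve.io/inferenceservice=huggingface-bert -w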
Check InferenceService status.
kubectl get inferenceservices huggingface-bert
Expected Output
NAME               URL                                            READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                         AGE
huggingface-bert   http://huggingface-bert.default.example.com    True           100                              huggingface-bert-predictor-default-47q2g   7d23h
Perform Model Inference
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
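How you obtain these values depends on your cluster's ingress setup. The following is a minimal sketch, assuming the default Istio ingress gateway exposed as a LoadBalancer service in the istio-system namespace; adjust the namespace, service name, and port name for your environment:
# Assumes an Istio ingress gateway LoadBalancer service
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')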
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["My name is Wolfgang and I live in Berlin", "My name is Lisa and I live in Paris"]}'
Expected Output
{"predictions":[[[0,0,0,0,3,0,0,0,0,7,0]],[[0,0,0,0,3,0,0,0,0,7,0]]]}
Serve the Hugging Face LLM model using the Open Inference Protocol (V2 Protocol)
Next, we deploy the BERT model using the Hugging Face backend with the Open Inference Protocol (V2 Protocol). For this, we need to set the protocolVersion field to v2.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-bert
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      protocolVersion: v2
      args:
        - --model_name=bert
        - --model_id=dslim/bert-base-NER
        - --disable_lower_case
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2Gi
          nvidia.com/gpu: "1"
EOF
Check InferenceService status.
kubectl get inferenceservices huggingface-bert
Expected Output
NAME               URL                                            READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                         AGE
huggingface-bert   http://huggingface-bert.default.example.com    True           100                              huggingface-bert-predictor-default-47q2g   7d23h
Perform Model Inference
As in the V1 example above, determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["My name is Wolfgang and I live in Berlin", "My name is Lisa and I live in Paris"]}]}'
Expected Output
{
  "model_name": "bert",
  "model_version": null,
  "id": "3117e54b-8a6a-4072-9d87-6d7bdfe05eed",
  "parameters": null,
  "outputs": [
    {
      "name": "output-0",
      "shape": [2, 1, 11],
      "datatype": "INT64",
      "parameters": null,
      "data": [0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 0]
    }
  ]
}
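Unlike the V1 response, the V2 response flattens both inputs into a single data array; the shape field ([2, 1, 11]) tells you how to split it back into per-sentence label sequences. A minimal post-processing sketch, assuming the response above was saved to a hypothetical response.json and jq is available:
# Split the flat "data" array into the two 11-token label sequences described by "shape"
jq '.outputs[0].data | [ .[0:11], .[11:22] ]' response.json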