Announcing: KServe v0.11
We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes and made enhancements to the KServe control plane, Python SDK Open Inference Protocol support, and dependency management. For ModelMesh we have added PVC, HPA, and payload logging features to ensure feature parity with KServe.
Here is a summary of the key changes:
KServe Core Inference Enhancements
- Support path-based routing as an alternative to host-based routing. The URL of the InferenceService could look like http://<ingress_domain>/serving/<namespace>/<isvc_name>. Please refer to the doc for how to enable path-based routing.
- Introduced a priority field for the Serving Runtime custom resource to handle the case when multiple serving runtimes support the same model formats; see the serving runtime doc for more details.
- Introduced the Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes. An example use case is private model registry integration:
    apiVersion: "serving.kserve.io/v1alpha1"
    kind: ClusterStorageContainer
    metadata:
      name: default
    spec:
      container:
        name: storage-initializer
        image: kserve/model-registry:latest
        resources:
          requests:
            memory: 100Mi
            cpu: 100m
          limits:
            memory: 1Gi
            cpu: "1"
      supportedUriFormats:
        - prefix: model-registry://
- Inference Graph enhancements improve the API spec to support pod affinity and resource requirement fields. A Dependency field with options Soft and Hard is introduced to handle error responses from the inference steps and decide whether to short-circuit the request in case of errors. See the following example with hard dependencies on the node steps:
    apiVersion: serving.kserve.io/v1alpha1
    kind: InferenceGraph
    metadata:
      name: graph_with_switch_node
    spec:
      nodes:
        root:
          routerType: Sequence
          steps:
            - name: "rootStep1"
              nodeName: node1
              dependency: Hard
            - name: "rootStep2"
              serviceName: {{ success_200_isvc_id }}
        node1:
          routerType: Switch
          steps:
            - name: "node1Step1"
              serviceName: {{ error_404_isvc_id }}
              condition: "[@this].#(decision_picker==ERROR)"
              dependency: Hard
- Improved the InferenceService debugging experience by adding the aggregated RoutesReady status and LastDeploymentReady condition to the InferenceService Status to differentiate the endpoint and deployment status. This applies to the serverless mode; for more details refer to the API docs.
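To make the Soft and Hard dependency options more concrete, here is a minimal Python sketch of a Sequence-style router. It is a simplified model and not the KServe router implementation: the `Step` class, the `run_sequence` function, and the two handler functions are hypothetical names, and the choice to carry the previous body forward after a Soft failure is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A step handler returns (http_status, body). These stand in for calls to
# real InferenceServices and are purely hypothetical.
StepFn = Callable[[dict], Tuple[int, dict]]


@dataclass
class Step:
    name: str
    call: StepFn
    dependency: str = "Soft"  # "Soft" or "Hard", mirroring the spec field


def run_sequence(steps: List[Step], request: dict) -> Tuple[int, dict]:
    """Run steps in order, passing each successful response to the next step."""
    body = request
    for step in steps:
        status, response = step.call(body)
        failed = status >= 400
        if failed and step.dependency == "Hard":
            # Hard dependency: short-circuit and return the error as-is.
            return status, response
        if not failed:
            body = response
        # Soft dependency failure (assumption): ignore the error response
        # and continue with the previous body.
    return 200, body


def error_404(_body: dict) -> Tuple[int, dict]:
    return 404, {"error": "not found"}


def success_200(body: dict) -> Tuple[int, dict]:
    return 200, {**body, "ok": True}


hard_result = run_sequence(
    [Step("rootStep1", error_404, "Hard"), Step("rootStep2", success_200)],
    {"x": 1},
)
soft_result = run_sequence(
    [Step("rootStep1", error_404, "Soft"), Step("rootStep2", success_200)],
    {"x": 1},
)
```

With the Hard step, `rootStep2` never runs and the 404 is returned to the caller; with the Soft step, the sequence continues to `rootStep2`.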
Enhanced Python SDK Dependency Management
- KServe has adopted poetry to manage Python dependencies. You can now install the KServe SDK with locked dependencies using poetry install. While pip install still works, we highly recommend using poetry to ensure predictable dependency management.
- The KServe SDK is also slimmed down by making the cloud storage dependency optional. If you require the storage dependency for custom serving runtimes, you can still install it with pip install kserve[storage].
KServe Python Runtimes Improvements
- KServe Python runtimes, including sklearnserver, lgbserver, and xgbserver, now support the open inference protocol for both REST and gRPC.
- Logging improvements, including Uvicorn access logging and a default KServe logger.
- The Postprocess handler has been aligned with the open inference protocol, simplifying the underlying transport protocol complexities.
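As an illustration of what an open inference protocol REST request looks like, the following sketch builds the JSON body and path for an inference call using only the Python standard library. The helper names `build_infer_request` and `infer_path`, the input name `input-0`, and the sample rows are assumptions made for this example; they are not part of the KServe SDK.

```python
from typing import List


def build_infer_request(rows: List[List[float]], input_name: str = "input-0") -> dict:
    """Build an open inference protocol request body for a 2-D FP32 tensor."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "inputs": [
            {
                "name": input_name,            # hypothetical input tensor name
                "shape": [n_rows, n_cols],
                "datatype": "FP32",
                # Tensor data flattened in row-major order.
                "data": [value for row in rows for value in row],
            }
        ]
    }


def infer_path(model_name: str) -> str:
    """REST path of the inference endpoint for a model."""
    return f"/v2/models/{model_name}/infer"


payload = build_infer_request([[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]])
path = infer_path("sklearn-iris")
```

The resulting `payload` can be serialized with `json.dumps` and sent as a POST request to `path` on the InferenceService host.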
LLM Runtimes
TorchServe LLM Runtime
KServe now integrates with TorchServe 0.8, offering support for LLMs that may not fit onto a single GPU. Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. See the detailed example for how to serve the LLM on KServe with the TorchServe runtime.
vLLM Runtime
Serving LLMs can be surprisingly slow even on high-end GPUs. vLLM is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Huggingface transformers. It supports continuous batching for increased throughput and GPU utilization, and paged attention to address the memory bottleneck: during autoregressive decoding, all the attention key and value tensors (the KV cache) are kept in GPU memory to generate the next tokens.
The example shows how to deploy vLLM on KServe, and we expect further integration in KServe 0.12 with the proposed generate endpoint for the open inference protocol.
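To give a sense of why the KV cache becomes a memory bottleneck, the sketch below estimates its size for a decoder that uses standard multi-head attention. The model dimensions (40 layers, hidden size 5120, 16-bit values) and the block size of 16 tokens are hypothetical values chosen for illustration, not figures from this release or from vLLM's configuration.

```python
import math


def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Total KV cache bytes for a batch of sequences of equal length."""
    per_token = kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value)
    return per_token * seq_len * batch_size


def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Number of fixed-size cache blocks needed to hold seq_len tokens."""
    return math.ceil(seq_len / block_size)


# Hypothetical model: 40 layers, hidden size 5120, fp16 values.
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
one_sequence_gib = kv_cache_bytes(40, 5120, seq_len=2048) / 2**30
blocks_for_100_tokens = blocks_needed(100)
```

Under these assumptions each token costs 819,200 bytes, so a single 2048-token sequence needs about 1.56 GiB of cache. Paged attention stores the cache in fixed-size blocks allocated on demand, so a 100-token sequence would occupy 7 blocks of 16 tokens rather than a reservation sized for the maximum sequence length.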
ModelMesh Updates
Storing Models on Kubernetes Persistent Volumes (PVC)
ModelMesh now allows model files to be mounted directly onto serving runtime pods using Kubernetes Persistent Volumes. Depending on the selected storage solution, this approach can significantly reduce latency when deploying new predictors and potentially remove the need for additional cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.
Horizontal Pod Autoscaling (HPA)
Kubernetes Horizontal Pod Autoscaling can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a HorizontalPodAutoscaler automatically updates the serving runtime deployment with the number of Pods to best match the demand.
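The replica count that a HorizontalPodAutoscaler computes follows the documented Kubernetes formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below reproduces that calculation in Python; the function name, the replica bounds, and the sample metric values are assumptions for illustration, and the real controller applies further rules (such as stabilization windows) that are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(2, current_metric=90, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
unchanged = desired_replicas(3, current_metric=62, target_metric=60)
capped = desired_replicas(5, current_metric=300, target_metric=60)
```

For example, two serving runtime Pods at 90% average utilization against a 60% target scale up to three Pods, while a reading of 62% stays within the tolerance and leaves the replica count unchanged.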
Model Metrics, Metrics Dashboard, Payload Event Logging
ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, for example by allocating more resources or increasing the number of replicas to improve responsiveness or avoid frequent cache misses.
A new Grafana dashboard was added to display the comprehensive set of Prometheus metrics like model loading and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.
The new PayloadProcessor interface can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.
What's Changed?
- To allow longer InferenceService names due to DNS max length limits from issue, the Default suffix in the inference service component (predictor/transformer/explainer) name has been removed for newly created InferenceServices. This affects clients that use the component URL directly instead of the top-level InferenceService URL.
- Status.address.url is now consistent for both serverless and raw deployment modes; the URL path portion is dropped in serverless mode.
- Raw bytes are now accepted in the v1 protocol. If content-type is specified, setting the header to application/json is required to recognize and decode the JSON payload.
    curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json
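For clients written in Python, a request equivalent to the curl command can be built with the standard library as sketched below. The helper name `build_v1_predict_request`, the `example.com` domain standing in for `${CUSTOM_DOMAIN}`, and the sample instances are assumptions for illustration; the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import List


def build_v1_predict_request(base_url: str, model_name: str,
                             instances: List[List[float]]) -> urllib.request.Request:
    """Build a v1 protocol predict request with an explicit JSON content type."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Declares the payload as JSON so the server decodes it accordingly.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_v1_predict_request(
    "http://sklearn-iris.kserve-test.example.com",  # hypothetical domain
    "sklearn-iris",
    [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
)
```

Passing `request` to `urllib.request.urlopen` would send it to the InferenceService.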
For a complete change list please read the release notes from KServe v0.11 and ModelMesh v0.11.
Join the community
- Visit our Website or GitHub
- Join the Slack (#kserve)
- Attend our community meeting by subscribing to the KServe calendar.
- View our community GitHub repository to learn how to make contributions.
We are excited to work with you to make KServe better and promote its adoption!
Thanks to all the contributors who have made commits to the 0.11 release!
The KServe Working Group