Hugging Face LLM Serving Runtime¶
The Hugging Face serving runtime implements two backends, Hugging Face and vLLM, that can serve Hugging Face models out of the box. The preprocess and post-process handlers are already implemented for the different ML tasks, for example text classification, token classification, text generation, text2text generation, and fill mask.
By default, the KServe Hugging Face runtime uses the vLLM backend to serve text generation and text2text generation LLM models, giving faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels. If the model is not supported by the vLLM engine, KServe falls back to the Hugging Face backend as a failsafe.
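For example, an InferenceService that uses this runtime can be created with the KServe Python SDK, as in the minimal sketch below; the service name, namespace, model id, and resource requests are placeholder values, and the `huggingface` model format is what selects the Hugging Face runtime.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
)

# Sketch of an InferenceService backed by the Hugging Face runtime.
# The name, namespace, model id, and resource sizes are placeholders.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="huggingface-llama3", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                args=[
                    "--model_name=llama3",
                    "--model_id=meta-llama/Meta-Llama-3-8B-Instruct",
                ],
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "6", "memory": "24Gi", "nvidia.com/gpu": "1"},
                    limits={"cpu": "6", "memory": "24Gi", "nvidia.com/gpu": "1"},
                ),
            )
        )
    ),
)

# Submit the InferenceService to the cluster.
KServeClient().create(isvc)
```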
Supported ML Tasks¶
The Hugging Face runtime supports the following ML tasks:
- Text Generation
- Text2Text Generation
- Fill Mask
- Token Classification
- Sequence Classification (Text Classification)
For the models supported by the vLLM backend, please visit the vLLM Supported Models page.
API Endpoints¶
Both backends support serving generative models (text generation and text2text generation) using OpenAI's Completion and Chat Completion APIs.
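For example, a text generation model served by this runtime can be queried with the OpenAI Python client; in this sketch the host is a placeholder and the model name is assumed to match the `--model_name` the server was deployed with.

```python
from openai import OpenAI

# Placeholder host: use the external URL of your InferenceService.
# The "/openai/v1" prefix is where the runtime exposes its OpenAI-compatible API.
client = OpenAI(
    base_url="http://huggingface-llama3.default.example.com/openai/v1",
    api_key="not-used",  # placeholder; authentication, if any, is handled by your gateway
)

response = client.chat.completions.create(
    model="llama3",  # assumed to match the --model_name argument
    messages=[{"role": "user", "content": "Write a haiku about model serving."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```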
Other task types, such as token classification, sequence classification, and fill mask, are served using KServe's Open Inference Protocol or V1 API.
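For instance, a sequence classification model can be queried through the V1 protocol with a plain HTTP request; the host and model name below are placeholders.

```python
import requests

# Placeholder host and model name for a sequence classification InferenceService.
url = "http://huggingface-sst2.default.example.com/v1/models/sst2:predict"
payload = {"instances": ["Hello, world!", "This runtime makes serving easy."]}

response = requests.post(url, json=payload)
print(response.json())  # e.g. {"predictions": [...]}
```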
Examples¶
The following examples demonstrate how to deploy and perform inference using the Hugging Face runtime with different ML tasks:
- Text Generation using Llama3
- Text2Text Generation using T5
- Token Classification using BERT
- Sequence Classification (Text Classification) using DistilBERT
- Fill Mask using BERT
Note
The Hugging Face runtime image has the following environment variables set by default:

- `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
- `HF_HUB_DISABLE_TELEMETRY` is set by default to disable telemetry.
Hugging Face Runtime Arguments¶
Below is an explanation of the command line arguments supported by the Hugging Face runtime. vLLM backend engine arguments can also be specified on the command line and are parsed by the Hugging Face runtime.
- `--model_name`: The name of the model used on the endpoint path.
- `--model_dir`: The local path where the model is downloaded to. If `model_id` is provided, this argument is ignored.
- `--model_id`: Hugging Face model id.
- `--model_revision`: Hugging Face model revision.
- `--tokenizer_revision`: Hugging Face tokenizer revision.
- `--dtype`: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'. Defaults to float16 for GPU and float32 for CPU systems. 'auto' uses float16 if a GPU is available and float32 otherwise, to ensure consistency between the vLLM and Hugging Face backends. Encoder models default to 'float32'. 'float' is shorthand for 'float32' and 'half' for 'float16'. The rest are as the name reads.
- `--task`: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'. If not provided, the model server will try to infer the task from the model architecture.
- `--backend`: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'.
- `--max_length`: Max sequence length for the tokenizer.
- `--disable_lower_case`: Disable lower casing in the tokenizer.
- `--disable_special_tokens`: The sequences will not be encoded with the special tokens relative to the model.
- `--trust_remote_code`: Allow loading of models and tokenizers with custom code.
- `--tensor_input_names`: The tensor input names passed to the model for the Triton Inference Server backend.
- `--return_token_type_ids`: Return token type ids.
- `--return_probabilities`: Return probabilities of predicted indexes. This is only applicable for the 'sequence_classification', 'token_classification' and 'fill_mask' tasks.
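These are the same arguments you would place in the args of an InferenceService predictor. As a rough sketch, they can also be passed when launching the runtime locally, assuming the huggingfaceserver package is installed; the model id, model name, and task below are placeholder values.

```python
import subprocess

# Launch the Hugging Face runtime locally against a Hugging Face Hub model (sketch only);
# the model id, model name, and task are placeholders.
subprocess.run(
    [
        "python", "-m", "huggingfaceserver",
        "--model_id=distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "--model_name=sst2",
        "--task=sequence_classification",
        "--backend=huggingface",
    ],
    check=True,
)
```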