High Performance LLM with vLLM
Deploy a language model with vLLM
vLLM is a Python-based package that optimizes the attention layer in Transformer models through its PagedAttention algorithm. By allocating the memory used during the attention computation more efficiently, vLLM can reduce a model's memory footprint and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease.
vLLM supports multiple types of endpoints:
- Completions — Follows the same API as the OpenAI Completions API
- ChatCompletions — Follows the same API as the OpenAI ChatCompletions API
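For example, assuming OpenAI-style request bodies (the exact fields accepted by your deployment may differ), a Completions request takes a single prompt string:

```json
{
  "prompt": "What is the meaning of life?"
}
```

while a ChatCompletions request takes a list of chat messages:

```json
{
  "messages": [
    {"role": "user", "content": "What is the meaning of life?"}
  ]
}
```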
Select which vLLM-compatible model you’d like to use
The model_server parameter allows you to specify vLLM as the model server.
Another important parameter to configure if you are choosing vLLM is predict_concurrency.
One of the main benefits of vLLM is continuous batching, in which multiple requests are processed at the same time. Without increasing predict_concurrency, you cannot take advantage of this feature.
The remaining config options are standard Truss config options.
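For reference, here is a rough sketch of what a complete config.yaml for a vLLM deployment might look like. The model name, resources, and build argument keys are illustrative assumptions; check the Truss config reference for the exact schema supported by your Truss version.

```yaml
build:
  arguments:
    # Hugging Face model to serve (illustrative choice)
    model: facebook/opt-125M
    # One of the endpoint types listed above
    endpoint: Completions
  model_server: VLLM
model_name: vllm-opt-125m
python_version: py39
requirements: []
resources:
  # An A10G comfortably fits this small example model; size the
  # accelerator and memory to the model you actually choose
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
runtime:
  # Number of requests the model container accepts at once, so that
  # vLLM's continuous batching can process them together
  predict_concurrency: 256
secrets: {}
system_packages: []
```

Setting predict_concurrency well above one is what allows concurrent requests to reach vLLM and be batched together; the illustrative value of 256 should be tuned to your expected traffic and GPU memory.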
Deploy the model
Deploy the vLLM model like you would other Trusses, with:
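```sh
# Push the Truss from its project directory
truss push
```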
You can then invoke the model with:
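```sh
# Example call for the Completions endpoint; adjust the payload to match
# the endpoint you configured
truss predict -d '{"prompt": "What is the meaning of life?"}'
```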