config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
runtime:
  predict_concurrency: 128
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
model_name: OPT-125M vLLM
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []

vLLM is a Python-based package that optimizes the attention layer in Transformer models. By better allocating the memory used during attention computation, vLLM can reduce a model's memory footprint and significantly speed up inference. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease.
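Independent of Truss, you can also exercise vLLM directly in Python with its offline inference API. The following is a minimal sketch, assuming vllm is installed (pip install vllm) and a GPU is available; the sampling values are illustrative, not tuned.

from vllm import LLM, SamplingParams

# Load the model into vLLM's memory-optimized engine.
# facebook/opt-125m is small enough for a single T4.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts together automatically.
prompts = [
    "What is the meaning of life?",
    "What is a large language model?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)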

The build section of config.yaml configures the model server and its arguments.

config.yaml
build:
  arguments:

vLLM supports multiple types of endpoints. This example uses the Completions endpoint:

config.yaml
    endpoint: Completions

Select which vLLM-compatible model you'd like to use:

config.yaml
    model: facebook/opt-125M

The model_server parameter allows you to specify vLLM as your model server:

config.yaml
  model_server: VLLM

Another important parameter to configure when choosing vLLM is predict_concurrency. One of the main benefits of vLLM is continuous batching, in which multiple requests are processed at the same time. Unless you raise predict_concurrency above its default of 1, requests are handled one at a time and you cannot take advantage of this feature.

config.yaml
runtime:
  predict_concurrency: 128
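To see continuous batching in action once the model is deployed (see below), send several requests concurrently. This is a hedged sketch: the endpoint URL and API key are placeholders, and the URL format shown is an assumption to check against your own deployment.

import concurrent.futures
import requests

# Placeholder endpoint and key; substitute your deployment's values.
URL = "https://model-<model-id>.api.baseten.co/production/predict"
HEADERS = {"Authorization": "Api-Key <your-api-key>"}

def ask(prompt):
    payload = {"prompt": prompt, "model": "facebook/opt-125M"}
    return requests.post(URL, headers=HEADERS, json=payload).json()

prompts = [f"Question {i}: what is the meaning of life?" for i in range(8)]

# With predict_concurrency: 128, these requests reach the server together
# and vLLM batches them instead of queuing them one behind another.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(ask, prompts):
        print(result)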

The remaining config options listed are standard Truss Config options.

config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
model_name: OPT-125M vLLM
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []

Deploy the model

Deploy the vLLM model like you would any other Truss:

$ truss push

You can then invoke the model with:

$ truss predict -d '{"prompt": "What is a large language model?", "model": "facebook/opt-125M"}' --published
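You can also call the deployed model over HTTP from Python. The URL and API key below are placeholders for a Baseten deployment; substitute the values for your own model.

import requests

# Placeholder URL and key; use the values for your deployment.
resp = requests.post(
    "https://model-<model-id>.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key <your-api-key>"},
    json={"prompt": "What is a large language model?", "model": "facebook/opt-125M"},
)
resp.raise_for_status()
print(resp.json())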