High Performance LLM with vLLM
Deploy a language model with vLLM
vLLM is a Python-based package that optimizes the attention layer in Transformer models through its PagedAttention algorithm. By allocating the memory used during the attention computation more efficiently, vLLM can reduce a model's memory footprint and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease.
vLLM supports multiple types of endpoints:
- Completions — Follows the same API as the OpenAI Completions API
- ChatCompletions — Follows the same API as the OpenAI ChatCompletions API
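For example, assuming OpenAI-style request bodies (the exact fields accepted by your deployment may differ), a Completions request takes a single prompt string:

```json
{
  "prompt": "What is the meaning of life?"
}
```

while a ChatCompletions request takes a list of chat messages:

```json
{
  "messages": [
    {"role": "user", "content": "What is the meaning of life?"}
  ]
}
```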
Select which vLLM-compatible model you’d like to use
The model_server parameter allows you to specify vLLM as the model server.
Another important parameter to configure if you are choosing vLLM is predict_concurrency.
One of the main benefits of vLLM is continuous batching, in which multiple requests are processed at the same time. Without increasing predict_concurrency, you cannot take advantage of this feature.
The remaining config options are standard Truss config options.
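For reference, here is a rough sketch of what a complete config.yaml for a vLLM deployment might look like. The model name, resources, and build argument keys are illustrative assumptions; check the Truss config reference for the exact schema supported by your Truss version.

```yaml
build:
  arguments:
    # Hugging Face model to serve (illustrative choice)
    model: facebook/opt-125M
    # One of the endpoint types listed above
    endpoint: Completions
  model_server: VLLM
model_name: vllm-opt-125m
python_version: py39
requirements: []
resources:
  # An A10G comfortably fits this small example model; size the
  # accelerator and memory to the model you actually choose
  accelerator: A10G
  cpu: "4"
  memory: 16Gi
  use_gpu: true
runtime:
  # Number of requests the model container accepts at once, so that
  # vLLM's continuous batching can process them together
  predict_concurrency: 256
secrets: {}
system_packages: []
```

Setting predict_concurrency well above one is what allows concurrent requests to reach vLLM and be batched together; the illustrative value of 256 should be tuned to your expected traffic and GPU memory.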
Deploy the model
Deploy the vLLM model like you would other Trusses, with:
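```sh
# Push the Truss from its project directory
truss push
```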
You can then invoke the model with:
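```sh
# Example call for the Completions endpoint; adjust the payload to match
# the endpoint you configured
truss predict -d '{"prompt": "What is the meaning of life?"}'
```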