High Performance LLM with TGI
Deploy a language model with TGI
TGI (Text Generation Inference) is a model server optimized for language models. In this example, we put together a Truss that serves Falcon 7B using TGI.
For Trusses that use TGI, there is no user code to define, so there is only a `config.yaml` file. You can run any model supported by TGI.
The `endpoint` argument, set in the config sketch below, has two options:
- `generate`: returns the full response as JSON once generation is complete
- `generate_stream`: streams results as they are generated, using server-sent events
Select the model that you’d like to use with TGI
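Here is a minimal sketch of the `build` section of `config.yaml` for this example. The `model_id` value is an assumption for illustration (any model supported by TGI works), and `generate_stream` is chosen here to show the streaming option:

```yaml
build:
  # Tell Truss to use the TGI model server as the backend
  model_server: TGI
  arguments:
    # Hugging Face model ID to serve; tiiuae/falcon-7b is an assumed example value
    model_id: tiiuae/falcon-7b
    # generate returns one JSON response; generate_stream streams server-sent events
    endpoint: generate_stream
```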
The `model_server` parameter allows you to specify a supported model serving backend; in this example, TGI.

Another important parameter to configure when using TGI is `predict_concurrency`. One of the main benefits of TGI is continuous batching, in which multiple requests are processed at the same time. Without `predict_concurrency` set to a high enough value, you cannot take advantage of this feature.
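A sketch of the corresponding runtime setting; 128 is an assumed value, and the right number depends on your hardware and traffic:

```yaml
runtime:
  # Allow many requests into the model container at once so TGI can batch them
  predict_concurrency: 128
```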
The remaining options in the config are standard Truss Config options.
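For completeness, a hedged sketch of standard options that typically accompany the TGI-specific ones; the accelerator and memory values are assumptions, not requirements:

```yaml
model_name: falcon-7b-tgi
resources:
  # Assumed GPU; pick one with enough memory for the model's weights
  accelerator: A10G
  use_gpu: true
  cpu: "4"
  memory: 16Gi
```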
Deploy the model
Deploy the TGI model like you would any other Truss.
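Assuming the standard Truss CLI workflow, that is typically a single command:

```sh
truss push
```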
You can then invoke the deployed model.
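A sketch of an invocation with the Truss CLI; the payload follows TGI's `inputs` format, and the prompt is only a placeholder:

```sh
truss predict -d '{"inputs": "What is a large language model?"}'
```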