What is TGI?
TGI consists of two parts:

- A high-performance, Rust-based server
- A set of optimized model implementations that outperform generic implementations

The optimized implementations cover models including:
- Mistral 7B
- Llama V2
- Llama
- MPT
- Code Llama
- Falcon 40B
- Falcon 7B
- FLAN-T5
- BLOOM
- Galactica
- GPT-Neox
- OPT
- SantaCoder
- Starcoder
How to use TGI
To define a TGI truss, we'll use the truss CLI to generate the scaffold. Run the following in your terminal:
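(A sketch of the scaffold command; it assumes a truss release whose `init` command supports a `--backend` flag, and `my-tgi-truss` is just an example target directory name.)

```sh
# Requires the truss CLI: pip install --upgrade truss
truss init --backend TGI my-tgi-truss
```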
You'll first be prompted for the `model_id` you'd like to use. Next, you'll be asked for the `endpoint` you'd like to use (a sketch of calling each endpoint appears at the end of this section):
- `generate`, which returns the entire generated response upon completion
- `generate_stream`, which streams the response as it's being generated
You can hit the `tab` key on any of these dialogues to see options for values.
Finally, you’ll be asked for the name of your model.
Deploying your TGI model
Now that we have a TGI model, let's deploy it to Baseten and see how it performs. You'll need an API key to deploy your model. You can get one by navigating to your Baseten settings page. To push the model to Baseten, run the following command:
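(A sketch of the push step; it assumes you run it from inside the generated target directory and that the truss CLI will prompt for, or pick up, your Baseten API key.)

```sh
cd my-tgi-truss   # hypothetical target directory created by truss init
truss push        # uploads and deploys the truss to your Baseten account
```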
Tuning your TGI server
After deploying your model, you may notice that you're not getting the performance you'd like out of TGI. If you navigate to the `target_directory` from above, you'll find a `config.yaml` file that contains a key called `build`. The following is a set of arguments you can pass under the `build` key to tune TGI for maximum performance (a filled-in sketch of this part of `config.yaml` follows the list):
- `max_input_length` (default: 1024): This parameter represents the maximum allowed input length, expressed in number of tokens.
- `max_total_tokens` (default: 2048): The maximum number of tokens in a single request, counting both the input tokens and the generated tokens.
- `max_batch_prefill_tokens` (default: 4096): This caps the number of tokens processed during the prefill operation across an entire batch, so it works in tandem with `max_input_length`. If your input tokens are constrained, this is worth setting as a function of (constrained input length) * (max batch size your hardware can handle); for example, the default of 4096 corresponds to a batch of four requests at the default `max_input_length` of 1024. This setting is also worth defining when you want to impose stricter controls on resource usage during prefill operations, especially when dealing with models that have a large footprint or under constrained hardware environments.
- `max_batch_total_tokens`: In contrast to `max_batch_prefill_tokens`, this represents the entire token count across a batch: the total input tokens plus the total generated tokens. In short, this value should be the top end of the number of tokens that can fit on the GPU after the model has been loaded. This value is particularly important for maximizing GPU utilization; the tradeoff is that higher values increase throughput but also increase individual request latency.
- `max_waiting_tokens`: This controls how many tokens the server generates for the running batch before it pauses to pull waiting requests into that batch.
- `sharded`: This controls whether the model is sharded across multiple GPUs on the instance.
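Putting this together, here's a rough sketch of what the `build` section of the generated `config.yaml` can look like with these arguments filled in. The exact schema depends on your truss version, and the values are illustrative rather than recommendations.

```yaml
build:
  model_server: TGI
  arguments:
    model_id: mistralai/Mistral-7B-v0.1   # example model from the list above
    max_input_length: 1024                # longest allowed prompt, in tokens
    max_total_tokens: 2048                # input + generated tokens per request
    max_batch_prefill_tokens: 4096        # e.g. max_input_length * a batch size of 4
    max_batch_total_tokens: 8192          # top end of tokens that fit on the GPU after model load
    max_waiting_tokens: 20
    sharded: false                        # set to true to shard across multiple GPUs
```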