A guide to using TGI for your model
You'll be asked for the model_id of the model you'd like to deploy, then for the endpoint you'd like to use. There are two options: generate, which returns the entire generated response upon completion, and generate_stream, which streams the response as it's being generated. You can press the tab key on any of these dialogues to see options for values.
Finally, you’ll be asked for the name of your model.
In the target_directory from above, you'll find a config.yaml file that contains a key build. The following is a set of arguments you can pass to the build key to tune TGI for max performance.
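As an example, the build section of a freshly initialized TGI Truss might look something like the sketch below. The model_server and arguments key names and the model_id value reflect a typical Truss TGI setup and are assumptions here; check the config.yaml generated for your model for the exact layout.

```yaml
# Illustrative config.yaml excerpt for a TGI deployment.
# Key names follow a typical Truss TGI configuration and may differ
# slightly from the file generated for your model.
build:
  model_server: TGI
  arguments:
    model_id: facebook/opt-125m   # the Hugging Face model chosen during init
    endpoint: generate_stream     # or "generate" for a single, complete response
```

In the sketches that follow, the tuning parameters are shown under the same arguments key.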
max_input_length (default: 1024)
This parameter represents the maximum allowed input length, expressed in number of tokens.

max_total_tokens (default: 2048)
This is the ceiling on the total size of a single request: the input tokens plus the tokens generated for that request.
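As a rough sketch of how these first two values interact (the numbers are illustrative, not recommendations):

```yaml
build:
  arguments:
    max_input_length: 1024   # a single prompt may be at most 1024 tokens
    max_total_tokens: 2048   # prompt + generated tokens per request may be at most 2048,
                             # so a full 1024-token prompt leaves up to 1024 tokens
                             # of room for the generated output
```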
max_batch_prefill_tokens (default: 4096)
This parameter caps the number of tokens processed during the prefill (prompt-processing) stage across a batch, so it should be at least as large as max_input_length.
Similar to max_input_length, if your input tokens are constrained, this is worth setting as a function of (constrained input length) * (max batch size your hardware can handle), as in the sketch below. This setting is also worth defining when you want stricter control over resource usage during prefill, especially for models with a large memory footprint or on constrained hardware.
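For instance, if your prompts never exceed 512 tokens and your hardware can comfortably prefill 8 requests at once, the arithmetic above works out as follows (illustrative numbers only):

```yaml
build:
  arguments:
    max_input_length: 512           # constrained input length
    max_batch_prefill_tokens: 4096  # 512 tokens per prompt * 8 concurrent prefills
```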
max_batch_total_tokens
Unlike max_batch_prefill_tokens, this represents the entire token count across a batch: the total input tokens plus the total generated tokens. In short, this value should be the upper limit on the number of tokens that can fit on the GPU after the model has been loaded.
This value is particularly important for maximizing GPU utilization. The tradeoff is that higher values increase throughput but also increase individual request latency.
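As an illustrative sketch, if you want to serve up to 32 concurrent requests, each capped at 2048 total tokens, the batch-level budget works out as below. The numbers assume the KV cache for that many tokens actually fits on your GPU once the model weights are loaded; you'd lower them otherwise.

```yaml
build:
  arguments:
    max_total_tokens: 2048           # per-request cap (input + generated)
    max_batch_total_tokens: 65536    # 2048 tokens * 32 concurrent requests
```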
max_waiting_tokens
This parameter defines how many tokens the running batch can process before waiting requests are forced into the batch (if the batch has room for them).

sharded
This flag controls whether the model is sharded across all available GPUs. It only applies when your instance has more than one GPU.
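Putting it all together, a tuned build section might look like the following sketch. The values are illustrative assumptions rather than recommendations, and sharded only pays off on instances with more than one GPU.

```yaml
build:
  model_server: TGI
  arguments:
    model_id: facebook/opt-125m
    endpoint: generate_stream
    max_input_length: 512
    max_total_tokens: 2048
    max_batch_prefill_tokens: 4096
    max_batch_total_tokens: 65536
    sharded: true                    # shard across all GPUs on a multi-GPU instance
```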