- `generate`: returns the full response as JSON once the entire output has been generated
- `generate_stream`: streams results back as they are ready, using server-sent events (SSE)
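As a rough sketch, the chosen endpoint might be set in `config.yaml` like this (assuming a Truss-style schema where TGI arguments live under `build.arguments`; the exact keys may differ in your version):

```yaml
build:
  model_server: TGI
  arguments:
    # Stream tokens via server-sent events; use "generate" instead
    # to receive the full response as a single JSON payload
    endpoint: generate_stream
```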
The `model_server` parameter allows you to specify a supported backend (in this example, TGI).
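A minimal sketch of this setting, assuming the `model_server` key sits under a `build` section as in Truss-style configs:

```yaml
build:
  model_server: TGI  # selects the Text Generation Inference backend
```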
`predict_concurrency`. One of the main benefits of TGI is continuous batching, in which multiple requests can be processed at the same time. Without `predict_concurrency` set to a high enough number, you cannot take advantage of this feature.
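A sketch of this setting, assuming `predict_concurrency` lives under a `runtime` section (as in Truss-style configs); the value 128 is an illustrative choice, not a recommendation:

```yaml
runtime:
  # Allow many requests in flight so TGI's continuous batching
  # can group them into shared forward passes
  predict_concurrency: 128
```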