In this example, we go through a Truss that serves an LLM and caches the weights
at build time. Loading model weights is often the most time-consuming
part of starting a model. Caching the weights at build time bakes them into the
Truss image, so they are available immediately when your model replica starts.
As a result, cold starts are significantly faster with this approach.
The changes needed to actually cache the weights at build time go in the
config.yaml file.
config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input: {"prompt": "What is the meaning of life?"}
model_name: Llama with Cached Weights
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.34.0
- sentencepiece==0.1.99
- protobuf==4.24.4
To cache model weights, set the model_cache key.
The repo_id field specifies a Hugging Face
repo to download and cache at build time, and the ignore_patterns
field lists files to skip. With model_cache set, the
repo does not have to be pulled at runtime.
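For example, a model_cache entry might look like the following. The repo ID here is illustrative; substitute the Hugging Face repo your model uses. The ignore_patterns entry skips .bin files on the assumption that safetensors weights are available, as in the requirements above:
model_cache:
- repo_id: meta-llama/Llama-2-7b-chat-hf # Hugging Face repo to cache at build time
  ignore_patterns: # files to skip; .bin weights ignored here since safetensors are used
  - "*.bin"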
Deploy the model as you would any other Truss, with:
$ truss push
The build step will take longer than for the standard
Llama Truss, since the model weights are now bundled during the build.
The deploy step and subsequent scale-ups, however, will be much faster.
You can then invoke the model with:
$ truss predict -d '{"inputs": "What is a large language model?"}'
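Once deployed on Baseten, the model can also be invoked over HTTP. As a sketch, assuming the standard Baseten model endpoint, where <MODEL_ID> is a placeholder for your model's ID and $BASETEN_API_KEY is assumed to hold your API key:
$ curl -X POST https://model-<MODEL_ID>.api.baseten.co/production/predict \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -d '{"inputs": "What is a large language model?"}'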