What is a “cold start”?
“Cold start” is a term used to refer to the time taken by a model to boot up after being idle. This process can become a critical factor in serverless environments, as it can significantly influence the model response time, customer satisfaction, and cost.Without caching our model’s weights, we would need to download weights every time we scale up. Caching model weights circumvents this download process. When our new instance boots up, the server automatically finds the cached weights and can proceed with starting up the endpoint.In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.Enabling Caching for a Model
To enable caching, simply addmodel_cache
to your config.yml
with a valid repo_id
. The model_cache
has a few key configurations:
repo_id
(required): The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.revision
: Points to your revision. This is only relevant if you are pulling By default, it refers tomain
.allow_patterns
: Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.ignore_patterns
: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.
We recently renamed
hf_cache
to model_cache
, but don’t worry! If you’re using hf_cache
in any of your projects, it will automatically be aliased to model_cache
.model_cache
for Stable Diffusion XL. Note how it only pulls the model weights that it needs using allow_patterns
.
config.yml
.bin
, .safetensors
, .h5
, .msgpack
, etc.). You only need one of these most of the time. To minimize cold starts, ensure that you only cache the weights you need.
Cache invalidation
Cached model weights are bundled at build time. Thus, the only way to re-cache weights is to trigger a rebuild by creating a new deployment withtruss push
. There is not currently a mechanism for invalidating cached model weights on an existing model.Hugging Face 🤗
For any public Hugging Face repo, you don’t need to do anything else. Adding themodel_cache
key with an appropriate repo_id
should be enough.
However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there’s a few steps you need to take:
1
Get Hugging Face API Key
Grab an API key from Hugging Face with
read
access. Make sure you have access to the model you want to serve.2
Add it to Baseten Secrets Manager
Paste your API key in your secrets manager in Baseten under the key
hf_access_token
. You can read more about secrets here.3
Update Config
In your Truss’s Make sure that the key
config.yml
, add the following code:config.yml
secrets
only shows up once in your config.yml
.Weights will be cached in the default Hugging Face cache directory,
~/.cache/huggingface/hub/models--{your_model_name}/
. You can change this directory by setting the HF_HOME
or HUGGINGFACE_HUB_CACHE
environment variable in your config.yml
.Read more here.Google Cloud Storage
Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance. Yourmodel_cache
should look something like this:
config.yml
service_account.json
and add it to the data
directory of your Truss.
Your file structure should look something like this:
If you are using version control, like git, for your Truss, make sure to add
service_account.json
to your .gitignore
file. You don’t want to accidentally expose your service account key./app/model_cache/{your_bucket_name}
.
Amazon Web Services S3
Another popular cloud storage option for hosting model weights is AWS S3, especially if you’re already using AWS services. Yourmodel_cache
should look something like this:
config.yml
aws_access_key_id
, aws_secret_access_key
, and aws_region
in your AWS dashboard. Create a file named s3_credentials.json
. Inside this file, add the credentials that you identified earlier as shown below. Place this file into the data
directory of your Truss.
The key aws_session_token
can be included, but is optional.
Here is an example of how your s3_credentials.json
file should look:
If you are using version control, like git, for your Truss, make sure to add
s3_credentials.json
to your .gitignore
file. You don’t want to accidentally expose your service account key./app/model_cache/{your_bucket_name}
.