How to cache model weights

Truss natively supports automatic caching for model weights. This is a simple yet effective strategy to enhance deployment speed and operational efficiency when it comes to cold starts and scaling beyond a single replica.

What is a “cold start”?

“Cold start” is a term used to refer to the time taken by a model to boot up after being idle. This process can become a critical factor in serverless environments, as it can significantly influence the model response time, customer satisfaction, and cost.Without caching our model’s weights, we would need to download weights every time we scale up. Caching model weights circumvents this download process. When our new instance boots up, the server automatically finds the cached weights and can proceed with starting up the endpoint.In practice, this reduces the cold start for large models to just a few seconds. For example, Stable Diffusion XL can take a few minutes to boot up without caching. With caching, it takes just under 10 seconds.

Enabling Caching for a Model

To enable caching, simply add model_cache to your config.yml with a valid repo_id. The model_cache has a few key configurations:

repo_id (required): The endpoint for your cloud bucket. Currently, we support Hugging Face and Google Cloud Storage.
revision: Points to your revision. This is only relevant if you are pulling By default, it refers to main.
allow_patterns: Only cache files that match specified patterns. Utilize Unix shell-style wildcards to denote these patterns.
ignore_patterns: Conversely, you can also denote file patterns to ignore, hence streamlining the caching process.

We recently renamed hf_cache to model_cache, but don’t worry! If you’re using hf_cache in any of your projects, it will automatically be aliased to model_cache.

Here is an example of a well written model_cache for Stable Diffusion XL. Note how it only pulls the model weights that it needs using allow_patterns.

config.yml

...
model_cache:
  - repo_id: madebyollin/sdxl-vae-fp16-fix
    allow_patterns:
      - config.json
      - diffusion_pytorch_model.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-base-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_base_1.0.safetensors
  - repo_id: stabilityai/stable-diffusion-xl-refiner-1.0
    allow_patterns:
      - "*.json"
      - "*.fp16.safetensors"
      - sd_xl_refiner_1.0.safetensors
...

Many Hugging Face repos have model weights in different formats (.bin, .safetensors, .h5, .msgpack, etc.). You only need one of these most of the time. To minimize cold starts, ensure that you only cache the weights you need.

Cache invalidation

Cached model weights are bundled at build time. Thus, the only way to re-cache weights is to trigger a rebuild by creating a new deployment with truss push. There is not currently a mechanism for invalidating cached model weights on an existing model.

There are also some additional steps depending on the cloud bucket you want to query.

Hugging Face 🤗

For any public Hugging Face repo, you don’t need to do anything else. Adding the model_cache key with an appropriate repo_id should be enough. However, if you want to deploy a model from a gated repo like Llama 2 to Baseten, there’s a few steps you need to take:

Get Hugging Face API Key

Grab an API key from Hugging Face with read access. Make sure you have access to the model you want to serve.

Add it to Baseten Secrets Manager

Paste your API key in your secrets manager in Baseten under the key hf_access_token. You can read more about secrets here.

Update Config

In your Truss’s config.yml, add the following code:

config.yml

...
secrets:
    hf_access_token: null
...

Make sure that the key secrets only shows up once in your config.yml.

If you run into any issues, run through all the steps above again and make sure you did not misspell the name of the repo or paste an incorrect API key.

Weights will be cached in the default Hugging Face cache directory, ~/.cache/huggingface/hub/models--{your_model_name}/. You can change this directory by setting the HF_HOME or HUGGINGFACE_HUB_CACHE environment variable in your config.yml.Read more here.

Google Cloud Storage

Google Cloud Storage is a great alternative to Hugging Face when you have a custom model or fine-tune you want to gate, especially if you are already using GCP and care about security and compliance. Your model_cache should look something like this:

config.yml

...
model_cache:
    repo_id: gcs://path-to-my-bucket
...

If you are accessing a public GCS bucket, you can ignore the following steps, but make sure you set appropriate permissions on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail. For a private GCS bucket, first export your service account key. Rename it to be service_account.json and add it to the data directory of your Truss. Your file structure should look something like this:

your-truss
|--model
| └── model.py
|--data
|. └── service_account.json

If you are using version control, like git, for your Truss, make sure to add service_account.json to your .gitignore file. You don’t want to accidentally expose your service account key.

Weights will be cached at /app/model_cache/{your_bucket_name}.

Amazon Web Services S3

Another popular cloud storage option for hosting model weights is AWS S3, especially if you’re already using AWS services. Your model_cache should look something like this:

config.yml

...
model_cache:
    repo_id: s3://path-to-my-bucket
...

If you are accessing a public S3 bucket, you can ignore the subsequent steps, but make sure you set an appropriate appropriate policy on your bucket. Users should be able to list and view all files. Otherwise, the model build will fail. However, for a private S3 bucket, you need to first find your aws_access_key_id, aws_secret_access_key, and aws_region in your AWS dashboard. Create a file named s3_credentials.json. Inside this file, add the credentials that you identified earlier as shown below. Place this file into the data directory of your Truss. The key aws_session_token can be included, but is optional. Here is an example of how your s3_credentials.json file should look:

{
    "aws_access_key_id": "YOUR-ACCESS-KEY",
    "aws_secret_access_key": "YOUR-SECRET-ACCESS-KEY",
    "aws_region": "YOUR-REGION"
}

Your overall file structure should now look something like this:

your-truss
|--model
| └── model.py
|--data
|. └── s3_credentials.json

When you are generating credentials, make sure that the resulting keys have at minimum the following IAM policy:

{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetObject",
                    "s3:ListObjects",
                ],
                "Effect": "Allow",
                "Resource": ["arn:aws:s3:::S3_BUCKET/PATH_TO_MODEL/*"]
            },
            {
                "Action": [
                    "s3:ListBucket",
                ],
                "Effect": "Allow",
                "Resource": ["arn:aws:s3:::S3_BUCKET"]
            }
        ]
    }

If you are using version control, like git, for your Truss, make sure to add s3_credentials.json to your .gitignore file. You don’t want to accidentally expose your service account key.

Weights will be cached at /app/model_cache/{your_bucket_name}.

Other Buckets

We can work with you to support additional bucket types if needed. If you have any suggestions, please leave an issue on our GitHub repo.

Getting started

Guides

Examples

Remotes

How to cache model weights

What is a “cold start”?

Enabling Caching for a Model

Cache invalidation

Hugging Face 🤗

Google Cloud Storage

Amazon Web Services S3

Other Buckets

Getting started

Guides

Examples

Remotes

​What is a “cold start”?

​Enabling Caching for a Model

​Cache invalidation

​Hugging Face 🤗

​Google Cloud Storage

​Amazon Web Services S3

​Other Buckets

What is a “cold start”?

Enabling Caching for a Model

Cache invalidation

Hugging Face 🤗

Google Cloud Storage

Amazon Web Services S3

Other Buckets