Overview of caching in Cloud Storage FUSE

Cloud Storage FUSE provides four types of optional caching to help increase the performance of data retrieval: file caching, stat caching, type caching, and list caching.

File caching overview

The Cloud Storage FUSE file cache is a client-based read cache that serves repeat file reads from a faster cache storage of your choice.

Benefits of file caching

  • Improved performance: file caching improves latency and throughput by serving reads directly from the cache media. Small and random I/O operations can be significantly faster when served from the cache.

  • Use existing capacity: file caching can use existing provisioned machine capacity for your cache directory without incurring charges for additional storage. This includes Local SSDs that come bundled with Cloud GPUs machine types such as a2-ultragpu and a3-highgpu, Persistent Disk (the boot disk used by each VM), or in-memory tmpfs.

  • Reduced charges: cache hits are served locally and don't incur Cloud Storage operation or network charges.

  • Improved total cost of ownership for AI and ML training: file caching increases Cloud GPUs and Cloud TPU utilization by loading data faster which reduces time to training and provides a greater price-performance ratio for AI and ML training workloads.

Enable and configure the file cache

The file cache is disabled by default and can be enabled and configured using a Cloud Storage FUSE configuration file. You can control caching behavior using the following fields; a sample configuration follows the list:

  • max-size-mb: controls the maximum capacity in your cache directory that cached data can occupy. By default, the max-size-mb field is set to let cached data grow until it occupies all the available capacity in your cache directory.

  • cache-dir: specifies a directory for storing file cache data. Note that specifying a cache directory is a prerequisite for enabling the file cache.

  • ttl-secs: determines when cached data becomes stale and needs to be refreshed from Cloud Storage. By default, the ttl-secs field is set to 60 seconds, after which cached entries expire and are refreshed from Cloud Storage. We recommend increasing this value.

    To learn how to control cache data invalidation, see Configuring cache data invalidation. For more information about the eviction of cached data, see Eviction.

  • enable-parallel-downloads: accelerates read performance for large files over 1 GB in size, including first-time reads, by using multiple workers to download a file in parallel using the file cache directory as a prefetch buffer. We recommend enabling parallel downloads for serving and checkpoint restore operations. For more information on enabling and configuring parallel downloads, see Configure parallel downloads.
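
The following sketch shows how these fields might be combined in a single configuration file. The cache directory path and values are illustrative, and the section layout (file-cache, metadata-cache) follows the Cloud Storage FUSE configuration file reference; check the reference for your gcsfuse version.

    # Illustrative Cloud Storage FUSE configuration file (for example, config.yaml).
    cache-dir: /mnt/local-ssd/gcsfuse-cache    # prerequisite for enabling the file cache
    file-cache:
      max-size-mb: -1                  # default: let cached data grow to the directory's capacity
      enable-parallel-downloads: true  # recommended for serving and checkpoint restore
    metadata-cache:
      ttl-secs: 600                    # example: raise the 60-second default for repeat reads

You would then pass the file at mount time, for example with gcsfuse --config-file=config.yaml BUCKET_NAME MOUNT_POINT.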

Random and partial reads

If the first file read operation starts from the beginning of the file, at offset 0, the Cloud Storage FUSE file cache ingests and loads the entire file into the cache, even if you're only reading a small subset of the file. This lets subsequent random or partial reads of the same object be served directly from the cache.

If a file's first read operation starts from anywhere other than offset 0, Cloud Storage FUSE, by default, doesn't trigger an asynchronous full file fetch. To change this behavior so that Cloud Storage FUSE ingests a file to the cache upon an initial random read, set the cache-file-for-range-read flag to true. We recommend that you enable the cache-file-for-range-read flag if many different random or partial read operations are performed on the same object.
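
As a sketch, turning this behavior on in the configuration file could look like the following; the field name comes from this section, and its placement under file-cache follows the configuration file reference.

    file-cache:
      cache-file-for-range-read: true   # ingest the whole file even when the first read starts at a non-zero offset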

Eviction

The eviction of cached metadata and data is based on a least recently used (LRU) algorithm that begins once the space threshold configured by the max-size-mb limit is reached. If an entry expires based on its TTL, a Get metadata call is first made to Cloud Storage, which is subject to network latencies. Because data and metadata are managed separately, you might experience one being evicted or invalidated and not the other.

Performance

Cloud Storage FUSE caching works with any user-specified directory that's backed by your choice of storage, such as Local SSD, Persistent Disk, in-memory tmpfs, or Filestore. Cloud Storage FUSE cache performance matches that of the underlying storage used by the cache, with minimal overhead. To learn more about caching performance, see Cloud Storage FUSE caching performance and best practices.
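
For example, to back the cache with in-memory tmpfs rather than disk, you might create a RAM-backed directory and point cache-dir at it. The mount point and size below are illustrative.

    # Illustrative: create a 64 GiB RAM-backed cache directory (size is an example value).
    sudo mkdir -p /mnt/gcsfuse-ram-cache
    sudo mount -t tmpfs -o size=64g tmpfs /mnt/gcsfuse-ram-cache
    # Then set cache-dir: /mnt/gcsfuse-ram-cache in the Cloud Storage FUSE configuration file.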

Persistence

Cloud Storage FUSE caches aren't persisted across unmounts and restarts. For file caching, while the metadata entries needed to serve files from the cache are evicted on unmounts and restarts, data in the file cache might still be present in the cache directory. You should delete data in the file cache directory after unmounts or restarts.
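
A minimal cleanup sketch, assuming a mount point of /mnt/my-bucket and a cache directory of /mnt/local-ssd/gcsfuse-cache (both hypothetical paths):

    # Unmount the bucket, then remove any leftover file cache data.
    fusermount -u /mnt/my-bucket
    rm -rf /mnt/local-ssd/gcsfuse-cache/*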

Security

When you enable caching, Cloud Storage FUSE uses the cache directory you specified using the cache-dir field as the underlying directory for the cache to persist files from your Cloud Storage bucket in an unencrypted format. Any user or process that has access to this cache directory can access these files. We recommend restricting access to this directory.
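
For example, you might limit the cache directory to the user that runs Cloud Storage FUSE; the path and user below are illustrative.

    # Allow only the mounting user to read the unencrypted cached files.
    sudo chown gcsfuse-user:gcsfuse-user /mnt/local-ssd/gcsfuse-cache
    sudo chmod 700 /mnt/local-ssd/gcsfuse-cache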

Direct or multiple access to the file cache

Using a process other than Cloud Storage FUSE to access or modify a file in the cache directory can lead to data corruption. Cloud Storage FUSE caches are specific to each Cloud Storage FUSE running process with no awareness across different Cloud Storage FUSE processes running on the same or different machines. Therefore, we don't recommend using the same cache directory for different Cloud Storage FUSE processes.

If multiple Cloud Storage FUSE processes need to run on the same machine, give each Cloud Storage FUSE process its own cache directory (a sketch of this approach follows the list), or use one of the following methods to ensure your data doesn't get corrupted:

  • Mount all buckets with a shared cache: use dynamic mounting to mount all buckets you have access to in a single process with a shared cache. To learn more, see Cloud Storage FUSE dynamic mounting.

  • Enable caching on a specific bucket: enable caching on only a specified bucket using static mounting. To learn more, see Cloud Storage FUSE static mounting.

  • Cache only a specific folder or directory: mount and cache only a specific bucket-level folder instead of mounting an entire bucket. To learn more, see Mount a directory within a bucket.
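
As a sketch of the per-process approach, each mount below uses its own configuration file, and each configuration file specifies a different cache-dir. The bucket names, paths, and file names are illustrative.

    # Each Cloud Storage FUSE process gets a dedicated cache directory
    # because each configuration file sets a different cache-dir value.
    gcsfuse --config-file=/etc/gcsfuse/training-data.yaml training-data-bucket /mnt/training-data
    gcsfuse --config-file=/etc/gcsfuse/checkpoints.yaml checkpoint-bucket /mnt/checkpoints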

Stat caching overview

The Cloud Storage FUSE stat cache is a cache for object metadata that improves performance for operations specific to file attributes such as size, modification time, or permissions. Using the stat cache improves latency by performing these operations from cached data instead of sending a stat object request to Cloud Storage. By default, the stat cache is enabled with a stat-cache-max-size-mb value of 32 MB and a ttl-secs value of 60 seconds. We recommend increasing both values. To learn more about stat caching, see the Semantics documentation on GitHub.
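
A sketch of raising the stat cache defaults in the configuration file; the placement of these fields under metadata-cache follows the configuration file reference, and the values are examples.

    metadata-cache:
      stat-cache-max-size-mb: 128   # example: raise from the 32 MB default
      ttl-secs: 600                 # example: raise from the 60-second default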

Type caching overview

The Cloud Storage FUSE type cache is a metadata cache that accelerates performance for metadata operations specific to file or directory existence. Using the type cache improves latency by reducing the number of requests made to Cloud Storage to check whether a file or directory exists, by storing this information locally. By default, the type cache is enabled with a type-cache-max-size-mb value of 4 MB and a ttl-secs value of 60 seconds. We recommend increasing both values. To learn more about type caching, see the Semantics documentation on GitHub.
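
The type cache is sized in the same place, alongside the stat cache settings; the value below is an example.

    metadata-cache:
      type-cache-max-size-mb: 16   # example: raise from the 4 MB default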

List caching overview

The Cloud Storage FUSE list cache is a cache for directory and file list (ls) responses that improves the speed of list operations. List caching is especially useful for workloads that repeat full directory listings as part of execution, such as AI/ML training runs.

The list cache is kept in memory in the page cache, which is controlled by the kernel based on memory availability, as opposed to the stat and type caches, which are kept in your machine's memory and controlled by Cloud Storage FUSE.

Enable list caching

The list cache is disabled by default. You can enable list caching using the kernel-list-cache-ttl-secs field with one of the following values:

  • A positive value which represents the time to live (TTL) in seconds to keep the directory list response in the kernel's page cache.

  • A value of -1 to bypass entry expiration and return the list response from the cache when it's available.

To enable and configure list caching, see the Cloud Storage FUSE configuration file.
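
A sketch of enabling list caching in the configuration file; the field is the one named above, and its placement under file-system follows the configuration file reference.

    file-system:
      kernel-list-cache-ttl-secs: 300   # example: keep directory listings in the page cache for 5 minutes
      # Use -1 instead to serve listings from the cache whenever they're available.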

Configure cache invalidation

The following sections describe how to configure cache invalidation for all cache types.

File, stat, and type cache invalidation

For file, stat, and type caches, the ttl-secs field specifies the TTL in seconds for how long cached metadata is used from when it's fetched from Cloud Storage to when it expires and needs to be refreshed.

You can configure ttl-secs in a Cloud Storage FUSE configuration file.

The ttl-secs field is set to 60 by default. When you specify a value for ttl-secs that's greater than 0, the metadata for the file cache remains valid only for the amount of time you specified. For file caching, we recommend increasing the ttl-secs value based on the expected time between repeat reads while you balance consistency needs. Based on the importance and frequency of the data changing, we recommend setting the ttl-secs value as high as your workload lets you. When a metadata entry becomes invalid, subsequent reads are queried from Cloud Storage.

In addition to accepting values that represent a specific TTL in seconds before your cached metadata expires and needs to be refreshed, you can use the following values to specify how your file is read:

  • ttl-secs value of 0: ensures the file with the most up-to-date data is read by issuing a Get metadata call to Cloud Storage that checks the file it's serving from to ensure the cache is consistent. If the file in the cache is up to date, it's served directly from the cache. Specifying a value of 0 can lead to reduced performance because a call must always be made to Cloud Storage to check the metadata first. If the file is in the cache and hasn't changed, the file is served from the cache with consistency after the Get metadata call.

  • ttl-secs value of -1: ensures the file is always read from the cache if it's available, without checking for consistency. Serving files without checking for consistency can serve inconsistent data, and should only be used temporarily for workloads that run in jobs with non-changing data. For example, using a value of -1 is useful for machine learning training, where the same data is read across multiple epochs without changes.
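
For example, a training job that reads a static dataset across multiple epochs might disable consistency checks entirely; this is a sketch, and a value of -1 should only be used while the underlying objects don't change.

    metadata-cache:
      ttl-secs: -1   # always serve from the cache when available; no consistency check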

List cache invalidation

List cache invalidation is set by specifying a value greater than 0 using the kernel-list-cache-ttl-secs field. The directory list response is kept in the kernel's page cache and remains valid for the amount of time you specified. By default, the list cache is disabled and set to a value of 0. When you specify a value of -1, Cloud Storage FUSE disables list cache expiration and returns the list response from the cache when it's available.

Read path for cached data

The Cloud Storage FUSE cache accelerates repeat reads once the data has been ingested into the cache. Both first-time reads and cache misses go directly to Cloud Storage and are subject to normal Cloud Storage network latencies.

Considerations

  • Enabling file caching, stat caching, type caching, or list caching can increase performance but reduce consistency, which usually occurs when you access the same bucket using multiple clients with a high change rate. To reduce the impact on consistency, we recommend mounting buckets as read-only (see the example after this list). To learn more about caching behavior, see the Cloud Storage FUSE semantics documentation on GitHub.

  • If a file cache entry hasn't yet expired based on its TTL and the file is in the cache, the entire operation is served from the local client cache without any request being issued to Cloud Storage.

  • If a file cache entry has expired based on its TTL, a Get metadata call is first made to Cloud Storage, and if the file isn't in the cache, the file is retrieved from Cloud Storage. Both operations are subject to network latencies. If the metadata entry has been invalidated, but the file is in the cache, and its object generation has not changed, the file is served from the cache only after the Get metadata call is made to check if the data is valid.

  • If a Cloud Storage FUSE client modifies a cached file or its metadata, then the file is immediately invalidated and consistency is ensured in the following read by the same client. However, if different clients access the same file or its metadata, and its entries are cached, then the cached version of the file or metadata is read and not the updated version until the file is invalidated by that specific client's TTL setting.
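
The following sketch shows a read-only mount with a configuration file, as mentioned in the first consideration; the bucket name, mount point, and configuration file path are illustrative.

    # Mount the bucket read-only to reduce consistency impact when
    # multiple clients access the same bucket with a high change rate.
    gcsfuse -o ro --config-file=/etc/gcsfuse/cache.yaml my-training-bucket /mnt/training-data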

What's next