
Alluxio 3.6 improves AI model distribution & checkpoint speed
Alluxio has released Alluxio Enterprise AI 3.6, adding new features aimed at improving AI model distribution, training checkpoint writing performance, and multi-tenancy for organisations managing data-intensive workloads.
AI model sizes are increasing, and distributing them from training environments to production can create both latency challenges and higher cloud costs, according to the company. Lengthy checkpoint writing processes during training further slow down AI development timelines.
Haoyuan (HY) Li, Founder and Chief Executive Officer of Alluxio, said: "We are excited to announce that we have extended our AI acceleration platform beyond model training to also accelerate and simplify the process of distributing AI models to production inference serving environments. By collaborating with customers at the forefront of AI, we continue to push the boundaries of what anyone thought possible just a year ago."
The main new feature in version 3.6 is high-performance model distribution, using the Alluxio Distributed Cache. The approach places cache infrastructure in each region, so that model files only need to be transferred from the main repository to each regional cache once, instead of once per individual server. Inference servers can then access the required models directly from this regional cache.
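The economics of this approach can be sketched in a few lines. The class and method names below are hypothetical, not Alluxio's API; the point is only that the first request per region pays the cross-region transfer cost, and every subsequent server in that region reads from the regional cache.

```python
# Illustrative sketch (assumed names, not Alluxio's API): a regional cache
# that pulls a model from the origin repository at most once, after which
# all inference servers in the region read the cached copy.

class OriginRepo:
    def __init__(self, models):
        self.models = models
        self.transfers = 0  # count of expensive cross-region transfers

    def fetch(self, name):
        self.transfers += 1
        return self.models[name]

class RegionalCache:
    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, name):
        # Only the first request in the region hits the origin; later
        # requests are served from the regional cache.
        if name not in self.store:
            self.store[name] = self.origin.fetch(name)
        return self.store[name]

origin = OriginRepo({"llama-70b": b"<model weights>"})
cache = RegionalCache(origin)
for _ in range(100):        # e.g. 100 inference servers in one region
    cache.get("llama-70b")
print(origin.transfers)     # 1 cross-region transfer instead of 100
```

With per-server fetching, the origin would serve 100 transfers; with a regional cache it serves one, which is where the latency and cloud-egress savings come from.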
The latest release builds in additional optimisations such as local caching on inference servers and more efficient use of memory pools. According to benchmarks provided by Alluxio, the AI Acceleration Platform achieved data transfer speeds of 32 GiB/s, roughly 20 GiB/s above the cluster's available network capacity of 11.6 GiB/s; throughput can exceed raw network bandwidth because repeated reads are served from the inference servers' local caches rather than over the network.
Another major addition in Alluxio Enterprise AI 3.6 is the new ASYNC write mode, introduced alongside the earlier CACHE_ONLY write mode. ASYNC write mode is designed to accelerate checkpoint writing during model training, providing up to 9 GB/s write throughput in environments with a 100 Gbps network. The improvement is achieved by directing checkpoint data to the cache first, then asynchronously writing it to the main file system, which avoids the network and storage bottlenecks that can occur with direct writes.
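The cache-first, write-behind pattern described above can be sketched as follows. This is a minimal illustration of the general technique, with hypothetical names, not Alluxio's implementation: the training loop's `write` call returns as soon as the data is in the fast cache, while a background thread drains pending checkpoints to the slower backing store.

```python
# Sketch of asynchronous write-behind for checkpoints (assumed names,
# not Alluxio's implementation).

import queue
import threading

class AsyncCheckpointWriter:
    def __init__(self, backing_store):
        self.cache = {}                 # fast tier: data lands here first
        self.backing_store = backing_store  # slow tier: durable storage
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def write(self, path, data):
        # Returns immediately after the cache write, so training is not
        # blocked on network or storage latency.
        self.cache[path] = data
        self.pending.put(path)

    def _drain(self):
        # Background thread: copy cached checkpoints to durable storage.
        while True:
            path = self.pending.get()
            self.backing_store[path] = self.cache[path]  # the slow write
            self.pending.task_done()

    def flush(self):
        # Block until all queued checkpoints have reached durable storage.
        self.pending.join()

store = {}
writer = AsyncCheckpointWriter(store)
writer.write("ckpt/step-1000", b"tensor bytes")  # returns immediately
writer.flush()
print("ckpt/step-1000" in store)  # True
```

The trade-off of any write-behind scheme is a window where data exists only in the cache; a `flush`-style barrier (or replication in the cache tier) is what bounds that risk.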
To support management and observability, Alluxio 3.6 launches a new web-based Management Console. This interface enables administrators to monitor cluster status, including cache usage and the health of coordinators and worker nodes. The console displays critical statistics such as read and write throughput and cache hit rates. It also allows administrators to manage mount tables, apply storage quotas, set priorities and time-to-live (TTL) policies, submit cache jobs, and collect diagnostics, all without requiring command-line tools.
Multi-tenancy enhancements are introduced with this release through integration with the Open Policy Agent (OPA). This feature allows administrators to set up detailed role-based access controls for different teams sharing a single Alluxio cache, increasing both security and flexibility.
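The kind of rule such a policy engine enforces can be illustrated with a toy role-to-path mapping. The roles, paths, and helper below are hypothetical; in a real deployment the equivalent rule would be expressed in OPA's Rego policy language and evaluated by OPA, not hand-coded like this.

```python
# Toy illustration of role-based access over a shared cache namespace
# (hypothetical roles and paths; real policies would live in OPA/Rego).

ROLE_BINDINGS = {
    "team-nlp":    {"read": ["/models/nlp/"],    "write": ["/ckpt/nlp/"]},
    "team-vision": {"read": ["/models/vision/"], "write": ["/ckpt/vision/"]},
}

def allowed(role, action, path):
    # A request is permitted only if the path falls under one of the
    # prefixes granted to that role for that action.
    prefixes = ROLE_BINDINGS.get(role, {}).get(action, [])
    return any(path.startswith(p) for p in prefixes)

print(allowed("team-nlp", "read", "/models/nlp/bert-base"))   # True
print(allowed("team-nlp", "write", "/ckpt/vision/run-3"))     # False
```

Keeping these rules in an external policy agent, rather than in application code, is what lets administrators change access for one team without touching the shared cache deployment.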
Version 3.6 also adds support for multi-Availability Zone failover, designed to provide consistent data access and high availability for organisations running workloads across multiple availability zones. This is intended to bolster resilience and keep data accessible even if one zone suffers an outage.
Another addition is Virtual Path Support in FUSE, offering the capability to define custom access paths for users and applications. This provides a layer of abstraction over the physical storage locations, potentially simplifying how teams access distributed data resources.