PEAK:AIO has introduced a dedicated solution that unifies KVCache acceleration and GPU memory expansion for large-scale artificial intelligence workloads, targeting the memory bottlenecks that arise in large language model inference and model development.
The company's new platform, built on CXL memory and integrated with Gen5 NVMe and GPUDirect RDMA, is positioned to deliver up to 150 GB/sec of sustained throughput at latencies below five microseconds. It is intended to support the growing demands of inference, agentic systems, and model development in AI deployments.
Eyal Lemberger, Chief AI Strategist and Co-Founder of PEAK:AIO, described the current landscape of AI memory requirements as evolving beyond static prompts toward more complex workloads: "Whether you are deploying agents that think across sessions or scaling toward million-token context windows, where memory demands can exceed 500GB per model, this appliance makes it possible by treating token history as memory, not storage. It is time for memory to scale like compute has."
As artificial intelligence models, particularly transformer-based architectures, grow in size and context length, AI pipelines are hitting two main barriers: KVCache inefficiency and saturation of GPU memory. According to the company, other vendors have attempted to adapt existing storage technologies or stretch NVMe further to postpone these limits.
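To see why KVCache pressure grows so quickly, consider the standard sizing rule for a transformer decoder: two tensors (key and value) per layer, each holding heads × head_dim elements per cached token. A minimal back-of-the-envelope sketch follows, using illustrative Llama-70B-class parameters with grouped-query attention rather than any figures published by PEAK:AIO:

```python
# Back-of-the-envelope KV cache sizing for a transformer decoder.
# Model parameters below are illustrative (roughly a 70B-class model
# with grouped-query attention), not specifications from PEAK:AIO.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: 2 tensors (K and V) per layer, each
    n_kv_heads * head_dim elements per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# fp16 cache, 80 layers, 8 KV heads (GQA), head_dim 128:
per_token = kv_cache_bytes(80, 8, 128, 1)            # ~320 KiB per token
million_ctx = kv_cache_bytes(80, 8, 128, 1_000_000)  # ~328 GB total

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{million_ctx / 1e9:.0f} GB for a 1M-token context")
```

Even with grouped-query attention this works out to roughly 328 GB for a million-token context; with full multi-head attention (say, 64 KV heads) the same context would need several terabytes, which is why per-model figures in the hundreds of gigabytes, such as the 500GB Lemberger cites, are plausible.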
The platform from PEAK:AIO, referred to as the 1U Token Memory Platform, adopts a token-centric architecture built specifically for scalable artificial intelligence. The company states that this approach enables KVCache reuse across multiple sessions, models, and nodes, along with expanded context windows for longer model history, GPU memory offload via CXL, and low-latency access through RDMA over NVMe-oF.
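PEAK:AIO has not published an API, but the reuse model it describes can be illustrated with a hypothetical client: KV blocks are keyed by a hash of the token prefix that produced them, so any session, model instance, or node that encounters the same prefix can fetch the cached blocks instead of recomputing them. Every name below (TokenMemoryClient, put, get) is invented for illustration:

```python
import hashlib

class TokenMemoryClient:
    """Hypothetical interface to a shared token-memory tier. In a real
    deployment the store would be remote memory reached over RDMA/CXL;
    here an in-process dict stands in for it."""

    def __init__(self):
        self._store = {}  # prefix-hash -> serialized KV blocks

    @staticmethod
    def prefix_key(token_ids: list[int]) -> str:
        # Content-addressed key: identical token prefixes map to identical
        # keys, regardless of which session or node produced them.
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def put(self, token_ids: list[int], kv_blocks: bytes) -> None:
        self._store[self.prefix_key(token_ids)] = kv_blocks

    def get(self, token_ids: list[int]) -> bytes | None:
        return self._store.get(self.prefix_key(token_ids))

# Session A computes and publishes KV state for a shared system prompt...
client = TokenMemoryClient()
prompt = [101, 2023, 2003, 1037, 2291, 25732]
client.put(prompt, b"<serialized KV tensors>")

# ...and session B (or another node) reuses it instead of re-prefilling.
assert client.get(prompt) is not None
```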
This platform diverges from traditional NVMe-based storage solutions by providing infrastructure that treats token memory as a primary resource rather than storing it as files. Teams are thus able to cache token history, attention maps, and streaming data at memory-like latency, which the company says is consistent with the performance requirements of advanced AI deployments.
PEAK:AIO's system is designed to align with NVIDIA's KVCache reuse and memory management frameworks, with direct support for users running TensorRT-LLM or Triton; the company claims this yields faster inference with minimal integration work. By leveraging CXL, the platform delivers token memory operations with RAM-like characteristics.
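For context, TensorRT-LLM already exposes KV cache block reuse on the framework side, which is the mechanism a shared token-memory tier would extend. A minimal sketch, assuming a recent TensorRT-LLM Python LLM API; names such as KvCacheConfig and enable_block_reuse may differ across versions, and the model path is illustrative:

```python
# Sketch: enabling KV cache block reuse in TensorRT-LLM's LLM API.
# Names match recent releases but may vary by version; nothing here is
# PEAK:AIO-specific.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

# Two requests sharing a long common prefix: with block reuse enabled,
# the second prefill can hit KV blocks cached by the first.
shared_prefix = "You are a support agent for ACME. Policy: ..."
outputs = llm.generate(
    [shared_prefix + " Q: How do I reset my password?",
     shared_prefix + " Q: How do I close my account?"],
    SamplingParams(max_tokens=64),
)
```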
Lemberger commented further on the design philosophy behind the platform: "While others are bending file systems to act like memory, we built infrastructure that behaves like memory, because that is what modern AI needs. At scale, it is not about saving files; it is about keeping every token accessible in microseconds. That is a memory problem, and we solved it by embracing the latest silicon layer."
The solution is fully software-defined and can be deployed on off-the-shelf servers. The company anticipates entering production by the third quarter. PEAK:AIO will offer early access and technical consultations to organisations interested in integrating the platform into their own AI infrastructure.
Mark Klarzynski, Co-Founder and Chief Strategy Officer at PEAK:AIO, highlighted the technical approach adopted by the company: "The big vendors are stacking NVMe to fake memory. We went the other way, leveraging CXL to unlock actual memory semantics at rack scale. This is the token memory fabric modern AI has been waiting for."