Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache

Published on 2024-06-27



[PDF] [Site] [Kimi]

The context window within a transformer provides a form of active memory for the current task, which can be useful for few-shot learning and conditional generation, both of which depend heavily on previous context tokens. However, as the context length grows, the computational cost increases quadratically. Recent works have shown that saving a few initial tokens along with a fixed-size sliding window leads to stable streaming generation with linear complexity in transformer-based Large Language Models (LLMs). However, they make suboptimal use of the fixed window by naively evicting all tokens unconditionally from the key-value (KV) cache once they reach the end of the window, resulting in tokens being forgotten and no longer able to affect subsequent predictions. To overcome this limitation, we propose a novel mechanism for storing longer sliding window contexts with the same total cache size by keeping separate cascading sub-cache buffers whereby each subsequent buffer conditionally accepts a fraction of the relatively more important tokens evicted from the previous buffer. Our method results in a dynamic KV cache that can store tokens from the more distant past than a fixed, static sliding window approach. Our experiments show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size. Additionally, we provide an efficient implementation that improves the KV cache latency from 1.33ms per caching operation to 0.54ms, a 59% speedup over previous work.
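The cascading mechanism described in the abstract lends itself to a short sketch. The Python snippet below is my own illustration, not the authors' implementation: the class name `CascadingKVCache`, the per-buffer sizes, and the mean-importance threshold are all assumptions made for clarity, standing in for the paper's attention-score-based acceptance rule.

```python
from collections import deque

class CascadingKVCache:
    """Toy sketch of a cascading KV cache; an illustration, not the authors' code.

    New tokens enter the first (most recent) sub-cache. When a sub-cache
    overflows, its oldest entry is either forgotten or, if its importance
    score beats the mean importance of that sub-cache, cascaded into the
    next sub-cache. A few initial "sink" tokens are kept permanently.
    """

    def __init__(self, num_caches=4, cache_size=16, num_sink=4):
        self.num_sink = num_sink
        self.sink = []                                   # never evicted
        self.caches = [deque() for _ in range(num_caches)]
        self.cache_size = cache_size                     # per sub-cache capacity

    def add(self, token_id, importance):
        # Retain the first few tokens unconditionally (attention-sink tokens).
        if len(self.sink) < self.num_sink:
            self.sink.append(token_id)
            return
        item = (token_id, importance)
        for cache in self.caches:
            cache.append(item)
            if len(cache) <= self.cache_size:
                return                                   # fits at this level
            evicted = cache.popleft()                    # oldest entry overflows
            # Conditional acceptance: a stand-in mean-importance threshold
            # decides whether the evicted token cascades or is forgotten.
            threshold = sum(imp for _, imp in cache) / len(cache)
            if evicted[1] < threshold:
                return                                   # dropped for good
            item = evicted                               # try the next sub-cache

    def retained_tokens(self):
        # Sink tokens first, then the oldest sub-caches, then the newest window.
        older = [t for cache in reversed(self.caches) for t, _ in cache]
        return self.sink + older
```

In the paper, the importance signal comes from attention scores and the buffers hold key/value tensors rather than token ids, so this sketch only mirrors the control flow: with the same total cache size, relatively important tokens can survive far longer than a single fixed sliding window would allow.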

Last updated on 2024-08-02