StreamIndex: A Memory-Bounded Streaming Top-k Method for Compressed Sparse Attention

#Tech

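DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA). At its core, a learned scoring projection (the "lightning indexer") scores the compressed keys, and only the top-k keys per query take part in the attention computation.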

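StreamIndex is a Triton implementation of the CSA pipeline. Its central component is a chunked partition-merge top-k driver that never materializes the full intermediate score tensor, which removes the memory bottleneck at the indexer step.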
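The chunked partition-merge idea can be illustrated with a minimal sketch in pure Python rather than Triton. It scores one chunk of keys at a time and merges each chunk into a running top-k, so at most `chunk_size + k` scores exist at once. The names `streaming_topk` and `score_fn` and the toy data are illustrative assumptions, not StreamIndex's actual API or kernel.

```python
import heapq
import random

def streaming_topk(score_fn, num_keys, k, chunk_size):
    """Return the indices of the k highest-scoring keys while holding
    at most chunk_size + k scores in memory at any time."""
    best = []  # running candidates as (score, index) pairs, at most k of them
    for start in range(0, num_keys, chunk_size):
        stop = min(start + chunk_size, num_keys)
        # Partition step: score only this chunk of keys.
        chunk = [(score_fn(i), i) for i in range(start, stop)]
        # Merge step: keep the k best of the old candidates plus the new chunk.
        best = heapq.nlargest(k, best + chunk)
    return sorted(i for _, i in best)

# Toy data: one query, 1,000 random key scores standing in for indexer outputs.
random.seed(0)
scores = [random.random() for _ in range(1000)]

# Reference path: materialize every score first, then select the top 16.
full = sorted(i for _, i in heapq.nlargest(16, [(s, i) for i, s in enumerate(scores)]))

# Chunked path: never holds more than 128 + 16 scores.
chunked = streaming_topk(lambda i: scores[i], len(scores), k=16, chunk_size=128)
print(chunked == full)  # True
```

Because `(score, index)` pairs form a total order, the merged result matches the materialized selection for any chunk size in this toy setting. A GPU kernel additionally has to handle batching, multiple indexer heads, and reduced-precision arithmetic, which this sketch leaves out.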

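On a single NVIDIA H200 GPU, StreamIndex runs the indexer up to sequence length S=1,048,576 with a peak HBM footprint of 6.21 GB. This is a 32x regime extension over the materialize path, which runs out of memory at S=65,536.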
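The memory pressure behind these numbers can be checked with back-of-envelope arithmetic. The sketch below uses the dimensions given in the original abstract (H_I=64 indexer heads, compression ratio m=4, FP32 scores) and assumes batch size B=1 and T=S/m compressed keys; the function name `score_tensor_bytes` is hypothetical.

```python
def score_tensor_bytes(seq_len, indexer_heads=64, compression=4,
                       batch=1, bytes_per_elem=4):
    """Size in bytes of a fully materialized [B, S, H_I, T] score tensor,
    assuming T = S / m compressed keys and FP32 (4-byte) entries."""
    compressed_keys = seq_len // compression
    return batch * seq_len * indexer_heads * compressed_keys * bytes_per_elem

GIB = 1024 ** 3
for s in (32_768, 65_536, 1_048_576):
    print(f"S={s:,}: {score_tensor_bytes(s) / GIB:,.0f} GiB")
```

For S=65,536 this gives 256 GiB, which matches the 256 GB intermediate quoted in the abstract. The tensor grows quadratically in S, so a path that holds it in full cannot reach the sequence lengths a chunked path can.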

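Experiments show stable behavior across the design space: over three 5-point sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 and minimum recall is at least 0.9980 in every cell. The chunked indexer also composes with TileLang's pipelined attention kernel: at S=262,144, the materialize indexer paired with TileLang attention runs out of memory, while the chunked indexer paired with the same kernel finishes in 1.97 s at 18.56 GB peak. The authors make no claim of a faster attention kernel.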
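The recall figures compare the index set chosen by the chunked path with the set chosen by the materialize path. A minimal sketch of such a metric follows, assuming set-overlap recall means the fraction of ground-truth indices that the streaming selection also contains; the paper's exact definition may differ, and `set_overlap_recall` is a hypothetical name.

```python
def set_overlap_recall(selected, reference):
    """Fraction of reference (ground-truth) indices also present in the selection."""
    reference = set(reference)
    if not reference:
        return 1.0  # nothing to recall, so treat it as a perfect match
    return len(reference & set(selected)) / len(reference)

exact = [3, 7, 11, 19]                             # top-k from the materialize path
print(set_overlap_recall([3, 7, 11, 19], exact))   # 1.0
print(set_overlap_recall([3, 7, 11, 20], exact))   # 0.75
```

Under this definition a recall of 1.0 means both paths selected the same keys, regardless of the order in which they are returned.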

Opening of the original (English, first 3 paragraphs only)

Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: this https URL.

※ For copyright reasons, only the first 3 paragraphs are quoted. Please read the original for the full text.
