# Clark Hash：神经网络嵌入向量的无状态稀疏Johnson-Lindenstrauss量化

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-05-27 08:00
- AIHOT 分数：59
- AIHOT 链接：https://aihot.virxact.com/items/cmpp8dvu20creslv4c1gnzzrd
- 原文链接：https://arxiv.org/abs/2605.28034

## AI 摘要

Clark Hash是一种用于紧凑存储神经网络嵌入向量的无状态编解码方法。在默认的384维句子嵌入设置下，它将一个余弦搜索向量存储为48字节的固定宽度标量量化码，相比使用f32格式的密集存储（需1536字节），实现了32倍的压缩。该方法无需训练过程、学习码本或预先计算语料库统计信息。基于多语言MiniLM编码器的评估显示，其48字节草稿与密集余弦分数在STS17和STS22测试集上的宏皮尔逊相关系数分别达到了0.910和0.946。

## 正文

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
