A Coding Guide to Building Semantic, Hybrid, Sparse, and Quantized Vector Search Systems with pgvector
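Original article: marktechpost.com
This tutorial packages pgvector's advanced features, such as sparse vectors and quantized search, into Colab code that teams using PostgreSQL as a vector database can copy, paste, and run.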
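This tutorial builds a complete pgvector lab environment in Google Colab and shows how PostgreSQL can serve as a vector database for modern AI applications. It covers installing PostgreSQL, compiling the pgvector extension, connecting through Psycopg, and registering the vector type for smooth integration with Python, then creating and storing embeddings with SentenceTransformers.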
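In this tutorial, we build a complete pgvector playground in Google Colab and explore how PostgreSQL can serve as a powerful vector database for modern AI applications. We first install PostgreSQL, compile the pgvector extension, connect to the database through Psycopg, and register the vector type for seamless integration with Python. We then create embeddings with SentenceTransformers, store them in PostgreSQL, build an HNSW index, and run semantic search, filtered search, a distance-metric comparison, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. Along the way, we see how pgvector supports practical retrieval-augmented generation, recommendation, similarity search, and hybrid search systems using only open-source tools.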
import os
import subprocess
import sys
import time


def sh(cmd: str, check: bool = True):
    """Run a shell command, echoing it first and discarding its output."""
    print(f"  $ {cmd}")
    return subprocess.run(cmd, shell=True, check=check,
                          stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)


print("[0/10] Installing PostgreSQL + building pgvector (≈1–2 min)...")
sh("apt-get -qq update")
sh("apt-get -qq install -y postgresql postgresql-contrib "
   "postgresql-server-dev-all build-essential git")
# Clone pgvector only once so the cell can be re-run safely.
if not os.path.exists("/tmp/pgvector"):
    sh("git clone --depth 1 https://github.com/pgvector/pgvector.git /tmp/pgvector")
sh("cd /tmp/pgvector && make && make install")
sh("service postgresql start")
time.sleep(3)  # give the server a moment to start accepting connections
sh("""sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';" """)

print("[0/10] Installing Python packages...")
# Quote the extras spec so the shell does not treat [binary] as a glob pattern.
sh(f"{sys.executable} -m pip install -q pgvector 'psycopg[binary]' "
   f"sentence-transformers numpy")

This sets up the complete PostgreSQL and pgvector environment. We install the required system packages, clone and compile pgvector from source, start the PostgreSQL service, and set the database password. We also install the Python dependencies needed to connect to PostgreSQL and work with vector embeddings.
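The setup above waits a fixed three seconds with `time.sleep(3)` after starting the service. As an alternative, here is a minimal sketch of polling until the server's TCP port accepts connections. The helper name `wait_for_port` is hypothetical, and the host and port are assumed to be the same `127.0.0.1:5432` that the tutorial connects to later.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Hypothetical helper: return True once host:port accepts a TCP
    connection, or False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); wait briefly and retry.
            time.sleep(interval)
    return False
```

In the notebook this could replace the sleep with something like `wait_for_port("127.0.0.1", 5432)`. An open port only shows that a process is listening, not that PostgreSQL has finished starting up; the `pg_isready` utility that ships with PostgreSQL is the more direct check.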
import numpy as np
import psycopg
from pgvector import HalfVector, SparseVector
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

print("\n[1/10] Connecting and enabling the 'vector' extension...")
conn = psycopg.connect(
    "host=127.0.0.1 port=5432 dbname=postgres user=postgres password=postgres",
    autocommit=True,
)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
# Teach Psycopg to convert between NumPy arrays and pgvector types.
register_vector(conn)
ver = conn.execute("SELECT extversion FROM pg_extension WHERE extname='vector'").fetchone()[0]
print(f"  pgvector version: {ver}")

print("\n[2/10] Loading embedding model + encoding corpus...")
model = SentenceTransformer("all-MiniLM-L6-v2")
DIM = model.get_sentence_embedding_dimension()

# Each entry is (content, category).
corpus = [
    ("Octopuses have three hearts and blue blood.", "animals"),
    ("Transformers revolutionized natural language processing.", "technology"),
    ("Quantum computers exploit superposition and entanglement.", "technology"),
    ("GPUs accelerate deep learning by parallelizing matrix math.", "technology"),
    ("Sourdough bread relies on wild yeast and lactobacilli.", "food"),
    ("Dark chocolate contains flavonoid antioxidants.", "food"),
    ("A black hole's gravity is so strong light cannot escape.", "space"),
]
contents = [c for c, _ in corpus]
categories = [k for _, k in corpus]
# Unit-length embeddings, so cosine similarity equals the inner product.
embeddings = model.encode(contents, normalize_embeddings=True)

conn.execute("DROP TABLE IF EXISTS documents")
conn.execute(f"""
    CREATE TABLE documents (
        id bigserial PRIMARY KEY,
        content text,
        category text,
        embedding vector({DIM})
    )
""")
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO documents (content, category, embedding) VALUES (%s, %s, %s)",
        list(zip(contents, categories, [np.asarray(e) for e in embeddings])),
    )
print(f"  Inserted {len(corpus)} documents with {DIM}-d embeddings.")

We connect to PostgreSQL, enable the pgvector extension, and register vector support in Psycopg. We load a SentenceTransformers model, define a small text corpus, generate normalized embeddings, and create a PostgreSQL table to store the documents. We then insert each document with its category and vector representation so that we can run semantic search later.
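To make the effect of `normalize_embeddings=True` concrete, the following is a small pure-Python sketch of L2 normalization. The helper names (`l2_normalize`, `dot`, `cosine_similarity`) and the toy two-dimensional vectors are illustrative assumptions, not part of the tutorial or of SentenceTransformers. It shows that once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity on the raw vectors.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit L2 length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for raw (unnormalized) embeddings.
raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = l2_normalize(raw_a), l2_normalize(raw_b)

# On unit vectors the dot product and the cosine similarity coincide.
print(dot(unit_a, unit_b), cosine_similarity(raw_a, raw_b))
```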
print("\n[3/10] Building HNSW index and running semantic search...")
conn.execute(
    "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops) "
    "WITH (m = 16, ef_construction = 64)"
)
conn.execute("SET hnsw.ef_search = 100")


def semantic_search(query: str, k: int = 4):
    """Return the k documents closest to the query by cosine distance."""
    q = np.asarray(model.encode(query, normalize_embeddings=True))
    return conn.execute(
        "SELECT content, category, embedding <=> %s AS distance "
        "FROM documents ORDER BY distance LIMIT %s",
        (q, k),
    ).fetchall()


for content, cat, dist in semantic_search("animals that are unusually quick"):
    print(f"  {dist:.3f} [{cat:<10}] {content}")

print("\n[4/10] Filtered search (only category = 'space')...")
q = np.asarray(model.encode("objects with extreme gravity", normalize_embeddings=True))
rows = conn.execute(
    "SELECT content, embedding <=> %s AS distance "
    "FROM documents WHERE category = %s ORDER BY distance LIMIT 3",
    (q, "space"),
).fetchall()
for content, dist in rows:
    print(f"  {dist:.3f} {content}")

print("\n[5/10] Same query under different distance metrics (top hit each)...")
q = np.asarray(model.encode("brewing a hot caffeinated drink", normalize_embeddings=True))
# The operator comes from the fixed list below, so formatting it into the SQL is safe.
for op, label in [("<->", "L2"), ("<=>", "cosine"), ("<#>", "neg-inner"), ("<+>", "L1")]:
    content, score = conn.execute(
        f"SELECT content, embedding {op} %s AS s FROM documents ORDER BY s LIMIT 1", (q,)
    ).fetchone()
    print(f"  {label:<10} {score:+.3f} {content}")

We build an HNSW index on the embedding column for faster, more efficient vector search. We define a semantic search function that converts a query into an embedding and retrieves the most similar documents by cosine distance. We also run a search with a metadata filter and compare pgvector's distance operators: L2 (`<->`), cosine (`<=>`), negative inner product (`<#>`), and L1 (`<+>`).
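The four operators compared above can be written out as plain functions to see how they relate. This is a pure-Python sketch under stated assumptions: the function names and the toy two-dimensional unit vectors are hypothetical, and the code mirrors the mathematical definitions rather than pgvector's internals. For unit-length vectors, squared L2 distance equals twice the cosine distance, and negative inner product equals cosine distance minus one, so those three metrics rank results identically; L1 is not guaranteed to agree in general.

```python
import math


def l2_distance(a, b):
    """Euclidean distance (pgvector operator <->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (pgvector operator <=>)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def neg_inner_product(a, b):
    """Negative inner product (pgvector operator <#>)."""
    return -sum(x * y for x, y in zip(a, b))


def l1_distance(a, b):
    """Taxicab distance (pgvector operator <+>)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def top_hit(query, docs, distance):
    """Name of the document with the smallest distance to the query."""
    return min(docs, key=lambda name: distance(query, docs[name]))


# Hypothetical unit-length "embeddings" for illustration only.
docs = {"a": [0.8, 0.6], "b": [0.6, 0.8], "c": [0.0, 1.0]}
query = [1.0, 0.0]

for fn in (l2_distance, cosine_distance, neg_inner_product, l1_distance):
    print(f"{fn.__name__:<18} top hit: {top_hit(query, docs, fn)}")
```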
print("\n[6/10] Half-precision storage with halfvec...")
conn.execute(f"ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding_half halfvec({DIM})")
conn.execute("UPDATE documents SET embedding_half = embedding::halfvec")
conn.execute(
    "CREATE INDEX ON documents USING hnsw (embedding_half halfvec_cosine_ops)"
)
q_half = HalfVector(model.encode("the galaxy we live in", normalize_embeddings=True))
rows = conn.execute(
    "SELECT content, embedding_half <=> %s AS d FROM documents ORDER BY d LIMIT 2",
    (q_half,),
).fetchall()
for content, d in rows:
    print(f"  {d:.3f} {content}")

print("\n[7/10] Binary quantization (Hamming) + exact re-rank...")
# Expression index over the 1-bit-per-dimension form of each embedding.
conn.execute(
    f"CREATE INDEX ON documents "
    f"USING hnsw ((binary_quantize(embedding)::bit({DIM})) bit_hamming_ops)"
)
q = np.asarray(model.encode("parallel hardware for AI training", normalize_embeddings=True))
# Inner query: cheap Hamming-distance candidates. Outer query: exact cosine re-rank.
rerank_sql = f"""
    SELECT content, candidates.embedding <=> %(q)s AS exact_distance
    FROM (
        SELECT content, embedding
        FROM documents
        ORDER BY binary_quantize(embedding)::bit({DIM})
                 <~> binary_quantize(%(q)s)::bit({DIM})
        LIMIT 8
    ) AS candidates
    ORDER BY exact_distance
    LIMIT 3
"""
for content, d in conn.execute(rerank_sql, {"q": q}).fetchall():
    print(f"  {d:.3f} {content}")

print("\n[8/10] Native sparse vectors...")
conn.execute("DROP TABLE IF EXISTS sparse_items")
conn.execute("CREATE TABLE sparse_items (id bigserial PRIMARY KEY, embedding sparsevec(10))")
# Each SparseVector maps index -> value; the second argument is the dimension count.
sparse_data = [
    SparseVector({0: 1.0, 3: 2.0, 7: 1.5}, 10),
    SparseVector({1: 0.5, 3: 1.0, 9: 3.0}, 10),
    SparseVector({0: 0.2, 4: 2.5, 7: 0.8}, 10),
]
with conn.cursor() as cur:
    cur.executemany("INSERT INTO sparse_items (embedding) VALUES (%s)",
                    [(v,) for v in sparse_data])
query_sparse = SparseVector({0: 1.0, 7: 1.0}, 10)
# <#> returns the negative inner product, so ascending order puts the best match first.
rows = conn.execute(
    "SELECT id, embedding, embedding <#> %s AS neg_ip "
    "FROM sparse_items ORDER BY neg_ip LIMIT 3",
    (query_sparse,),
).fetchall()
for _id, vec, neg_ip in rows:
    print(f"  id={_id} inner_product={-neg_ip:.2f} nnz_indices={vec.indices()}")

We explore advanced pgvector storage and retrieval techniques beyond standard dense vectors. We cast the embeddings to half-precision vectors to reduce storage, use binary quantization with Hamming-distance search for fast candidate retrieval, and then re-rank the candidates with the full-precision vectors. We also create sparse vectors and query them by inner product, which is useful for keyword-weighted or SPLADE-style retrieval.
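The two-stage binary quantization search and the sparse inner product can both be sketched in a few lines of pure Python. This is an illustration of the ideas under stated assumptions, not pgvector's implementation: all function names and the four-dimensional toy vectors are hypothetical, while the sparse values are the same three items and query used in step 8 above.

```python
import math


def binary_quantize(vec):
    """One bit per dimension: 1 where the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a_bits, b_bits):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a_bits, b_bits))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(y * y for y in b)))


def search_with_rerank(query, docs, n_candidates, k):
    """Stage 1: shortlist by Hamming distance on quantized bits.
    Stage 2: re-rank the shortlist by exact cosine distance."""
    q_bits = binary_quantize(query)
    shortlist = sorted(
        docs, key=lambda name: hamming(binary_quantize(docs[name]), q_bits)
    )[:n_candidates]
    return sorted(shortlist, key=lambda name: cosine_distance(docs[name], query))[:k]


def sparse_inner_product(a, b):
    """Inner product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(value * b[index] for index, value in a.items() if index in b)


# Hypothetical dense toy data: x and y tie on Hamming distance, z is far away.
dense_docs = {
    "x": [0.8, 0.2, -0.1, 0.5],
    "y": [0.1, 0.9, -0.3, 0.2],
    "z": [-0.5, -0.4, 0.7, -0.1],
}
dense_query = [0.9, 0.1, -0.2, 0.4]
print(search_with_rerank(dense_query, dense_docs, n_candidates=2, k=1))

# Sparse items and query copied from step 8 of the tutorial.
sparse_items = [
    {0: 1.0, 3: 2.0, 7: 1.5},
    {1: 0.5, 3: 1.0, 9: 3.0},
    {0: 0.2, 4: 2.5, 7: 0.8},
]
sparse_query = {0: 1.0, 7: 1.0}
print([sparse_inner_product(item, sparse_query) for item in sparse_items])
```

The dense example shows why the exact re-rank matters: `x` and `y` have identical bit patterns to the query, so only the full-precision pass can tell them apart.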
print("\n[9/10] Hybrid search (vector + full-text) via RRF...")
user_query = "fast animal"
qvec = np.asarray(model.encode(user_query, normalize_embeddings=True))
# Each CTE ranks its own top 20; the final SELECT sums 1 / (60 + rank) per document.
hybrid_sql = """
    WITH semantic AS (
        SELECT id, RANK() OVER (ORDER BY embedding <=> %(qvec)s) AS rank
        FROM documents
        ORDER BY embedding <=> %(qvec)s
        LIMIT 20
    ),
    keyword AS (
        SELECT d.id,
               RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', d.content), q) DESC) AS rank
        FROM documents d, plainto_tsquery('english', %(qtext)s) AS q
        WHERE to_tsvector('english', d.content) @@ q
        ORDER BY ts_rank_cd(to_tsvector('english', d.content), q) DESC
        LIMIT 20
    )
    SELECT d.content,
           COALESCE(1.0 / (60 + semantic.rank), 0.0)
           + COALESCE(1.0 / (60 + keyword.rank), 0.0) AS rrf_score
    FROM documents d
    LEFT JOIN semantic ON d.id = semantic.id
    LEFT JOIN keyword ON d.id = keyword.id
    WHERE semantic.id IS NOT NULL OR keyword.id IS NOT NULL
    ORDER BY rrf_score DESC
    LIMIT 4
"""
for content, score in conn.execute(hybrid_sql, {"qvec": qvec, "qtext": user_query}).fetchall():
    print(f"  {score:.5f} {content}")

print("\n[10/10] Aggregating vectors with AVG (category centroid)...")
# AVG over a vector column returns the element-wise mean vector.
centroid = conn.execute(
    "SELECT AVG(embedding) FROM documents WHERE category = %s", ("food",)
).fetchone()[0]
typical = conn.execute(
    "SELECT content, embedding <=> %s AS d FROM documents "
    "WHERE category = %s ORDER BY d LIMIT 1",
    (np.asarray(centroid), "food"),
).fetchone()
print(f"  Centroid dim = {len(centroid)}")
print(f"  Most representative 'food' doc: {typical[0]}")

print("\nDone. You now have a working pgvector playground inside Colab.")
print("  Try editing `corpus`, the queries, or swap in your own embedding model.")

We combine semantic vector search with PostgreSQL full-text search using reciprocal rank fusion (RRF). We retrieve results from the semantic ranking and the keyword ranking separately, sum their reciprocal-rank scores, and produce a stronger hybrid result list. Finally, we compute the average embedding of a category and use it as a centroid to find the most representative document in that group.
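The fusion step itself is small enough to write out directly. Below is a minimal pure-Python sketch of reciprocal rank fusion with the same constant of 60 used in the SQL above. The function name `rrf_fuse` and the toy document ids are hypothetical, and the sketch assumes each input list is already ordered best-first with no ties, whereas SQL's `RANK()` gives tied rows the same rank.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids (best first).

    Each document scores 1 / (k + rank) in every list it appears in,
    and the per-list scores are summed. Returns (doc_id, score) pairs
    sorted by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings: ids ordered by the semantic and keyword searches.
semantic_ranking = [1, 2, 3]
keyword_ranking = [3, 1]

for doc_id, score in rrf_fuse([semantic_ranking, keyword_ranking]):
    print(f"doc {doc_id}: {score:.5f}")
```

A document that appears in both lists accumulates two terms, which is why documents 1 and 3 outrank document 2 here even though document 2 is second in the semantic list.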
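In summary, we now have a working pgvector-based retrieval system that runs entirely in Google Colab, with no external services or API keys. We use PostgreSQL not only as a traditional relational database but also as a flexible vector search engine that supports dense vectors, half-precision vectors, binary-quantized retrieval, sparse vectors, full-text search, and aggregation. We also saw how metadata filtering, HNSW indexing, reciprocal rank fusion, and centroid-based analysis make pgvector useful in real-world AI search pipelines.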