By John Doe, November 21, 2025
pg_tokenizer is a standalone, multilingual, advanced tokenizer for full-text search and vectorized semantic search in PostgreSQL.

Introduction
Tokenization, the process of breaking text into meaningful units (tokens), is extremely complex and highly dependent on the specific language and use case. Supporting many languages, each with its own rules, custom dictionaries, stemming, stopwords, and tokenization strategies, inside a single extension becomes increasingly cumbersome.
pg_tokenizer handles these language-specific needs directly with dedicated tokenizers such as Jieba (Chinese) and Lindera (Japanese), as well as powerful LLM-based multilingual tokenizers (such as Gemma2 and LLMLingua2) trained on large datasets covering a wide range of languages.
Examples
1. Using a pre-trained multilingual LLM tokenizer (LLMLingua2)
LLM-based tokenizers are well suited to text in many different languages:
-- Enable the extensions (if not already done)
CREATE EXTENSION IF NOT EXISTS vchord_bm25;
CREATE EXTENSION IF NOT EXISTS pg_tokenizer;
-- Update search_path for the first time
ALTER SYSTEM SET search_path TO "$user", public, tokenizer_catalog, bm25_catalog;
SELECT pg_reload_conf();
-- Create a tokenizer configuration using the LLMLingua2 model
SELECT create_tokenizer('llm_tokenizer', $$
model = "llmlingua2"
$$);
-- Tokenize some English text
SELECT tokenize('PostgreSQL is a powerful, open source database.', 'llm_tokenizer');
-- Output: {2795,7134,158897,83,10,113138,4,9803,31344,63399,5} -- Example token IDs
-- Tokenize some Spanish text (LLMLingua2 handles multiple languages)
SELECT tokenize('PostgreSQL es una potente base de datos de código abierto.', 'llm_tokenizer');
-- Output: {2795,7134,158897,198,220,105889,3647,8,13084,8,55845,118754,5} -- Example token IDs
-- Integrate with a table
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
INSERT INTO documents (passage) VALUES ('PostgreSQL is a powerful, open source database.');
UPDATE documents
SET embedding = tokenize(passage, 'llm_tokenizer')
WHERE id = 1; -- Or process the whole table
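To actually search, the bm25vector column can be ranked with the companion vchord_bm25 extension. The sketch below is a minimal example assuming vchord_bm25's bm25 index access method, its <&> ranking operator, and the to_bm25query helper; the index name documents_embedding_bm25 is illustrative, and exact signatures may vary between versions, so check the VectorChord-BM25 documentation.
-- Build a BM25 index over the embedding column (bm25_ops operator class assumed)
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);
-- Rank documents against a tokenized query; ascending order puts the best matches first
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('open source database', 'llm_tokenizer')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;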
2. Creating a custom tokenizer with filters (e.g., for English)
The example below defines a custom pipeline with lowercasing, Unicode normalization, skipping of non-alphanumeric tokens, NLTK English stopwords, and the Porter2 stemmer. It then trains a model automatically and sets up a trigger so that text is tokenized on insert/update.
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
content TEXT,
embedding bm25vector
);
-- Define a custom text analysis pipeline
SELECT create_text_analyzer('english_analyzer', $$
pre_tokenizer = "unicode_segmentation" # Basic word splitting
[[character_filters]]
to_lowercase = {} # Lowercase everything
[[character_filters]]
unicode_normalization = "nfkd" # Normalize Unicode
[[token_filters]]
skip_non_alphanumeric = {} # Remove punctuation-only tokens
[[token_filters]]
stopwords = "nltk_english" # Use built-in English stopwords
[[token_filters]]
stemmer = "english_porter2" # Apply Porter2 stemming
$$);
-- Create tokenizer, custom model based on 'articles.content', and trigger
SELECT create_custom_model_tokenizer_and_trigger(
tokenizer_name => 'custom_english_tokenizer',
model_name => 'article_model',
text_analyzer_name => 'english_analyzer',
table_name => 'articles',
source_column => 'content',
target_column => 'embedding'
);
-- Now, inserts automatically generate tokens
INSERT INTO articles (content) VALUES
('VectorChord-BM25 provides advanced ranking features for PostgreSQL users.');
SELECT embedding FROM articles WHERE id = 1;
-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}
-- A bm25vector built from the custom model and pipeline
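Because query text goes through tokenize() with the same tokenizer, stopword removal and stemming are applied consistently at search time. The call below reuses the custom_english_tokenizer created above; the token IDs in the output comment are illustrative only.
-- Tokenize ad-hoc query text with the same custom tokenizer
SELECT tokenize('advanced ranking features', 'custom_english_tokenizer');
-- Output: {3:1, 5:1, 6:1} -- Example only; actual IDs depend on the trained model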
3. Processing Chinese text with Jieba
CREATE TABLE chinese_docs (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
-- Define a text analyzer using the Jieba pre-tokenizer
SELECT create_text_analyzer('jieba_analyzer', $$
[pre_tokenizer.jieba]
# Optional Jieba configurations can go here
$$);
-- Create tokenizer, custom model, and trigger for Chinese text
SELECT create_custom_model_tokenizer_and_trigger(
tokenizer_name => 'chinese_tokenizer',
model_name => 'chinese_model',
text_analyzer_name => 'jieba_analyzer',
table_name => 'chinese_docs',
source_column => 'passage',
target_column => 'embedding'
);
-- Insert Chinese text
INSERT INTO chinese_docs (passage) VALUES
('红海早过了,船在印度洋面上开驶着。'); -- Example sentence
SELECT embedding FROM chinese_docs WHERE id = 1;
-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1}
-- A bm25vector built from Jieba segmentation
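Search queries can be segmented with the same Jieba-backed tokenizer so that query terms line up with the indexed tokens. The call below reuses the chinese_tokenizer created above; the output shown is illustrative only.
-- Tokenize a Chinese query with the same Jieba-backed tokenizer
SELECT tokenize('印度洋', 'chinese_tokenizer');
-- Output: {5:1} -- Example only; actual IDs depend on the trained model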
References
pg_tokenizer: https://github.com/tensorchord/pg_tokenizer.rs