By John Doe, November 21, 2025
pg_tokenizer is a standalone, multilingual, advanced tokenizer for full-text search and vectorized semantic search in PostgreSQL.

Introduction
Tokenization, the process of breaking text into meaningful units (tokens), is extremely complex and highly dependent on the specific language and use case. Supporting many languages, each with its own rules, custom dictionaries, stemming, stopwords, and tokenization strategies, inside a single extension becomes increasingly cumbersome.
pg_tokenizer handles these language-specific needs directly with dedicated tokenizers such as Jieba (Chinese) and Lindera (Japanese), as well as powerful LLM-based multilingual tokenizers (such as Gemma2 and LLMLingua2) trained on large datasets covering a wide range of languages.
Examples
1. Using a pre-trained multilingual LLM tokenizer (LLMLingua2)
LLM-based tokenizers are well suited to text in many different languages:
-- Enable the extensions (if not already done)
CREATE EXTENSION IF NOT EXISTS vchord_bm25;
CREATE EXTENSION IF NOT EXISTS pg_tokenizer;
-- Update search_path for the first time
ALTER SYSTEM SET search_path TO "$user", public, tokenizer_catalog, bm25_catalog;
SELECT pg_reload_conf();
-- Create a tokenizer configuration using the LLMLingua2 model
SELECT create_tokenizer('llm_tokenizer', $$
model = "llmlingua2"
$$);
-- Tokenize some English text
SELECT tokenize('PostgreSQL is a powerful, open source database.', 'llm_tokenizer');
-- Output: {2795,7134,158897,83,10,113138,4,9803,31344,63399,5} -- Example token IDs
-- Tokenize some Spanish text (LLMLingua2 handles multiple languages)
SELECT tokenize('PostgreSQL es una potente base de datos de código abierto.', 'llm_tokenizer');
-- Output: {2795,7134,158897,198,220,105889,3647,8,13084,8,55845,118754,5} -- Example token IDs
-- Integrate with a table
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
INSERT INTO documents (passage) VALUES ('PostgreSQL is a powerful, open source database.');
UPDATE documents
SET embedding = tokenize(passage, 'llm_tokenizer')
WHERE id = 1; -- Or process the whole table
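To actually search, the bm25vector column can be ranked with the companion vchord_bm25 extension. The sketch below is a minimal example assuming vchord_bm25's bm25 index access method, its <&> ranking operator, and the to_bm25query helper; the index name documents_embedding_bm25 is illustrative, and exact signatures may vary between versions, so check the VectorChord-BM25 documentation.
-- Build a BM25 index over the embedding column (bm25_ops operator class assumed)
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);
-- Rank documents against a tokenized query; ascending order puts the best matches first
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('open source database', 'llm_tokenizer')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;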
2. Creating a custom tokenizer with filters (e.g., for English)
The example below defines a custom pipeline with lowercasing, Unicode normalization, skipping of non-alphanumeric tokens, NLTK English stopwords, and the Porter2 stemmer. It then trains a model automatically and sets up a trigger so that text is tokenized on insert/update.
CREATE TABLE articles (
id SERIAL PRIMARY KEY,
content TEXT,
embedding bm25vector
);
-- Define a custom text analysis pipeline
SELECT create_text_analyzer('english_analyzer', $$
pre_tokenizer = "unicode_segmentation" # Basic word splitting
[[character_filters]]
to_lowercase = {} # Lowercase everything
[[character_filters]]
unicode_normalization = "nfkd" # Normalize Unicode
[[token_filters]]
skip_non_alphanumeric = {} # Remove punctuation-only tokens
[[token_filters]]
stopwords = "nltk_english" # Use built-in English stopwords
[[token_filters]]
stemmer = "english_porter2" # Apply Porter2 stemming
$$);
-- Create tokenizer, custom model based on 'articles.content', and trigger
SELECT create_custom_model_tokenizer_and_trigger(
tokenizer_name => 'custom_english_tokenizer',
model_name => 'article_model',
text_analyzer_name => 'english_analyzer',
table_name => 'articles',
source_column => 'content',
target_column => 'embedding'
);
-- Now, inserts automatically generate tokens
INSERT INTO articles (content) VALUES
('VectorChord-BM25 provides advanced ranking features for PostgreSQL users.');
SELECT embedding FROM articles WHERE id = 1;
-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1}
-- A bm25vector built from the custom model and pipeline
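Because query text goes through tokenize() with the same tokenizer, stopword removal and stemming are applied consistently at search time. The call below reuses the custom_english_tokenizer created above; the token IDs in the output comment are illustrative only.
-- Tokenize ad-hoc query text with the same custom tokenizer
SELECT tokenize('advanced ranking features', 'custom_english_tokenizer');
-- Output: {3:1, 5:1, 6:1} -- Example only; actual IDs depend on the trained model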
3. Processing Chinese text with Jieba
CREATE TABLE chinese_docs (
id SERIAL PRIMARY KEY,
passage TEXT,
embedding bm25vector
);
-- Define a text analyzer using the Jieba pre-tokenizer
SELECT create_text_analyzer('jieba_analyzer', $$
[pre_tokenizer.jieba]
# Optional Jieba configurations can go here
$$);
-- Create tokenizer, custom model, and trigger for Chinese text
SELECT create_custom_model_tokenizer_and_trigger(
tokenizer_name => 'chinese_tokenizer',
model_name => 'chinese_model',
text_analyzer_name => 'jieba_analyzer',
table_name => 'chinese_docs',
source_column => 'passage',
target_column => 'embedding'
);
-- Insert Chinese text
INSERT INTO chinese_docs (passage) VALUES
('红海早过了,船在印度洋面上开驶着。'); -- Example sentence
SELECT embedding FROM chinese_docs WHERE id = 1;
-- Output: {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1}
-- A bm25vector built from Jieba segmentation
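Search queries can be segmented with the same Jieba-backed tokenizer so that query terms line up with the indexed tokens. The call below reuses the chinese_tokenizer created above; the output shown is illustrative only.
-- Tokenize a Chinese query with the same Jieba-backed tokenizer
SELECT tokenize('印度洋', 'chinese_tokenizer');
-- Output: {5:1} -- Example only; actual IDs depend on the trained model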
References
pg_tokenizer: https://github.com/tensorchord/pg_tokenizer.rs