API Reference
Functions
enable_vectorization()
Enable automatic vectorization for a table column.
SELECT pgedge_vectorizer.enable_vectorization(
source_table REGCLASS,
source_column NAME,
chunk_strategy TEXT DEFAULT NULL,
chunk_size INT DEFAULT NULL,
chunk_overlap INT DEFAULT NULL,
embedding_dimension INT DEFAULT NULL,
chunk_table_name TEXT DEFAULT NULL,
source_pk NAME DEFAULT NULL
);
Parameters:
source_table: Table to vectorizesource_column: Column containing textchunk_strategy: Chunking method (token_based, semantic, markdown)chunk_size: Target chunk size in tokenschunk_overlap: Overlap between chunks in tokensembedding_dimension: Vector dimension. When NULL (the default), the dimension is auto-detected by making a probe call to the configured embedding provider/model. Can be set explicitly to override auto-detection.chunk_table_name: Custom chunk table name (default:{table}_{column}_chunks)source_pk: Primary key column to use as the document identifier in the chunk table. When NULL (the default), the primary key column name and type are auto-detected from the table's primary key index viapg_index. Set explicitly to use a specific column (e.g.,'external_id').
Primary Key Handling:
- Auto-detection: When
source_pkis NULL, the primary key column name and type are detected frompg_index. The chunk table'ssource_idcolumn is created with the matching type (e.g.,UUID,BIGINT,TEXT,VARCHAR(26)). - Supported types: Any single-column primary key type —
UUID,BIGSERIAL/BIGINT,SERIAL/INTEGER,TEXT,VARCHAR(n), etc. - Composite keys: Auto-detection does not support composite (multi-column) primary keys. However, you can vectorize a composite-PK table by passing
source_pkto select one column explicitly (e.g.,source_pk := 'item_id'). - No primary key: Tables without a primary key must specify
source_pkexplicitly. - Override: Pass
source_pkto use a different column than the table's actual primary key (e.g., anexternal_id UUIDcolumn). - Uniqueness requirement: The column specified by
source_pkmust contain globally unique values. The chunk table enforces aUNIQUE(source_id, chunk_index)constraint, so duplicatesource_pkvalues will cause conflicts.
Behavior:
- Creates chunk table, indexes, and trigger automatically
- Automatically processes all existing rows with non-empty content
- Future INSERT/UPDATE operations will be automatically vectorized
- Multiple columns can be vectorized independently on the same table
Content Handling:
- Whitespace trimming: Leading and trailing whitespace is automatically trimmed before processing
- Empty content: NULL, empty strings, or whitespace-only content will not create chunks
- Updates to empty: When content is updated to NULL or empty, existing chunks are deleted
- Unchanged content: UPDATE operations with identical content are skipped for efficiency
- Multiple columns: Each column gets its own chunk table (
{table}_{column}_chunks) and trigger
disable_vectorization()
Disable vectorization for a table column.
SELECT pgedge_vectorizer.disable_vectorization(
source_table REGCLASS,
source_column NAME DEFAULT NULL,
drop_chunk_table BOOLEAN DEFAULT FALSE
);
Parameters:
source_table: Table to disable vectorization onsource_column: Column to disable (NULL = disable all columns)drop_chunk_table: Whether to drop the chunk table
chunk_text()
Manually chunk text content.
SELECT pgedge_vectorizer.chunk_text(
content TEXT,
strategy TEXT DEFAULT NULL,
chunk_size INT DEFAULT NULL,
overlap INT DEFAULT NULL
);
Returns: TEXT[] array of chunks
generate_embedding()
Generate an embedding vector from query text.
SELECT pgedge_vectorizer.generate_embedding(
query_text TEXT
);
Parameters:
query_text: Text to generate an embedding for
Returns: vector - The embedding vector using the configured provider
Example:
-- Generate an embedding for a search query
SELECT
d.id,
c.content,
c.embedding <=> pgedge_vectorizer.generate_embedding('machine learning tutorials') AS distance
FROM documents d
JOIN documents_content_chunks c ON d.id = c.source_id
ORDER BY distance
LIMIT 5;
Note: This function calls the embedding provider synchronously, so it will wait for the API response. For large-scale batch operations, use the automatic vectorization features instead.
detect_embedding_dimension()
Detect the embedding dimension of the currently configured provider/model.
SELECT pgedge_vectorizer.detect_embedding_dimension();
Returns: INT - The number of dimensions in the embedding vector
This function generates a probe embedding using the configured provider and model, and returns the dimension of the resulting vector. It is called automatically by enable_vectorization() when embedding_dimension is not specified.
retry_failed()
Retry failed queue items.
SELECT pgedge_vectorizer.retry_failed(
max_age_hours INT DEFAULT 24
);
Returns: Number of items reset to pending
clear_completed()
Remove old completed items from queue.
SELECT pgedge_vectorizer.clear_completed(
older_than_hours INT DEFAULT 24
);
Returns: Number of items deleted
Note: Workers automatically clean up completed items based on pgedge_vectorizer.auto_cleanup_hours. Manual cleanup is only needed if you want to clean up more frequently or if automatic cleanup is disabled.
reprocess_chunks()
Queue existing chunks without embeddings for processing.
SELECT pgedge_vectorizer.reprocess_chunks(
chunk_table_name TEXT
);
Parameters:
chunk_table_name: Name of the chunk table to reprocess
Returns: Number of chunks queued
Example:
-- Reprocess chunks that don't have embeddings yet
SELECT pgedge_vectorizer.reprocess_chunks('product_docs_content_chunks');
recreate_chunks()
Delete all chunks and recreate from source table (complete rebuild).
SELECT pgedge_vectorizer.recreate_chunks(
source_table_name REGCLASS,
source_column_name NAME
);
Parameters:
source_table_name: Source table with the original datasource_column_name: Column that was vectorized
Returns: Number of source rows processed
Example:
-- Completely rebuild all chunks and embeddings
SELECT pgedge_vectorizer.recreate_chunks('product_docs', 'content');
Note: This function deletes all existing chunks and queue items, then triggers re-chunking and re-embedding for all rows. Use with caution.
hybrid_search()
Run a hybrid BM25 + dense vector search and merge results with Reciprocal Rank
Fusion (RRF). Requires pgedge_vectorizer.enable_hybrid = true.
SELECT * FROM pgedge_vectorizer.hybrid_search(
p_source_table REGCLASS,
p_query TEXT,
p_limit INT DEFAULT 10,
p_alpha FLOAT8 DEFAULT 0.7,
p_rrf_k INT DEFAULT 60,
p_source_column NAME DEFAULT NULL
);
Parameters:
p_source_table: The source table that was vectorizedp_query: Search query textp_limit: Maximum number of results to returnp_alpha: Balance between dense and sparse results (0.0= pure keyword,1.0= pure semantic,0.7= default)p_rrf_k: RRF smoothing constant (higher values reduce the influence of rank position)p_source_column: Source column name. Required when a table has multiple vectorized columns; optional otherwise
Returns a table with columns:
source_id(TEXT): Primary key of the source row (cast to text for compatibility with all PK types)chunk(TEXT): The matching text chunkdense_rank(INT): Rank from dense vector search (9999 if not found)sparse_rank(INT): Rank from BM25 keyword search (9999 if not found)rrf_score(FLOAT8): Combined RRF score (higher is better)
Example:
SELECT source_id, chunk, dense_rank, sparse_rank, rrf_score
FROM pgedge_vectorizer.hybrid_search(
p_source_table := 'articles'::regclass,
p_query := 'PostgreSQL replication',
p_limit := 5,
p_alpha := 0.7
);
hybrid_search_simple()
Convenience wrapper around hybrid_search() that returns only the source ID,
chunk text, and combined score.
SELECT * FROM pgedge_vectorizer.hybrid_search_simple(
p_source_table REGCLASS,
p_query TEXT,
p_limit INT DEFAULT 10,
p_source_column NAME DEFAULT NULL
);
Returns a table with columns: source_id, chunk, rrf_score.
Example:
SELECT * FROM pgedge_vectorizer.hybrid_search_simple(
'articles'::regclass, 'PostgreSQL replication', 5
);
bm25_query_vector()
Compute a BM25 sparse vector for a query string. Primarily used internally by
hybrid_search(), but available for advanced use cases.
SELECT pgedge_vectorizer.bm25_query_vector(
query TEXT,
chunk_table TEXT
);
Returns: sparsevec -- A sparse vector of BM25 scores using IDF statistics
from the specified chunk table.
bm25_avg_doc_len()
Return the average document length (in tokens) for a chunk table.
SELECT pgedge_vectorizer.bm25_avg_doc_len(chunk_table TEXT);
Returns: FLOAT8
bm25_tokenize()
Tokenize text using the BM25 tokenizer (lowercase, remove stopwords, deduplicate). Useful for debugging and testing.
SELECT pgedge_vectorizer.bm25_tokenize(query TEXT);
Returns: TEXT[] -- Array of distinct non-stopword terms.
show_config()
Display all pgedge_vectorizer configuration settings.
SELECT * FROM pgedge_vectorizer.show_config();
Returns a table with setting and value columns showing all GUC parameters.
Views
queue_status
Summary of queue items by status.
SELECT * FROM pgedge_vectorizer.queue_status;
Columns:
chunk_table: Table namestatus: Item statuscount: Number of itemsoldest: Oldest item timestampnewest: Newest item timestampavg_processing_time_secs: Average processing time
failed_items
Failed items with error details.
SELECT * FROM pgedge_vectorizer.failed_items;
pending_count
Count of pending items.
SELECT * FROM pgedge_vectorizer.pending_count;