Building a Custom Knowledge Base
A custom knowledge base improves the accuracy of natural language queries by giving the LLM domain-specific context about your tables, columns, business rules, and query patterns. When a user asks a question in natural language, the LLM searches the knowledge base for relevant context before generating SQL. This tutorial walks you through the entire process of creating, building, and deploying a custom knowledge base.
How Custom Knowledge Bases Improve Queries
The knowledge base provides a feedback loop between your domain documentation and the LLM query engine. The following steps describe how the system uses this loop.
-
You document your schema, including tables, columns, relationships, and business rules.
-
The
pgedge-nla-kb-buildertool processes the documentation into searchable chunks with embeddings. -
The MCP server loads the custom knowledge base alongside or instead of the standard knowledge base.
-
When a user asks a natural language question, the LLM calls
search_knowledgebaseto find relevant domain context. -
The LLM uses the retrieved context to generate more accurate SQL for the question.
The more thoroughly you document your domain, the better the LLM performs when answering questions about your data.
Creating Domain Documentation
Domain documentation describes your database schema, business rules, and common query patterns. You write the documentation as Markdown files and store the files in a dedicated directory.
Writing Schema Documentation
Schema documentation describes each table, its columns, and the data the table contains. Create one Markdown file per table group or major topic.
In the following example, the Markdown file documents an e-commerce database schema with tables and common queries:
# E-Commerce Database Schema
## Orders Table
The `orders` table stores all customer purchase records.
| Column | Type | Description |
|--------|------|-------------|
| id | SERIAL | The unique order identifier. |
| customer_id | INTEGER | References the customers table. |
| status | VARCHAR(20) | The order status value. |
| total_amount | NUMERIC(10,2) | The order total in USD. |
| created_at | TIMESTAMPTZ | The order creation timestamp. |
### Common Queries
Find orders by status:
` ` `sql
SELECT * FROM orders WHERE status = 'pending';
` ` `
Calculate daily revenue:
` ` `sql
SELECT DATE(created_at) AS day, SUM(total_amount)
FROM orders
WHERE status != 'cancelled'
GROUP BY DATE(created_at);
` ` `
Writing Business Rules Documentation
Business rules documentation defines domain-specific terms and metrics that the LLM needs to understand. The LLM uses these definitions to translate natural language into correct SQL.
In the following example, the Markdown file documents revenue metrics and customer status definitions:
# Business Rules and Glossary
## Revenue Metrics
Net revenue equals the sum of order amounts excluding
cancelled and refunded orders. Gross revenue equals the
sum of all order amounts including cancelled orders.
Average order value (AOV) equals net revenue divided
by the count of completed orders.
## Status Definitions
An active customer has at least one order in the last
90 days. A churned customer has no orders in the last
180 days.
Writing Relationship Documentation
Relationship documentation describes how tables connect through foreign keys and join patterns. The LLM uses this context to generate correct JOIN clauses.
In the following example, the Markdown file documents a one-to-many relationship between customers and orders:
# Table Relationships
## Customer to Orders (One-to-Many)
Each customer can have many orders. Join customers to
their orders using the customer_id column.
` ` `sql
SELECT c.name, o.id, o.total_amount
FROM customers c
JOIN orders o ON o.customer_id = c.id;
` ` `
Supported File Formats
The KB builder processes documentation in several formats. The builder supports the following file types:
- Markdown (
.md) files work best for new documentation. - HTML (
.html,.htm) files support existing web documentation. - reStructuredText (
.rst) files support Sphinx-based documentation. - SGML (
.sgml,.sgm) files support PostgreSQL-style documentation. - DocBook XML (
.xml) files support XML-based documentation.
The builder converts all formats to Markdown before chunking and embedding the content.
Configuring the KB Builder
The KB builder reads a YAML configuration file that specifies the documentation sources and the embedding providers. The following steps walk you through the configuration process.
Step-by-Step Configuration
Follow these steps to configure and run the KB builder.
-
Create a directory for your domain documentation:
mkdir -p ~/my-project/docs -
Write your documentation files as described in the previous section. Place the files in the
docsdirectory you created. -
Create the KB builder configuration file. In the following example, the configuration uses OpenAI for embeddings:
# Output database file path database_path: "my-project-kb.db" # Directory for storing processed documentation doc_source_path: "doc-source" # Documentation sources to process sources: - local_path: "~/my-project" doc_path: "docs" project_name: "My E-Commerce App" project_version: "1.0" # Embedding provider configuration embeddings: openai: enabled: true api_key_file: "~/.openai-api-key" model: "text-embedding-3-small" dimensions: 1536 voyage: enabled: false ollama: enabled: false -
Set up your API key file for the embedding provider:
echo "sk-your-openai-key" > ~/.openai-api-key chmod 600 ~/.openai-api-key -
Build the knowledge base by running the builder:
./pgedge-nla-kb-builder \ --config my-kb-builder.yaml
The builder processes each documentation file, splits the content into chunks, generates embeddings, and stores the results in the output SQLite database.
Using Multiple Sources
You can combine your custom documentation with standard PostgreSQL documentation in a single knowledge base. The following configuration includes both a local source and a Git repository source:
sources:
# Your domain documentation
- local_path: "~/my-project"
doc_path: "docs"
project_name: "My E-Commerce App"
project_version: "1.0"
# PostgreSQL documentation for SQL reference
- git_url: "https://github.com/postgres/postgres.git"
branch: "REL_17_STABLE"
doc_path: "doc/src/sgml"
project_name: "PostgreSQL"
project_version: "17"
The builder clones the Git repository and processes both sources into the same database. The LLM can then search across all projects or filter by project name.
Using Ollama for Local Builds
Ollama provides a local embedding option that requires no API key. Use Ollama when you want to build knowledge bases without sending data to external services.
The following configuration enables Ollama as the embedding provider:
embeddings:
openai:
enabled: false
voyage:
enabled: false
ollama:
enabled: true
endpoint: "http://localhost:11434"
model: "nomic-embed-text"
Before building, pull the embedding model with the following command:
ollama pull nomic-embed-text
Then run the builder using the same build command described in the step-by-step section.
Configuring the MCP Server
After building the knowledge base, configure the MCP server
to load the database file. Add the knowledgebase section to
your server configuration file.
The following example shows the server configuration for a custom knowledge base built with OpenAI embeddings:
knowledgebase:
enabled: true
database_path: "./my-project-kb.db"
embedding_provider: "openai"
embedding_model: "text-embedding-3-small"
embedding_openai_api_key_file: "~/.openai-api-key"
The embedding provider and model in the server configuration must match the provider and model you used to build the knowledge base. The server uses the same embedding model to convert search queries into vectors for comparison.
For Ollama-based knowledge bases, use the following configuration:
knowledgebase:
enabled: true
database_path: "./my-project-kb.db"
embedding_provider: "ollama"
embedding_model: "nomic-embed-text"
Testing and Verifying Your Knowledge Base
After deploying the knowledge base, verify that the server loads the data correctly and returns relevant results. Follow these steps to test the setup.
-
Start the MCP server with the custom knowledge base configuration.
-
List the available products in the knowledge base. Use
search_knowledgebasewithlist_productsset totrue:{ "list_products": true } -
Confirm that your project name appears in the product list. The output should include the project name and version you specified in the builder configuration.
-
Search for a domain-specific term to verify the content. In the following example, the search targets your custom project:
{ "query": "how to calculate net revenue", "project_names": ["My E-Commerce App"] } -
Verify that the results contain relevant chunks from your documentation.
-
Ask a natural language question through your MCP client and observe whether the LLM uses the knowledge base context to generate accurate SQL.
If the search returns no results, confirm that the embedding provider in the server configuration matches the provider you used during the build.
Incremental Updates
The KB builder supports incremental processing when you update your documentation. The builder reprocesses only the files that changed since the last build.
Follow these steps to update an existing knowledge base.
-
Edit or add documentation files in your source directory.
-
Run the builder again with the same configuration file:
./pgedge-nla-kb-builder \ --config my-kb-builder.yaml
The builder pulls the latest changes from Git repositories and reprocesses only modified files. Unchanged files reuse their existing chunks and embeddings.
Use the --skip-updates flag to skip Git pull operations
during development:
./pgedge-nla-kb-builder \
--config my-kb-builder.yaml --skip-updates
If you enable a new embedding provider after the initial
build, use --add-missing-embeddings to generate embeddings
for the new provider without reprocessing documents:
./pgedge-nla-kb-builder \
--config my-kb-builder.yaml \
--add-missing-embeddings
Knowledge Base Database Schema
The KB builder stores data in a SQLite database with a single main table. You do not need to interact with the database directly; the MCP server handles all queries.
The chunks table stores text chunks with metadata and
embeddings. The following SQL shows the table schema:
CREATE TABLE chunks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
text TEXT NOT NULL,
title TEXT,
section TEXT,
project_name TEXT NOT NULL,
project_version TEXT NOT NULL,
file_path TEXT,
openai_embedding BLOB,
voyage_embedding BLOB,
ollama_embedding BLOB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
The table includes the following indexes for fast filtering:
CREATE INDEX idx_project
ON chunks(project_name, project_version);
CREATE INDEX idx_title ON chunks(title);
CREATE INDEX idx_section ON chunks(section);
Each embedding column stores a float32 array serialized as a binary BLOB. The server deserializes the embeddings at query time and computes similarity scores against the query vector.
Best Practices for Domain Documentation
Follow these guidelines to maximize the accuracy of natural language queries against your custom knowledge base.
- Document every table with its purpose and column descriptions.
- Include example queries for common business questions.
- Define business terms and domain jargon in a glossary file.
- Document join patterns between related tables.
- Include sample data to illustrate expected column values.
- Keep the documentation current when the schema changes.
- Use one Markdown file per major topic or table group.
- Include both simple and complex query examples.
- Write clear column descriptions that distinguish similarly named columns.
- Document enum values and status codes with their meanings.
See Also
The following pages provide additional reference material for knowledge base configuration and usage.
- Knowledgebase Configuration covers the server-side knowledge base settings.
- KB Builder Configuration provides the complete builder configuration reference.
- Available Tools documents all
MCP tools including
search_knowledgebase. - Querying Best Practices describes techniques for effective natural language queries.