Enable Semantic Search

Prerequisites

  • OpenSearch as your search backend (Elasticsearch is not supported)
  • An embedding provider: OpenAI, AWS Bedrock, or DJL (for HuggingFace models)
  • Network access from the OpenMetadata server to the embedding provider API (unless using DJL)

Overview

Semantic Search enhances OpenMetadata’s search capabilities by using vector embeddings to understand the meaning behind queries, rather than relying solely on keyword matching. This means users and AI agents can search using natural language — for example, “tables with customer demographics and purchase history” — and get meaningful results even if those exact words don’t appear in the metadata.
Semantic Search is currently supported only with OpenSearch as the search backend.
Semantic Search also powers the Semantic Search MCP tool, enabling AI assistants connected via the Model Context Protocol to perform natural language queries against your metadata catalog.

How It Works

1. Text Construction

For each entity, a structured text representation is constructed from its metadata — including name, description, entity type, tags, glossary terms, owners, and other relevant fields.
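
As a rough illustration of this step, the sketch below joins an entity's non-empty metadata fields into one labeled text block. The `Entity` class, the field labels, and the line-per-field layout are hypothetical; the exact template OpenMetadata uses is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical, simplified stand-in for a catalog entity."""
    name: str
    entity_type: str
    description: str = ""
    tags: list[str] = field(default_factory=list)
    glossary_terms: list[str] = field(default_factory=list)
    owners: list[str] = field(default_factory=list)


def build_embedding_text(entity: Entity) -> str:
    """Join the non-empty metadata fields into one labeled text block."""
    parts = [
        ("name", entity.name),
        ("type", entity.entity_type),
        ("description", entity.description),
        ("tags", ", ".join(entity.tags)),
        ("glossary terms", ", ".join(entity.glossary_terms)),
        ("owners", ", ".join(entity.owners)),
    ]
    # Empty fields are skipped so they do not add noise to the embedding input.
    return "\n".join(f"{label}: {value}" for label, value in parts if value)


text = build_embedding_text(
    Entity(
        name="customer_orders",
        entity_type="table",
        description="Orders joined with customer demographics.",
        tags=["PII.Sensitive"],
        owners=["admin"],
    )
)
print(text)
```

Labeling each field keeps the context ("this word is a tag, not a name") available to the embedding model, which is one plausible reason to build structured text rather than concatenating raw values.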
2. Embedding Generation & Vector Indexing

The text is sent to the configured embedding provider to generate a numerical vector (embedding), which is stored in a dedicated OpenSearch vector_search_index using the HNSW algorithm with cosine similarity. At query time, the search text is also embedded and a KNN (K-Nearest Neighbor) similarity search finds the most relevant results.
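
To make the query-time behavior concrete, here is a minimal sketch of cosine-similarity nearest-neighbor search. It uses an exact brute-force scan over toy 3-dimensional vectors, not HNSW (which is an approximate index that OpenSearch manages internally); the document IDs and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn(query_vec, indexed, k):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy "index": real embeddings have hundreds or thousands of dimensions.
index = {
    "orders": [0.9, 0.1, 0.0],
    "customers": [0.7, 0.7, 0.0],
    "web_logs": [0.0, 0.1, 0.9],
}
results = knn([1.0, 0.2, 0.0], index, k=2)
print([doc_id for doc_id, _ in results])
```

HNSW trades a small amount of recall for speed by walking a layered graph instead of scoring every vector, but the ranking criterion is the same similarity shown here.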
3. Automatic Lifecycle Management

Embeddings follow the same lifecycle as the entities themselves. When entities are created, updated, deleted, or restored, their embeddings are automatically kept in sync using the same indexing strategies the platform already uses for search. No manual intervention is required after initial setup.

Supported Entity Types

table, glossary, glossaryTerm, chart, dashboard, dashboardDataModel, database, databaseSchema, dataProduct, pipeline, mlmodel, metric, apiEndpoint, apiCollection, page, storedProcedure, searchIndex, topic

Configuration

Semantic Search is configured in openmetadata.yaml under the elasticsearch.naturalLanguageSearch section. All settings can be overridden with environment variables.

Enable Semantic Search

| Environment Variable | Default | Description |
| --- | --- | --- |
| SEMANTIC_SEARCH_ENABLED | false | Master switch to enable semantic search |
| EMBEDDING_PROVIDER | bedrock | Embedding provider to use: openai, bedrock, or djl |

elasticsearch:
  naturalLanguageSearch:
    semanticSearchEnabled: ${SEMANTIC_SEARCH_ENABLED:-false}
    embeddingProvider: ${EMBEDDING_PROVIDER:-bedrock}
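
The `${NAME:-default}` placeholders mean "use the environment variable if it is set, otherwise the default". The sketch below mimics that resolution in Python purely for illustration; the `resolve` helper is hypothetical, and treating only unset variables (not empty ones) as missing is an assumption about the server's behavior.

```python
import os
import re

# Matches ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{(\w+):-([^}]*)\}")


def resolve(value, env=None):
    """Replace each ${NAME:-default} with env[NAME], or the default if NAME is unset."""
    env = os.environ if env is None else env

    def substitute(match):
        name, default = match.group(1), match.group(2)
        # Assumption: only an unset variable falls back to the default.
        return env[name] if name in env else default

    return _PLACEHOLDER.sub(substitute, value)


# With no override, the defaults from openmetadata.yaml apply.
print(resolve("${SEMANTIC_SEARCH_ENABLED:-false}", env={}))
# With an override, the environment variable wins.
print(resolve("${EMBEDDING_PROVIDER:-bedrock}", env={"EMBEDDING_PROVIDER": "openai"}))
```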

Embedding Providers

Choose one of the following embedding providers and configure it accordingly.
OpenAI

The OpenAI provider supports both OpenAI and Azure OpenAI endpoints.

| Environment Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | "" | Your OpenAI API key |
| OPENAI_API_ENDPOINT | "" | API endpoint. For Azure, use https://your-resource.openai.azure.com |
| OPENAI_DEPLOYMENT_NAME | "" | Deployment name (required for Azure OpenAI) |
| OPENAI_API_VERSION | 2024-02-01 | API version (Azure OpenAI) |
| OPENAI_EMBEDDING_MODEL_ID | text-embedding-3-small | Embedding model to use |
| OPENAI_EMBEDDING_DIMENSION | 1536 | Embedding vector dimension |

elasticsearch:
  naturalLanguageSearch:
    semanticSearchEnabled: true
    embeddingProvider: openai
    openai:
      apiKey: ${OPENAI_API_KEY:-""}
      endpoint: ${OPENAI_API_ENDPOINT:-""}
      deploymentName: ${OPENAI_DEPLOYMENT_NAME:-""}
      apiVersion: ${OPENAI_API_VERSION:-"2024-02-01"}
      embeddingModelId: ${OPENAI_EMBEDDING_MODEL_ID:-"text-embedding-3-small"}
      embeddingDimension: ${OPENAI_EMBEDDING_DIMENSION:-1536}
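
Before restarting the server, it can help to sanity-check the variables from the table above. The following pre-flight check is a hypothetical helper, not part of OpenMetadata; detecting Azure by the `.openai.azure.com` hostname is an assumption made for this sketch.

```python
def validate_openai_settings(env):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is empty")

    endpoint = env.get("OPENAI_API_ENDPOINT", "")
    # Assumption: an Azure OpenAI endpoint is recognizable by its hostname.
    is_azure = ".openai.azure.com" in endpoint
    if is_azure and not env.get("OPENAI_DEPLOYMENT_NAME"):
        problems.append("OPENAI_DEPLOYMENT_NAME is required for Azure OpenAI")

    dimension = env.get("OPENAI_EMBEDDING_DIMENSION", "1536")
    if not dimension.isdigit() or int(dimension) <= 0:
        problems.append("OPENAI_EMBEDDING_DIMENSION must be a positive integer")
    return problems


print(validate_openai_settings({
    "OPENAI_API_KEY": "sk-test",
    "OPENAI_API_ENDPOINT": "https://your-resource.openai.azure.com",
}))
```

In a real shell you would pass `dict(os.environ)` instead of a literal dictionary.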

Docker Deployment

To enable Semantic Search in a Docker deployment, set the required environment variables in your docker-compose override or .env file:
environment:
  SEMANTIC_SEARCH_ENABLED: "true"
  EMBEDDING_PROVIDER: "openai"
  OPENAI_API_KEY: "sk-..."
  OPENAI_EMBEDDING_MODEL_ID: "text-embedding-3-small"
  OPENAI_EMBEDDING_DIMENSION: "1536"

Kubernetes Deployment

For Kubernetes deployments using the OpenMetadata Helm chart, add the environment variables to your values.yaml:
openmetadata:
  config:
    extraEnvs:
      - name: SEMANTIC_SEARCH_ENABLED
        value: "true"
      - name: EMBEDDING_PROVIDER
        value: "openai"
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openmetadata-secrets
            key: openai-api-key
      - name: OPENAI_EMBEDDING_MODEL_ID
        value: "text-embedding-3-small"
      - name: OPENAI_EMBEDDING_DIMENSION
        value: "1536"
Store sensitive values like API keys in Kubernetes Secrets and reference them with secretKeyRef rather than hardcoding them in values.yaml.

Validating the Configuration

After configuring your embedding provider, you can verify that everything is set up correctly by navigating to Settings > Preferences > Health in the OpenMetadata UI. This page shows the status of the embedding provider connection and will flag any misconfiguration.

Generating Embeddings

Once Semantic Search is enabled, embeddings are generated and kept in sync automatically as entities are created or updated. To generate embeddings for all existing entities, run a Reindex from the OpenMetadata UI (Settings > Applications > Search Indexing). Each Reindex operation checks a fingerprint of every entity's text representation: if the text has not changed since the entity was last embedded, the embedding is not recomputed. This avoids unnecessary calls to the embedding provider and keeps reindexing efficient even for large catalogs.
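
The fingerprint check can be pictured as follows. This is a minimal sketch under stated assumptions: SHA-256 as the fingerprint, an in-memory dictionary as the store, and a fake embedding function are all illustrative choices, not details confirmed for OpenMetadata.

```python
import hashlib


def fingerprint(text):
    """Stable hash of the entity's text representation (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_if_changed(entity_id, text, store, embed):
    """Call embed(text) only when the text differs from what was last embedded."""
    fp = fingerprint(text)
    cached = store.get(entity_id)
    if cached is not None and cached["fingerprint"] == fp:
        return False  # unchanged: skip the provider call
    store[entity_id] = {"fingerprint": fp, "embedding": embed(text)}
    return True


calls = []


def fake_embed(text):
    """Stand-in for a provider call; records each invocation."""
    calls.append(text)
    return [float(len(text))]


store = {}
first = embed_if_changed("t1", "name: orders", store, fake_embed)
second = embed_if_changed("t1", "name: orders", store, fake_embed)
third = embed_if_changed("t1", "name: orders v2", store, fake_embed)
print(first, second, third, len(calls))
```

Only the first and third calls reach the embedding function, which is why repeated reindexing of an unchanged catalog costs few provider requests.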

API Reference

Semantic Search exposes a REST API endpoint for vector queries:

POST /api/v1/search/vector/query

Performs a semantic search against the vector index.

Request Body:
{
  "query": "customer demographics purchase history",
  "filters": {
    "entityType": ["table"],
    "owners": ["admin"],
    "tags": ["PII.Sensitive"],
    "domains": ["Marketing"],
    "tier": ["Tier.Tier1"],
    "serviceType": ["Postgres"]
  },
  "size": 10,
  "k": 1000,
  "threshold": 0.0
}
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | (required) | Natural language search text |
| filters | map | {} | Filter map by entity type, owners, tags, domains, tier, service type, certification, or custom properties |
| size | int | 10 | Number of distinct entities to return (max 100) |
| k | int | 500 | KNN parameter: number of nearest neighbors to consider (max 10,000) |
| threshold | double | 0.0 | Minimum similarity score to include in results |

Results are deduplicated by parent entity, so you will receive at most size distinct entities even if an entity has multiple text chunks.
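
A client for this endpoint might look like the sketch below. The helper names are hypothetical, the lower bound of 1 for `size` and `k` is an assumption (only the maximums are documented), and the `Authorization: Bearer` header assumes you authenticate with a JWT or bot token. The response is returned as raw JSON because its schema is not described here.

```python
import json
import urllib.request

MAX_SIZE = 100     # documented maximum for "size"
MAX_K = 10_000     # documented maximum for "k"


def build_vector_query(query, filters=None, size=10, k=500, threshold=0.0):
    """Build the request body, applying the documented defaults and limits."""
    if not query:
        raise ValueError("query is required")
    if not 1 <= size <= MAX_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_SIZE}")
    if not 1 <= k <= MAX_K:
        raise ValueError(f"k must be between 1 and {MAX_K}")
    return {
        "query": query,
        "filters": filters or {},
        "size": size,
        "k": k,
        "threshold": threshold,
    }


def post_vector_query(base_url, token, payload):
    """POST the payload to the vector query endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/search/vector/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: token-based auth
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_vector_query(
    "customer demographics purchase history",
    filters={"entityType": ["table"]},
)
print(json.dumps(payload, indent=2))
```

To run the query, call `post_vector_query("http://localhost:8585", "<your-token>", payload)`, replacing the placeholder URL and token with your own values.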

Troubleshooting

Semantic Search returns no results

  • Verify that SEMANTIC_SEARCH_ENABLED is set to true and the server has been restarted.
  • Confirm that OpenSearch is your search backend (Elasticsearch is not supported).
  • Check that the vector_search_index exists in OpenSearch.
  • Run a Reindex to generate embeddings for existing entities.
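
One way to check for the index is an HTTP `HEAD` request against OpenSearch, which returns 200 when an index exists and 404 when it does not. The sketch below assumes an unauthenticated cluster reachable at a URL you supply; clusters with security enabled will also need credentials, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request


def index_url(opensearch_url, index="vector_search_index"):
    """Build the URL of the index resource."""
    return f"{opensearch_url.rstrip('/')}/{index}"


def index_exists(opensearch_url, index="vector_search_index"):
    """Return True if the index exists, False on 404; re-raise other HTTP errors."""
    request = urllib.request.Request(index_url(opensearch_url, index), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # e.g. 401/403 when the cluster requires authentication
```

For example, `index_exists("http://localhost:9200")` checks a local cluster on the default port.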

Embedding generation fails

  • Verify network connectivity from the OpenMetadata server to your embedding provider.
  • Check that API keys and credentials are correct.
  • Review the OpenMetadata server logs for detailed error messages.