# Enable Semantic Search | OpenMetadata Deployment Guide

> Configure semantic search with vector embeddings in OpenMetadata to enable natural language queries against your metadata catalog using OpenSearch.

## Prerequisites

* **OpenSearch** as your search backend (Elasticsearch is not supported)
* An embedding provider: **OpenAI** or **AWS Bedrock** (external APIs), or **DJL** to run HuggingFace models locally
* Network access from the OpenMetadata server to the embedding provider API (not needed when using DJL)

## Overview

Semantic Search enhances OpenMetadata's search capabilities by using **vector embeddings** to understand the meaning behind
queries, rather than relying solely on keyword matching. This means users and AI agents can search using natural language
\-- for example, *"tables with customer demographics and purchase history"* -- and get meaningful results even if those
exact words don't appear in the metadata.

<Info>
  Semantic Search is currently supported only with **OpenSearch** as the search backend.
</Info>

Semantic Search also powers the [Semantic Search MCP tool](/v1.12.x/how-to-guides/mcp/semantic-search),
enabling AI assistants connected via the Model Context Protocol to perform natural language queries against your
metadata catalog.

## How It Works

<Steps>
  <Step title="Text Construction">
    For each entity, a structured text representation is constructed from its metadata -- including name, description,
    entity type, tags, glossary terms, owners, and other relevant fields.
  </Step>

  <Step title="Embedding Generation & Vector Indexing">
    The text is sent to the configured embedding provider to generate a numerical vector (embedding), which is stored
    in a dedicated OpenSearch `vector_search_index` using the HNSW algorithm with cosine similarity. At query time,
    the search text is also embedded and a KNN (K-Nearest Neighbor) similarity search finds the most relevant results.
  </Step>

  <Step title="Automatic Lifecycle Management">
    Embeddings follow the same lifecycle as the entities themselves. When entities are created, updated, deleted, or
    restored, their embeddings are automatically kept in sync using the same indexing strategies the platform already
    uses for search. No manual intervention is required after initial setup.
  </Step>
</Steps>
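
The steps above can be illustrated with a toy sketch. It is not OpenMetadata's implementation: the field layout in `build_text`, the bag-of-words `toy_embed` function, and the brute-force scan in `knn` are all assumptions that stand in for the real text builder, the embedding provider, and the HNSW index. The part that carries over is the ranking by cosine similarity.

```python
import math
from collections import Counter

def build_text(entity: dict) -> str:
    """Flatten selected metadata fields into one string (hypothetical layout)."""
    parts = [
        f"name: {entity['name']}",
        f"type: {entity['entityType']}",
        f"description: {entity.get('description', '')}",
        f"tags: {', '.join(entity.get('tags', []))}",
    ]
    return "; ".join(parts)

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn(query: str, index: dict, k: int = 3) -> list:
    """Brute-force nearest neighbours: embed the query, score every entry, keep the top k."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]

entities = [
    {"name": "dim_customer", "entityType": "table",
     "description": "Customer demographics and purchase history",
     "tags": ["PII.Sensitive"]},
    {"name": "etl_orders", "entityType": "pipeline",
     "description": "Loads raw order events nightly", "tags": []},
]
index = {e["name"]: toy_embed(build_text(e)) for e in entities}
results = knn("customer purchase history", index, k=2)
```

A real embedding model would also match queries that share no words with the metadata, which this word-count stand-in cannot do.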

### Supported Entity Types

`table`, `glossary`, `glossaryTerm`, `chart`, `dashboard`, `dashboardDataModel`, `database`, `databaseSchema`,
`dataProduct`, `pipeline`, `mlmodel`, `metric`, `apiEndpoint`, `apiCollection`, `page`, `storedProcedure`,
`searchIndex`, `topic`

## Configuration

Semantic Search is configured in `openmetadata.yaml` under `elasticsearch.naturalLanguageSearch` (the top-level key is
named `elasticsearch` even though OpenSearch is the required backend). Every setting can be overridden with an
environment variable.

### Enable Semantic Search

| Environment Variable      | Default   | Description                                              |
| ------------------------- | --------- | -------------------------------------------------------- |
| `SEMANTIC_SEARCH_ENABLED` | `false`   | Master switch to enable semantic search                  |
| `EMBEDDING_PROVIDER`      | `bedrock` | Embedding provider to use: `openai`, `bedrock`, or `djl` |

```yaml theme={null}
elasticsearch:
  naturalLanguageSearch:
    semanticSearchEnabled: ${SEMANTIC_SEARCH_ENABLED:-false}
    embeddingProvider: ${EMBEDDING_PROVIDER:-bedrock}
```
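
The `${NAME:-default}` placeholders used throughout this page take the value of the environment variable when it is set and fall back to the default otherwise. The helper below is a simplified illustration of that rule, not the server's actual substitution code; `resolve` is a hypothetical name, and how an empty (as opposed to unset) variable is treated is not modelled.

```python
import re

# Matches ${NAME:-default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{(\w+):-([^}]*)\}")

def resolve(value: str, env: dict) -> str:
    """Replace each ${NAME:-default} with env[NAME], or with the default if NAME is absent."""
    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return env.get(name, default)
    return _PLACEHOLDER.sub(_sub, value)

unset = resolve("${SEMANTIC_SEARCH_ENABLED:-false}", {})
overridden = resolve("${SEMANTIC_SEARCH_ENABLED:-false}", {"SEMANTIC_SEARCH_ENABLED": "true"})
```

Here `unset` keeps the default and `overridden` picks up the value from the environment mapping.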

### Embedding Providers

Choose one of the following embedding providers and configure it accordingly.

<Tabs>
  <Tab title="OpenAI">
    Supports both OpenAI and Azure OpenAI endpoints.

    | Environment Variable         | Default                  | Description                                                           |
    | ---------------------------- | ------------------------ | --------------------------------------------------------------------- |
    | `OPENAI_API_KEY`             | `""`                     | Your OpenAI API key                                                   |
    | `OPENAI_API_ENDPOINT`        | `""`                     | API endpoint. For Azure, use `https://your-resource.openai.azure.com` |
    | `OPENAI_DEPLOYMENT_NAME`     | `""`                     | Deployment name (required for Azure OpenAI)                           |
    | `OPENAI_API_VERSION`         | `2024-02-01`             | API version (Azure OpenAI)                                            |
    | `OPENAI_EMBEDDING_MODEL_ID`  | `text-embedding-3-small` | Embedding model to use                                                |
    | `OPENAI_EMBEDDING_DIMENSION` | `1536`                   | Embedding vector dimension                                            |

    ```yaml theme={null}
    elasticsearch:
      naturalLanguageSearch:
        semanticSearchEnabled: true
        embeddingProvider: openai
        openai:
          apiKey: ${OPENAI_API_KEY:-""}
          endpoint: ${OPENAI_API_ENDPOINT:-""}
          deploymentName: ${OPENAI_DEPLOYMENT_NAME:-""}
          apiVersion: ${OPENAI_API_VERSION:-"2024-02-01"}
          embeddingModelId: ${OPENAI_EMBEDDING_MODEL_ID:-"text-embedding-3-small"}
          embeddingDimension: ${OPENAI_EMBEDDING_DIMENSION:-1536}
    ```
  </Tab>

  <Tab title="AWS Bedrock">
    Uses AWS Bedrock for embedding generation.

    | Environment Variable              | Default | Description                |
    | --------------------------------- | ------- | -------------------------- |
    | `AWS_REGION`                      | `""`    | AWS region                 |
    | `AWS_ACCESS_KEY_ID`               | `""`    | AWS access key             |
    | `AWS_SECRET_ACCESS_KEY`           | `""`    | AWS secret access key      |
    | `AWS_BEDROCK_EMBED_MODEL_ID`      | `""`    | Bedrock embedding model ID |
    | `AWS_BEDROCK_EMBEDDING_DIMENSION` | `""`    | Embedding vector dimension |

    ```yaml theme={null}
    elasticsearch:
      naturalLanguageSearch:
        semanticSearchEnabled: true
        embeddingProvider: bedrock
        bedrock:
          awsConfig:
            region: ${AWS_REGION:-""}
            accessKeyId: ${AWS_ACCESS_KEY_ID:-""}
            secretAccessKey: ${AWS_SECRET_ACCESS_KEY:-""}
          embeddingModelId: ${AWS_BEDROCK_EMBED_MODEL_ID:-""}
          embeddingDimension: ${AWS_BEDROCK_EMBEDDING_DIMENSION:-""}
    ```
  </Tab>

  <Tab title="DJL">
    Uses [Deep Java Library](https://djl.ai/) to run embedding models locally. No external API calls required.

    <Warning>
      DJL downloads the HuggingFace model and runs it directly on your server, so resource requirements depend on the chosen model. If you are resource constrained, use an external provider instead.

      The example model we provide is a small one suited to development and testing. If you choose DJL, pick a model that fits your use case.
    </Warning>

    | Environment Variable  | Default                                                             | Description                  |
    | --------------------- | ------------------------------------------------------------------- | ---------------------------- |
    | `DJL_EMBEDDING_MODEL` | `ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2` | HuggingFace model identifier |

    The embedding dimension is auto-detected from the model at startup. The default model `all-MiniLM-L6-v2`
    produces 384-dimensional vectors.

    ```yaml theme={null}
    elasticsearch:
      naturalLanguageSearch:
        semanticSearchEnabled: true
        embeddingProvider: djl
        djl:
          embeddingModel: ${DJL_EMBEDDING_MODEL:-"ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2"}
    ```
  </Tab>
</Tabs>

## Docker Deployment

To enable Semantic Search in a Docker deployment, set the required environment variables in your `docker-compose` override
file, as shown below, or as `KEY=value` lines in your `.env` file:

```yaml theme={null}
environment:
  SEMANTIC_SEARCH_ENABLED: "true"
  EMBEDDING_PROVIDER: "openai"
  OPENAI_API_KEY: "sk-..."
  OPENAI_EMBEDDING_MODEL_ID: "text-embedding-3-small"
  OPENAI_EMBEDDING_DIMENSION: "1536"
```

## Kubernetes Deployment

For Kubernetes deployments using the OpenMetadata Helm chart, add the environment variables to your `values.yaml`:

```yaml theme={null}
openmetadata:
  config:
    extraEnvs:
      - name: SEMANTIC_SEARCH_ENABLED
        value: "true"
      - name: EMBEDDING_PROVIDER
        value: "openai"
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openmetadata-secrets
            key: openai-api-key
      - name: OPENAI_EMBEDDING_MODEL_ID
        value: "text-embedding-3-small"
      - name: OPENAI_EMBEDDING_DIMENSION
        value: "1536"
```

<Tip>
  Store sensitive values like API keys in Kubernetes Secrets and reference them with `secretKeyRef` rather than
  hardcoding them in `values.yaml`.
</Tip>

## Validating the Configuration

After configuring your embedding provider, open `Settings > Preferences > Health` in the OpenMetadata UI. This page
shows the status of the embedding provider connection and flags any misconfiguration.

## Generating Embeddings

Once Semantic Search is enabled, embeddings are generated and kept in sync automatically as entities are created
or updated. To generate embeddings for all existing entities, run a **Reindex** from the OpenMetadata UI
(`Settings > Applications > Search Indexing`).

Each Reindex operation uses a fingerprint of every entity's text representation: if the text has not changed since the
entity was last embedded, the embedding is not recomputed. This avoids unnecessary calls to the embedding provider and
keeps re-indexing efficient even for large catalogs.
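
The fingerprint check can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: the SHA-256 hash, the `stored` dictionary layout, and the `reindex` and `fingerprint` names are all hypothetical, since the actual fingerprint function and storage are not described here.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the entity's text representation (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reindex(entities: dict, stored: dict, embed) -> int:
    """Re-embed only entities whose text changed; return the number of embed calls.

    entities: entity id -> current text representation
    stored:   entity id -> (fingerprint, vector), updated in place
    embed:    callable that turns text into a vector
    """
    calls = 0
    for entity_id, text in entities.items():
        fp = fingerprint(text)
        cached = stored.get(entity_id)
        if cached is not None and cached[0] == fp:
            continue  # text unchanged since the last embedding, skip the provider call
        stored[entity_id] = (fp, embed(text))
        calls += 1
    return calls
```

Running `reindex` twice over unchanged entities makes provider calls only on the first pass. A later edit to one description triggers a single new call.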

## API Reference

Semantic Search exposes a REST API endpoint for vector queries:

### POST `/api/v1/search/vector/query`

Performs a semantic search against the vector index.

**Request Body:**

```json theme={null}
{
  "query": "customer demographics purchase history",
  "filters": {
    "entityType": ["table"],
    "owners": ["admin"],
    "tags": ["PII.Sensitive"],
    "domains": ["Marketing"],
    "tier": ["Tier.Tier1"],
    "serviceType": ["Postgres"]
  },
  "size": 10,
  "k": 1000,
  "threshold": 0.0
}
```

| Parameter   | Type   | Default      | Description                                                                                               |
| ----------- | ------ | ------------ | --------------------------------------------------------------------------------------------------------- |
| `query`     | string | *(required)* | Natural language search text                                                                              |
| `filters`   | map    | `{}`         | Filter map by entity type, owners, tags, domains, tier, service type, certification, or custom properties |
| `size`      | int    | `10`         | Number of distinct entities to return (max 100)                                                           |
| `k`         | int    | `500`        | KNN parameter -- number of nearest neighbors to consider (max 10,000)                                     |
| `threshold` | double | `0.0`        | Minimum similarity score to include in results                                                            |
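
As a minimal client-side sketch, the request can be assembled with the Python standard library. The base URL, the bearer-token `Authorization` header, and the lower bound of 1 on `size` and `k` are assumptions; only the path, the parameter names, the defaults, and the maximums come from the table above.

```python
import json
import urllib.request

def build_vector_query(query, filters=None, size=10, k=500, threshold=0.0):
    """Build the request body, enforcing the documented maximums."""
    if not query:
        raise ValueError("query is required")
    if not 1 <= size <= 100:
        raise ValueError("size must be between 1 and 100")
    if not 1 <= k <= 10_000:
        raise ValueError("k must be between 1 and 10,000")
    return {
        "query": query,
        "filters": filters or {},
        "size": size,
        "k": k,
        "threshold": threshold,
    }

def make_request(base_url, token, payload):
    """Prepare (but do not send) the POST request. Bearer auth is an assumption."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/search/vector/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To send it, pass the prepared request to `urllib.request.urlopen`. The response format is not covered in this section, so the sketch stops at building the request.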

Results are deduplicated by parent entity, so you will receive at most `size` distinct entities even if an entity has
multiple text chunks.
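
One way to picture the deduplication is the sketch below. The hit fields `parentId` and `score`, and the choice to keep each entity's best-scoring chunk, are assumptions made for illustration; the server's actual logic may differ.

```python
def dedupe_by_parent(chunk_hits: list, size: int) -> list:
    """Keep the best-scoring chunk per parent entity, then return the top `size` entities."""
    best = {}
    for hit in chunk_hits:
        parent = hit["parentId"]  # hypothetical field name
        if parent not in best or hit["score"] > best[parent]["score"]:
            best[parent] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:size]

hits = [
    {"parentId": "table-1", "score": 0.91},
    {"parentId": "table-1", "score": 0.88},
    {"parentId": "table-2", "score": 0.75},
]
top = dedupe_by_parent(hits, size=10)
```

Three chunk hits across two entities collapse to two results, one per entity.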

## Troubleshooting

### Semantic Search returns no results

* Verify that `SEMANTIC_SEARCH_ENABLED` is set to `true` and the server has been restarted.
* Confirm that OpenSearch is your search backend (Elasticsearch is not supported).
* Check that the `vector_search_index` exists in OpenSearch.
* Run a Reindex to generate embeddings for existing entities.
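
The index check can be scripted with OpenSearch's index-exists call (`HEAD /<index>`), which returns 200 when the index is present and 404 when it is not. The sketch assumes an unauthenticated OpenSearch endpoint and that the index is named exactly `vector_search_index` with no prefix; adjust both for your cluster. The `opener` parameter is a hypothetical hook that allows testing without a live cluster.

```python
import urllib.error
import urllib.request

def index_exists(opensearch_url, index="vector_search_index",
                 opener=urllib.request.urlopen):
    """Return True if the index exists, False on 404; other HTTP errors propagate."""
    req = urllib.request.Request(
        f"{opensearch_url.rstrip('/')}/{index}", method="HEAD"
    )
    try:
        with opener(req, timeout=5) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

For example, `index_exists("http://localhost:9200")` checks a local single-node cluster; the host and port here are placeholders.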

### Embedding generation fails

* Verify network connectivity from the OpenMetadata server to your embedding provider.
* Check that API keys and credentials are correct.
* Review the OpenMetadata server logs for detailed error messages.
