# External Auto Classification Workflow

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter**                 | **Description**                                                                                                                                                                    | **Type** | **Default Value**    |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | -------------------- |
| `type`                        | Specifies the pipeline type.                                                                                                                                                       | String   | `AutoClassification` |
| `classificationFilterPattern` | Regex to restrict the run to tables matching specific tags, tiers, or glossary patterns.                                                                                           | Object   | N/A                  |
| `schemaFilterPattern`         | Regex to fetch schemas matching the specified pattern.                                                                                                                             | Object   | N/A                  |
| `tableFilterPattern`          | Regex to fetch tables matching the specified pattern.                                                                                                                              | Object   | N/A                  |
| `databaseFilterPattern`       | Regex to fetch databases matching the specified pattern.                                                                                                                           | Object   | N/A                  |
| `includeViews`                | Option to include or exclude views during metadata ingestion.                                                                                                                      | Boolean  | `true`               |
| `useFqnForFiltering`          | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names.                                                                                    | Boolean  | `false`              |
| `storeSampleData`             | Controls whether sampled rows are persisted to OpenMetadata after classification. Rows are always sampled for classification regardless of this setting.                           | Boolean  | `false`              |
| `enableAutoClassification`    | Enables automatic tagging of columns that might contain sensitive information.                                                                                                     | Boolean  | `true`               |
| `confidence`                  | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100.                                                                                                     | Number   | `80`                 |
| `sampleDataCount`             | Maximum rows sampled per table for classification. Capped at 50 — setting a higher value has no effect. `storeSampleData` controls whether sampled rows are saved to OpenMetadata. | Integer  | `50`                 |
| `classificationLanguage`      | Language for auto-classification recognizers. Use `any` to run all recognizers regardless of language. For a specific language, only matching recognizers run.                     | String   | `en`                 |

## Key Parameters Explained

### `enableAutoClassification`

* Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
* Applies pattern recognition and tagging based on predefined criteria.

### `confidence`

* Confidence level for tagging sensitive columns:
  * A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  * A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
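
The trade-off can be illustrated with a small sketch. The column names and scores below are made up, and `columns_to_tag` is a hypothetical helper rather than part of OpenMetadata. It only assumes that each column receives a score on the same 0 to 100 scale as `confidence` and is tagged when the score meets the threshold.

```python
# Hypothetical recognizer scores per column on a 0-100 scale (illustrative values only).
SCORES = {"email": 95, "full_name": 85, "nickname": 72, "order_id": 10}


def columns_to_tag(scores: dict[str, float], confidence: float) -> list[str]:
    """Return the columns whose score meets the confidence threshold (assumed rule)."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    return sorted(name for name, score in scores.items() if score >= confidence)


print(columns_to_tag(SCORES, 90))  # stricter threshold: fewer columns tagged
print(columns_to_tag(SCORES, 70))  # looser threshold: more columns tagged
```

Raising the threshold drops borderline columns such as `nickname`, which is how a higher value trades missed detections for fewer false positives.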

### `storeSampleData`

* Controls whether sampled rows are **persisted** to OpenMetadata after classification.
* Rows are always sampled for classification regardless of this setting — up to 50 rows, or fewer if `sampleDataCount` is set lower. `storeSampleData` only decides whether those rows are also saved to OpenMetadata.
* Defaults to `false`.
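
As a rough sketch of how `sampleDataCount` and `storeSampleData` interact, the hypothetical helper below (`sampling_plan` is not an OpenMetadata function) applies the 50-row cap described above and reports separately whether the sampled rows would be persisted.

```python
MAX_SAMPLE_ROWS = 50  # cap on sampled rows, as stated in the parameter table


def sampling_plan(sample_data_count: int = 50, store_sample_data: bool = False) -> dict:
    """Summarize how many rows are sampled and whether they are persisted."""
    return {
        # Sampling always happens; values above the cap have no effect.
        "rows_sampled": min(sample_data_count, MAX_SAMPLE_ROWS),
        # Persistence is the only thing storeSampleData changes.
        "rows_persisted": store_sample_data,
    }


print(sampling_plan(sample_data_count=200))
print(sampling_plan(sample_data_count=10, store_sample_data=True))
```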

### `classificationFilterPattern`

Use this to scope the auto-classification run to only tables carrying a specific tag.

The value you provide must match the **tag name** or **tag FQN** depending on the `useFqnForFiltering` setting:

* **Default (`useFqnForFiltering: false`)** — match against the **tag name only**. For example, if your tag is `POV.Key Data Asset`, use `key data asset`:

  ```yaml theme={null}
  classificationFilterPattern:
    includes:
      - "key data asset"
  ```

* **When `useFqnForFiltering: true`** — match against the full tag FQN in the format `Classification.TagName`:

  ```yaml theme={null}
  classificationFilterPattern:
    includes:
      - "POV.Key Data Asset"
  ```

<Warning>
  **Important**: Passing the FQN format (for example, `POV.Key Data Asset`) when `useFqnForFiltering` is `false` (the default) will cause the filter to match nothing — all records will be skipped and the run will report zero classified results.
</Warning>

### `useFqnForFiltering`

* When set to `true`, filtering patterns — including `classificationFilterPattern` — are matched against the Fully Qualified Name (for example, `Classification.TagName`).
* When set to `false` (default), filtering matches against raw names only (for example, the tag name without the classification prefix).
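
The difference between the two modes can be modeled with a short sketch. This is a simplified illustration, not the OpenMetadata implementation: `tag_is_included` is a hypothetical helper, and it assumes patterns are regular expressions matched from the start of the string, case-insensitively.

```python
import re


def tag_is_included(tag_fqn: str, includes: list[str], use_fqn_for_filtering: bool = False) -> bool:
    """Simplified model of an `includes` filter applied to a single tag."""
    # With FQN filtering off, only the part after the classification prefix is compared.
    candidate = tag_fqn if use_fqn_for_filtering else tag_fqn.split(".", 1)[-1]
    # Assumption: patterns are anchored at the start and matched case-insensitively.
    return any(re.match(pattern, candidate, re.IGNORECASE) for pattern in includes)


TAG = "POV.Key Data Asset"
print(tag_is_included(TAG, ["key data asset"]))      # tag name pattern, default mode
print(tag_is_included(TAG, ["POV.Key Data Asset"]))  # FQN pattern, default mode
print(tag_is_included(TAG, ["POV.Key Data Asset"], use_fqn_for_filtering=True))
```

In this model the FQN pattern fails in the default mode because it is compared against `Key Data Asset` alone, which mirrors the zero-results behavior described in the warning above.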

## Auto Classification Workflow Execution

To execute the **Auto Classification Workflow**, follow the steps below:

### 1. Install the Required Python Package

Ensure you have the correct OpenMetadata ingestion package installed, including the **PII Processor** module:

```bash theme={null}
pip install "openmetadata-ingestion[pii-processor]"
```

### 2. Define and Execute the Python Workflow

Instead of running the YAML configuration through the CLI, load it in Python and pass it to `AutoClassificationWorkflow` from the OpenMetadata ingestion package to trigger the ingestion process programmatically.

#### Sample Auto Classification Workflow YAML

```yaml theme={null}
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: false
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "tag-pii-processor"
  config: {}

sink:
  type: metadata-rest
  config: {}
workflowConfig:
#  loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```
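
A minimal Python sketch of running such a configuration programmatically follows. It builds the same top-level structure as the sample YAML as a plain dictionary and passes it to `AutoClassificationWorkflow`. Several things here are assumptions: the import path `metadata.workflow.classification` and the `create`, `execute`, `raise_from_status`, and `stop` calls should be checked against your installed `openmetadata-ingestion` version; `build_workflow_config` and `run_auto_classification` are hypothetical helper names; and `serviceConnection` is omitted on the assumption that the service is already registered in OpenMetadata.

```python
def build_workflow_config(service_name: str, host_port: str, jwt_token: str) -> dict:
    """Build a workflow config dict mirroring the structure of the sample YAML."""
    return {
        "source": {
            "type": "bigquery",
            "serviceName": service_name,
            # serviceConnection omitted: assumes the service already exists on the server.
            "sourceConfig": {
                "config": {
                    "type": "AutoClassification",
                    "storeSampleData": False,
                    "enableAutoClassification": True,
                }
            },
        },
        "processor": {"type": "tag-pii-processor", "config": {}},
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": host_port,
                "authProvider": "openmetadata",
                "securityConfig": {"jwtToken": jwt_token},
            }
        },
    }


def run_auto_classification(config: dict) -> None:
    """Run the workflow; needs openmetadata-ingestion and a reachable server."""
    # Imported lazily so this module loads even without the package installed.
    # Assumption: this import path matches your openmetadata-ingestion version.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()  # raise if any step reported a failure
    workflow.stop()


# Example usage (requires a running OpenMetadata server and a valid token):
# run_auto_classification(
#     build_workflow_config("local_bigquery", "http://localhost:8585/api", "<jwt-token>")
# )
```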

### 3. Expected Outcome

* Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
* Improves metadata enrichment and enhances data governance practices.
* Provides visibility into sensitive data across databases.

This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.

## Auto Classification

The Auto Classification workflow uses the `orm-profiler` processor.

After running a Metadata Ingestion workflow, we can run the Auto Classification workflow.
Use the same `serviceName` that was used in Metadata Ingestion so that the ingestion bot can get the `serviceConnection` details from the server.

### 1. Define the YAML Config

This is a sample config for the Auto Classification Workflow:

<CodePreview>
  <ContentPanel>
    <ContentSection id={1} title="Source Configuration" lines="1-3">
      Configure the source type and service name for your auto classification workflow.
    </ContentSection>

    <ContentSection id={2} title="Auto Classification Config Type" lines="4-6">
      **type**: Set to `AutoClassification` for automatic PII tagging.
    </ContentSection>

    <ContentSection id={3} title="Store Sample Data" lines="7">
      **storeSampleData**: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.
    </ContentSection>

    <ContentSection id={4} title="Enable Auto Classification" lines="8">
      **enableAutoClassification**: Optional configuration to automatically tag columns that might contain sensitive information.
    </ContentSection>

    <ContentSection id={5} title="Confidence" lines="9">
      **confidence**: Set the confidence value at which you want a column to be tagged as PII. The value ranges from 0 to 100. A higher number yields fewer false positives but more false negatives. A lower number yields more false positives but fewer false negatives.
    </ContentSection>

    <ContentSection id={6} title="Database Filter Pattern" lines="10-15">
      **databaseFilterPattern**: Regex to only fetch databases that match the pattern.
    </ContentSection>

    <ContentSection id={7} title="Schema Filter Pattern" lines="16-21">
      **schemaFilterPattern**: Regex to only fetch schemas that match the pattern.
    </ContentSection>

    <ContentSection id={8} title="Table Filter Pattern" lines="22-27">
      **tableFilterPattern**: Regex to only fetch tables that match the pattern.
    </ContentSection>

    <ContentSection id={9} title="Processor Configuration" lines="28-30">
      Choose the `orm-profiler`. Its config can also be updated to define tests from the YAML itself instead of the UI.

      **tableConfig**: `tableConfig` allows you to set up some configuration at the table level.
    </ContentSection>

    <ContentSection id={10} title="Sink Configuration" lines="31-33">
      To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
    </ContentSection>
  </ContentPanel>

  <CodePanel fileName="{connector}_auto_classification.yaml">
    ```yaml theme={null}
    source:
      type: snowflake
      serviceName: snowflake
      sourceConfig:
        config:
          type: AutoClassification
          # storeSampleData: true
          # enableAutoClassification: true
          # confidence: 80
          # databaseFilterPattern:
          #   includes:
          #     - database1
          #     - database2
          #   excludes:
          #     - database3
          # schemaFilterPattern:
          #   includes:
          #     - schema1
          #     - schema2
          #   excludes:
          #     - schema3
          # tableFilterPattern:
          #   includes:
          #     - table1
          #     - table2
          #   excludes:
          #     - table3
    processor:
      type: orm-profiler
      config: {}
    sink:
      type: metadata-rest
      config: {}
    ```
  </CodePanel>
</CodePreview>

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

```bash theme={null}
metadata classify -c <path-to-yaml>
```

<Tip>
  Now instead of running `ingest`, we are using the `classify` command to select the Auto Classification workflow.
</Tip>
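
If the CLI needs to be triggered from a script or scheduler task, one option is to wrap the same command with Python's `subprocess` module. The helper names below are hypothetical, and the sketch assumes the `metadata` CLI is installed and available on `PATH`.

```python
import subprocess


def classify_command(config_path: str) -> list[str]:
    """Build the argument list for `metadata classify -c <path-to-yaml>`."""
    return ["metadata", "classify", "-c", config_path]


def run_classify(config_path: str) -> int:
    """Run the CLI and return its exit code; requires the `metadata` CLI on PATH."""
    return subprocess.run(classify_command(config_path), check=False).returncode


print(classify_command("snowflake_auto_classification.yaml"))
```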

## Workflow Execution

### To Execute the Auto Classification Workflow:

1. **Create a Pipeline**
   * Configure the Auto Classification JSON as demonstrated in the provided configuration example.

2. **Run the Ingestion Pipeline**
   * Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**
   * Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.

### Expected Outcomes

* **Automatic Tagging:**
  Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.

* **Enhanced Visibility:**
  Gain improved visibility and classification of sensitive data within your databases.

* **Sample Data Integration:**
  Store sample data to provide better insights during profiling and testing workflows.