# External Auto Classification Workflow

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter**                 | **Description**                                                                                 | **Type** | **Default Value**    |
| ----------------------------- | ----------------------------------------------------------------------------------------------- | -------- | -------------------- |
| `type`                        | Specifies the pipeline type.                                                                    | String   | `AutoClassification` |
| `classificationFilterPattern` | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns.        | Object   | N/A                  |
| `schemaFilterPattern`         | Regex to fetch schemas matching the specified pattern.                                          | Object   | N/A                  |
| `tableFilterPattern`          | Regex to include or exclude tables matching the specified pattern.                              | Object   | N/A                  |
| `databaseFilterPattern`       | Regex to fetch databases matching the specified pattern.                                        | Object   | N/A                  |
| `includeViews`                | Option to include or exclude views during metadata ingestion.                                   | Boolean  | `true`               |
| `useFqnForFiltering`          | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean  | `false`              |
| `storeSampleData`             | Option to enable or disable storing sample data for each table.                                 | Boolean  | `true`               |
| `enableAutoClassification`    | Enables automatic tagging of columns that might contain sensitive information.                  | Boolean  | `false`              |
| `confidence`                  | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100.                  | Number   | `80`                 |
| `sampleDataCount`             | Number of sample rows to ingest when Store Sample Data is enabled.                              | Integer  | `50`                 |
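
As a rough illustration of how the defaults in the table combine with user-supplied values, the sketch below overlays a partial configuration on the documented defaults and checks the `confidence` range. The helper name `build_source_config` and the example filter regex are hypothetical and not part of OpenMetadata; only the parameter names and default values come from the table above.

```python
import json

# Defaults copied from the parameter table; filter patterns have no default.
DEFAULTS = {
    "type": "AutoClassification",
    "includeViews": True,
    "useFqnForFiltering": False,
    "storeSampleData": True,
    "enableAutoClassification": False,
    "confidence": 80,
    "sampleDataCount": 50,
}


def build_source_config(overrides: dict) -> dict:
    """Hypothetical helper: overlay user overrides on the documented defaults."""
    config = {**DEFAULTS, **overrides}
    if not 0 <= config["confidence"] <= 100:
        raise ValueError("confidence must be between 0 and 100")
    return config


config = build_source_config(
    {
        "enableAutoClassification": True,
        "confidence": 90,
        "tableFilterPattern": {"includes": ["^customer_.*"]},  # example regex
    }
)
# Print the merged configuration as JSON.
print(json.dumps(config, indent=2))
```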

## Key Parameters Explained

### `enableAutoClassification`

* Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
* Applies pattern recognition and tagging based on predefined criteria.

### `confidence`

* Confidence level for tagging sensitive columns:
  * A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  * A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
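
The trade-off can be pictured with a small threshold sketch. The scores below are invented for illustration, and the comparison rule (tag a column when its score is at least the threshold) is an assumption about how the setting behaves, not a description of the actual classifier.

```python
# Hypothetical detector output: column name -> confidence score (0-100).
# These scores are made up for illustration.
SCORES = {"email": 95, "full_name": 85, "notes": 72, "order_id": 10}


def columns_to_tag(scores: dict, confidence: int) -> list:
    """Return columns whose score meets or exceeds the confidence threshold."""
    return sorted(name for name, score in scores.items() if score >= confidence)


print(columns_to_tag(SCORES, 90))  # ['email']
print(columns_to_tag(SCORES, 70))  # ['email', 'full_name', 'notes']
```

With the stricter threshold only the strongest candidate is tagged; lowering it picks up `full_name` but also the free-text `notes` column, which may be a false positive.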

### `storeSampleData`

* Controls whether sample rows are stored during ingestion.
* If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`

* When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
* When set to `false`, filtering applies only to raw table names.
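
The practical consequence is that a pattern written for raw names may stop matching once FQN filtering is turned on. The sketch below uses a simplified include filter (`is_included` is a hypothetical helper, and matching from the start of the string with `re.match` is an assumption) to show the difference.

```python
import re


def is_included(name: str, includes: list) -> bool:
    """Simplified include filter: keep the name if any pattern matches from its start."""
    return any(re.match(pattern, name) for pattern in includes)


table_name = "abc"
table_fqn = "local_bigquery.hello-world-1234.super_schema.abc"
raw_pattern = ["^abc$"]

# useFqnForFiltering: false -> the pattern is tested against the raw table name.
print(is_included(table_name, raw_pattern))  # True
# useFqnForFiltering: true -> the same pattern is tested against the FQN.
print(is_included(table_fqn, raw_pattern))  # False
# A pattern written with the FQN in mind matches again.
print(is_included(table_fqn, [r".*\.super_schema\.abc$"]))  # True
```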

## Auto Classification Workflow Execution

To execute the **Auto Classification Workflow**, follow the steps below:

### 1. Install the Required Python Package

Ensure you have the correct OpenMetadata ingestion package installed, including the **PII Processor** module:

```bash theme={null}
pip install "openmetadata-ingestion[pii-processor]"
```

### 2. Define and Execute the Python Workflow

Instead of running the YAML file through the CLI, you can trigger the process programmatically with the `AutoClassificationWorkflow` class from OpenMetadata, passing it a configuration such as the sample below.

### Sample Auto Classification Workflow YAML

```yaml theme={null}
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO  # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```
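
A minimal sketch of the programmatic run follows, assuming the YAML above has already been parsed into a dictionary. The import path `metadata.workflow.classification` and the method sequence are assumptions based on recent `openmetadata-ingestion` releases and should be checked against your installed version. `check_config` and `run` are hypothetical helper names, and the list of required sections simply mirrors the sample above.

```python
# Top-level sections that appear in the sample YAML above.
REQUIRED_SECTIONS = ("source", "processor", "sink", "workflowConfig")


def check_config(config: dict) -> None:
    """Fail early if a top-level section is missing or the pipeline type is wrong."""
    missing = [key for key in REQUIRED_SECTIONS if key not in config]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    pipeline_type = config["source"]["sourceConfig"]["config"]["type"]
    if pipeline_type != "AutoClassification":
        raise ValueError(f"unexpected pipeline type: {pipeline_type}")


def run(config: dict) -> None:
    """Run the workflow from an already-parsed config dictionary."""
    check_config(config)
    # Assumed import path; imported lazily so check_config works without the package.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()  # surface failures as exceptions
    workflow.print_status()
    workflow.stop()
```

The dictionary would typically come from parsing the YAML file, for example with `yaml.safe_load` from PyYAML, before calling `run(config)`.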

### 3. Expected Outcome

* Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
* Improves metadata enrichment and enhances data governance practices.
* Provides visibility into sensitive data across databases.

This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.

## Auto Classification

The Auto Classification workflow uses the `orm-profiler` processor.

After running a Metadata Ingestion workflow, we can run the Auto Classification workflow.
The `serviceName` must be the same one used in Metadata Ingestion, so that the ingestion bot can get the `serviceConnection` details from the server.

### 1. Define the YAML Config

This is a sample config for the Auto Classification Workflow:

<CodePreview>
  <ContentPanel>
    <ContentSection id={1} title="Source Configuration" lines="1-3">
      Configure the source type and service name for your auto classification workflow.
    </ContentSection>

    <ContentSection id={2} title="Auto Classification Config Type" lines="4-6">
      **type**: Set to `AutoClassification` for automatic PII tagging.
    </ContentSection>

    <ContentSection id={3} title="Store Sample Data" lines="7">
      **storeSampleData**: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.
    </ContentSection>

    <ContentSection id={4} title="Enable Auto Classification" lines="8">
      **enableAutoClassification**: Optional configuration to automatically tag columns that might contain sensitive information.
    </ContentSection>

    <ContentSection id={5} title="Confidence" lines="9">
      **confidence**: Set the confidence value at which a column is tagged as PII. The value ranges from 0 to 100. A higher number yields fewer false positives but more false negatives. A lower number yields more false positives but fewer false negatives.
    </ContentSection>

    <ContentSection id={6} title="Database Filter Pattern" lines="10-15">
      **databaseFilterPattern**: Regex to only fetch databases that match the pattern.
    </ContentSection>

    <ContentSection id={7} title="Schema Filter Pattern" lines="16-21">
      **schemaFilterPattern**: Regex to only fetch schemas that match the pattern.
    </ContentSection>

    <ContentSection id={8} title="Table Filter Pattern" lines="22-27">
      **tableFilterPattern**: Regex to only fetch tables that match the pattern.
    </ContentSection>

    <ContentSection id={9} title="Processor Configuration" lines="28-30">
      Choose the `orm-profiler`. Its config can also be updated to define tests from the YAML itself instead of the UI.

      **tableConfig**: `tableConfig` allows you to set up some configuration at the table level.
    </ContentSection>

    <ContentSection id={10} title="Sink Configuration" lines="31-33">
      To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
    </ContentSection>
  </ContentPanel>

  <CodePanel fileName="{connector}_auto_classification.yaml">
    ```yaml theme={null}
    source:
      type: {connector}
      serviceName: {connector}
      sourceConfig:
        config:
          type: AutoClassification
          # storeSampleData: true
          # enableAutoClassification: true
          # confidence: 80
          # databaseFilterPattern:
          #   includes:
          #     - database1
          #     - database2
          #   excludes:
          #     - database3
          # schemaFilterPattern:
          #   includes:
          #     - schema1
          #     - schema2
          #   excludes:
          #     - schema3
          # tableFilterPattern:
          #   includes:
          #     - table1
          #     - table2
          #   excludes:
          #     - table3
    processor:
      type: orm-profiler
      config: {}
    sink:
      type: metadata-rest
      config: {}
    ```
  </CodePanel>
</CodePreview>

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

```bash theme={null}
metadata classify -c <path-to-yaml>
```

<Tip>
  Now instead of running `ingest`, we are using the `classify` command to select the Auto Classification workflow.
</Tip>
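
When the CLI is triggered from a scheduler or a script, the same command can be wrapped in a few lines of Python. This is only a sketch: `classify_command` and `run_classify` are hypothetical helper names, the file name is an example, and the `metadata` executable is assumed to be on the `PATH`.

```python
import subprocess


def classify_command(config_path: str) -> list:
    """Build the argument list for the `metadata classify` CLI call."""
    return ["metadata", "classify", "-c", config_path]


def run_classify(config_path: str) -> int:
    """Run the CLI and return its exit code (requires `metadata` on PATH)."""
    completed = subprocess.run(classify_command(config_path), check=False)
    return completed.returncode


# Show the command that would be executed for an example config file.
print(classify_command("bigquery_auto_classification.yaml"))
```

Passing the arguments as a list avoids shell quoting problems when the config path contains spaces.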

## Workflow Execution

### To Execute the Auto Classification Workflow:

1. **Create a Pipeline**
   * Configure the Auto Classification JSON as demonstrated in the provided configuration example.

2. **Run the Ingestion Pipeline**
   * Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**
   * Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.

### Expected Outcomes

* **Automatic Tagging:**
  Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.

* **Enhanced Visibility:**
  Gain improved visibility and classification of sensitive data within your databases.

* **Sample Data Integration:**
  Store sample data to provide better insights during profiling and testing workflows.
