Auto Classification Workflow Configuration
The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the Service Classification Pipeline JSON.
Pipeline Configuration Parameters
Parameter | Description | Type | Default Value |
---|---|---|---|
type | Specifies the pipeline type. | String | AutoClassification |
classificationFilterPattern | Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
schemaFilterPattern | Regex to fetch schemas matching the specified pattern. | Object | N/A |
tableFilterPattern | Regex to exclude tables matching the specified pattern. | Object | N/A |
databaseFilterPattern | Regex to fetch databases matching the specified pattern. | Object | N/A |
includeViews | Option to include or exclude views during metadata ingestion. | Boolean | true |
useFqnForFiltering | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | false |
storeSampleData | Option to enable or disable storing sample data for each table. | Boolean | true |
enableAutoClassification | Enables automatic tagging of columns that might contain sensitive information. | Boolean | false |
confidence | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | 80 |
sampleDataCount | Number of sample rows to ingest when Store Sample Data is enabled. | Integer | 50 |
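As an illustration of how these parameters fit together, the sketch below collects the defaults from the table into a Python dictionary and serializes it as JSON. The helper `with_overrides` is hypothetical and only for illustration; the filter patterns are omitted because the table lists no default for them.

```python
import json

# Defaults copied from the parameter table above.
source_config_defaults = {
    "type": "AutoClassification",
    "includeViews": True,
    "useFqnForFiltering": False,
    "storeSampleData": True,
    "enableAutoClassification": False,
    "confidence": 80,
    "sampleDataCount": 50,
}


def with_overrides(defaults, **overrides):
    """Return a copy of the defaults with the given parameters overridden.

    Hypothetical helper: rejects names that are not in the defaults.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


# Turn on auto classification and ask for a stricter confidence level.
config = with_overrides(
    source_config_defaults, enableAutoClassification=True, confidence=90
)
print(json.dumps(config, indent=2))
```

Rejecting unknown keys is a design choice of this sketch: it catches a misspelled parameter name before the configuration is used.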
Key Parameters Explained
enableAutoClassification
- Set this to true to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.
confidence
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., 90) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., 70) identifies more sensitive columns but may result in false positives.
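The trade-off can be pictured as a simple threshold. The sketch below is not the actual classifier: the scores are made up, and it assumes a column is tagged when its score is at or above the configured confidence. The function `columns_to_tag` is hypothetical.

```python
def columns_to_tag(scores, confidence=80):
    """Return the column names whose score meets the confidence threshold.

    `scores` maps column name -> hypothetical detector score on a 0-100 scale.
    """
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    return sorted(name for name, score in scores.items() if score >= confidence)


# Made-up scores, purely for illustration.
scores = {"email": 95, "full_name": 85, "notes": 72, "order_id": 10}

print(columns_to_tag(scores, confidence=90))  # ['email']
print(columns_to_tag(scores, confidence=70))  # ['email', 'full_name', 'notes']
```

With the higher threshold only the strongest candidate is tagged. With the lower one, a free-text column such as `notes` is tagged as well, which may or may not be a false positive.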
storeSampleData
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (sampleDataCount) will be fetched for each table.
useFqnForFiltering
- When set to true, filtering patterns are applied to the Fully Qualified Name of a table (e.g., service_name.db_name.schema_name.table_name).
- When set to false, filtering applies only to raw table names.
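To see why this flag matters, consider a pattern that refers to the schema. The sketch below uses a hypothetical helper, `matches`, and assumes the regex is anchored at the start of the string (as with `re.match`); the real matching rules may differ.

```python
import re


def matches(pattern, table_name, fqn, use_fqn_for_filtering=False):
    """Apply the regex to the FQN or to the raw table name, depending on the flag."""
    target = fqn if use_fqn_for_filtering else table_name
    return re.match(pattern, target) is not None


fqn = "service_name.db_name.schema_name.table_name"
table = "table_name"

# A pattern written against the schema can only match when the FQN is used.
pattern = r".*\.schema_name\..*"
print(matches(pattern, table, fqn, use_fqn_for_filtering=True))   # True
print(matches(pattern, table, fqn, use_fqn_for_filtering=False))  # False
```

Patterns written for raw table names (for example `table_.*`) would stop matching once the flag is set to true, so patterns and flag should be changed together.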
Auto Classification Workflow Execution
To execute the Auto Classification Workflow, follow the steps below:
1. Install the Required Python Package
Ensure you have the correct OpenMetadata ingestion package installed, including the PII Processor module:
2. Define and Execute the Python Workflow
Instead of using a YAML configuration, use the AutoClassificationWorkflow from OpenMetadata to trigger the ingestion process programmatically.
Sample Auto Classification Workflow yaml
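A minimal sketch of such a programmatic run follows. Several things here are assumptions rather than confirmed details: the import path `metadata.workflow.classification`, the method names called on the workflow object, the `hostPort`, `authProvider`, and `securityConfig` keys, and every value in the configuration (service type, service name, host, token), which are placeholders. Check them against the version of the ingestion package you have installed.

```python
# Placeholder values throughout: replace the service type, service name,
# host, and token with those of your own installation.
workflow_config = {
    "source": {
        "type": "mysql",
        "serviceName": "my_database_service",
        "sourceConfig": {
            "config": {
                "type": "AutoClassification",
                "storeSampleData": True,
                "enableAutoClassification": True,
                "confidence": 80,
            }
        },
    },
    "processor": {"type": "orm-profiler", "config": {}},
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "loggerLevel": "INFO",
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "openmetadata",
            "securityConfig": {"jwtToken": "<your-jwt-token>"},
        },
    },
}


def run(config):
    """Create and execute the workflow from a configuration dictionary."""
    # Imported lazily so the config above can be inspected without the package.
    # Assumed import path; verify it for your installed version.
    from metadata.workflow.classification import AutoClassificationWorkflow

    workflow = AutoClassificationWorkflow.create(config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()
```

Calling `run(workflow_config)` would start the workflow; it requires the ingestion package and a reachable OpenMetadata server.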
3. Expected Outcome
- Automatically classifies and tags sensitive data based on predefined patterns and confidence levels.
- Improves metadata enrichment and enhances data governance practices.
- Provides visibility into sensitive data across databases.
This approach ensures that the Auto Classification Workflow is executed correctly using the appropriate OpenMetadata ingestion framework.
Auto Classification
The Auto Classification workflow uses the orm-profiler processor.
After running a Metadata Ingestion workflow, we can run the Auto Classification workflow. The serviceName must be the same as the one used in Metadata Ingestion, so that the ingestion bot can get the serviceConnection details from the server.
1. Define the YAML Config
This is a sample config for the Auto Classification Workflow:
Source Configuration - Source Config
You can find all the definitions and types for the sourceConfig here.
storeSampleData: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.
enableAutoClassification: Optional configuration to automatically tag columns that might contain sensitive information.
confidence: Set the confidence value at which a column should be tagged as PII. The value ranges from 0 to 100. A higher number yields fewer false positives but more false negatives. A lower number yields more false positives but fewer false negatives.
databaseFilterPattern: Regex to only fetch databases that match the pattern.
schemaFilterPattern: Regex to only fetch schemas that match the pattern.
tableFilterPattern: Regex to only fetch tables that match the pattern.
Processor Configuration
Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:
tableConfig: tableConfig allows you to set up some configuration at the table level.
Sink Configuration
To send the metadata to OpenMetadata, the sink needs to be specified as type: metadata-rest.
Workflow Configuration
The main property here is the openMetadataServerConfig, where you can define the host and security provider of your OpenMetadata installation.
Logger Level
You can specify the loggerLevel depending on your needs. If you are trying to troubleshoot an ingestion, running with DEBUG will give you far more traces for identifying issues.
JWT Token
JWT tokens allow your clients to authenticate against the OpenMetadata server. You can find more details on enabling JWT tokens here.
You can refer to the JWT Troubleshooting section for any issues in your JWT configuration.
Store Service Connection
If set to true (default), we will store the sensitive information either encrypted via the Fernet Key in the database or externally, if you have configured any Secrets Manager.
If set to false, the service will be created, but the service connection information will only be used by the Ingestion Framework at runtime, and won't be sent to the OpenMetadata server.
SSL Configuration
If you have added SSL to the OpenMetadata server, then you will need to handle the certificates when running the ingestion too. You can either set verifySSL to ignore, or have it as validate, which will require you to set the sslConfig.caCertificate with a local path where your ingestion runs that points to the server certificate file.
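The two options can be sketched as a small helper that builds the server configuration. The function `server_config` is hypothetical, the host, token, and certificate path are placeholders, and the `hostPort`, `authProvider`, and `securityConfig` keys are assumptions about the surrounding structure. Only `verifySSL` and `sslConfig.caCertificate` come from the description above.

```python
def server_config(host_port, jwt_token, ca_certificate=None):
    """Build an openMetadataServerConfig dictionary.

    Without a CA certificate path, certificate checks are ignored;
    with one, the server certificate is validated against it.
    """
    config = {
        "hostPort": host_port,
        "authProvider": "openmetadata",
        "securityConfig": {"jwtToken": jwt_token},
    }
    if ca_certificate is None:
        config["verifySSL"] = "ignore"
    else:
        config["verifySSL"] = "validate"
        config["sslConfig"] = {"caCertificate": ca_certificate}
    return config


# Placeholder host, token, and certificate path.
validated = server_config(
    "https://openmetadata.example.com/api",
    "<your-jwt-token>",
    ca_certificate="/path/to/server-cert.pem",
)
print(validated["verifySSL"])  # validate
```

Ignoring certificate checks is convenient for local testing, but validate is the safer choice whenever the certificate file is available where the ingestion runs.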
Find more information on how to troubleshoot SSL issues here.
ingestionPipelineFQN
Fully qualified name of ingestion pipeline, used to identify the current ingestion pipeline.
2. Run with the CLI
After saving the YAML config, we will run the command the same way we did for the metadata ingestion:
Now instead of running ingest, we are using the classify command to select the Auto Classification workflow.
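If the command needs to be triggered from a script, for example from a scheduler task, it could be wrapped as sketched below. This assumes the `metadata` CLI is on the PATH and that `classify` accepts the config path through `-c`, as `ingest` does. The function names are hypothetical.

```python
import subprocess


def classify_command(config_path):
    """Build the argument list for the classify command (assumed `-c` flag)."""
    return ["metadata", "classify", "-c", config_path]


def run_classification(config_path):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(classify_command(config_path), check=True)


# Building the command has no side effects, so it is safe to inspect.
print(classify_command("auto_classification.yaml"))
```

Passing the arguments as a list avoids shell quoting issues when the config path contains spaces.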
Workflow Execution
To execute the Auto Classification Workflow:
Create a Pipeline
- Configure the Auto Classification JSON as demonstrated in the provided configuration example.
Run the Ingestion Pipeline
- Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.
Validate Results
- Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.
Expected Outcomes
Automatic Tagging:
Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged based on predefined confidence levels.
Enhanced Visibility:
Gain improved visibility and classification of sensitive data within your databases.
Sample Data Integration:
Store sample data to provide better insights during profiling and testing workflows.