connectors

No menu items for this category

Profiler Workflow

Learn how to configure and run the Profiler Workflow to extract Profiler data and execute the Data Quality.

During data profiling for Datalake Profiling, we drop NaN (Not a Number) values from the DataFrame using the dropna() method. However, we make an exception for null values, which are retained. This ensures that our computations are accurate while handling missing data

After the metadata ingestion has been done correctly, we can configure and deploy the Profiler Workflow.

This Pipeline will be in charge of feeding the Profiler tab of the Table Entity, as well as running any tests configured in the Entity.

Table profile summary page

Table profile summary page

Column profile summary page

Column profile summary page

From the Service Page, go to the Ingestions tab to add a new ingestion and click on Add Profiler Ingestion.

Add a profiler service

Add a profiler service

Here you can enter the Profiler Ingestion details.

Set profiler configuration

Set profiler configuration

Name Define the name of the Profiler Workflow. While we only support a single workflow for the Metadata and Usage ingestion, users can define different schedules and filters for Profiler workflows.

As profiling is a costly task, this enables a fine-grained approach to profiling and running tests by specifying different filters for each pipeline.

Database filter pattern (Optional) regex expression to filter databases.

Schema filter pattern (Optional) regex expression to filter schemas.

Table filter pattern (Optional) regex expression to filter tables.

Profile Sample (Optional) Set the sample to be use by the profiler for the specific table.

  • Percentage: Value must be between 0 and 100 exclusive (0 < percentage < 100). This will sample the table based on a percentage
  • Row Count: The table will be sampled based on a number of rows (i.e. 1,000, 2,000), etc.

⚠️ This option is currently not support for Druid. Sampling leverage RANDOM functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.

Auto PII Tagging (Optional) Configuration to automatically tag columns that might contain sensitive information.

  • Confidence (Optional) If Auto PII Tagging is enable, this confidence level will determine the threshold to use for OpenMetadata's NLP model to consider a column as containing PII data.

Thread Count (Optional) Number of thread to use when computing metrics for the profiler. For Snowflake users we recommend setting it to 1. There is a known issue with one of the dependency (snowflake-connector-python) affecting projects with certain environments.

Timeout in Seconds (Optional) This will set the duration a profiling job against a table should wait before interrupting its execution and moving on to profiling the next table. It is important to note that the profiler will wait for the hanging query to terminiate before killing the execution. If there is a risk for your profiling job to hang, it is important to also set a query/connection timeout on your database engine. The default value for the profiler timeout is 12-hours.

Ingest Sample Data Whether the profiler should ingest sample data

After clicking Next, you will be redirected to the Scheduling form. This will be the same as the Metadata and Usage Ingestions. Select your desired schedule and click on Deploy to find the usage pipeline being added to the Service Ingestions.

Once you have created your profiler you can adjust some behavior at the table level by going to the table and clicking on the profiler tab

table profile settings

table profile settings

table profile settings

table profile settings

Profile Sample Set the sample to be use by the profiler for the specific table.

  • Percentage: Value must be between 0 and 100 exclusive (0 < percentage < 100). This will sample the table based on a percentage
  • Row Count: The table will be sampled based on a number of rows (i.e. 1,000, 2,000), etc.

⚠️ This option is currently not support for Druid. Sampling leverage RANDOM functions in most database (some have specific sampling functions) and Druid provides neither of these option. We recommend using the partitioning or sample query option if you need to limit the amount of data scanned.

Profile Sample Query Use a query to sample data for the profiler. This will overwrite any profle sample set.

Enable Column Profile This setting allows user to exclude or include specific columns and metrics from the profiler.

Note: for Google BigQuery tables partitioned on timestamp/datetime column type, month and year interval are not supported. You will need to set the Interval Unit to DAY or HOUR.

Enable Partition When enabled, the profiler will fetch the data based on your profiler settings. Note that if "profile sample" is set, this configuration will be used against the partitioned data and not the whole table.

  • Column Name: this is the name of the column that will be used as the partition field
  • Interval Type:
    • TIME-UNIT: a business logic timestamp/date/datetime (e.g. order date, sign up datetime, etc.)
    • INGESTION-TIME: a process logic timestamp/date/datetime (i.e. when was my data ingested in my table)
    • COLUMN-VALUE: a value representing a chunk of data (e.g. Product Type A, B, C, etc.)
    • INTEGER-RANGE: a range of integer that will be used as the partition (e.g. Customer ID between 1 and 10)

Once you have picked the Interval Type you will need to define the configuration specific to your Interval Type.

INGESTION-TIME or INTEGER-RANGE

  • Interval: the interval value (e.g. 1, 2, etc.)
  • Interval Unit:
    • HOUR
    • DAY
    • MONTH
    • YEAR

COLUMN-VALUE

  • Value: a list of value to use for the partitioning logic

INTEGER-RANGE

  • Start Range: the start of the range (inclusive)
  • End Range: the end of the range (inclusive)

In the connectors section we showcase how to run the metadata ingestion from a JSON file using the Airflow SDK or the CLI via metadata ingest. Running a profiler workflow is also possible using a JSON configuration file.

This is a good option if you which to execute your workflow via the Airflow SDK or using the CLI; if you use the CLI a profile workflow can be triggered with the command metadata profile -c FILENAME.yaml. The serviceConnection config will be specific to your connector (you can find more information in the connectors section), though the sourceConfig for the profiler will be similar across all connectors.

This is a sample config for the profiler:

You can find all the definitions and types for the sourceConfig here.

generateSampleData: Option to turn on/off generating sample data.

computeMetrics: Option to turn on/off computing profiler metrics. This flag is useful when you want to only ingest the sample data with the profiler workflow and not any other information.

profileSample: Percentage of data or no. of rows we want to execute the profiler and tests on.

threadCount: Number of threads to use during metric computations.

processPiiSensitive: Optional configuration to automatically tag columns that might contain sensitive information.

confidence: Set the Confidence value for which you want the column to be marked

timeoutSeconds: Profiler Timeout in Seconds

databaseFilterPattern: Regex to only fetch databases that matches the pattern.

schemaFilterPattern: Regex to only fetch tables or databases that matches the pattern.

tableFilterPattern: Regex to only fetch tables or databases that matches the pattern.

Choose the orm-profiler. Its config can also be updated to define tests from the YAML itself instead of the UI:

tableConfig: tableConfig allows you to set up some configuration at the table level.

To send the metadata to OpenMetadata, it needs to be specified as type: metadata-rest.

The main property here is the openMetadataServerConfig, where you can define the host and security provider of your OpenMetadata installation.

For a simple, local installation using our docker containers, this looks like:

filename.yaml
  • You can learn more about how to configure and run the Profiler Workflow to extract Profiler data and execute the Data Quality from here

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

Note now instead of running ingest, we are using the profile command to select the Profiler workflow.

When setting a profiler workflow it is important to keep in mind that queries will be running against your database. Depending on your database engine, you may incur costs (e.g., Google BigQuery, Snowflake). Execution time will also vary depending on your database engine computing power, the size of the table, and the number of columns. Given these elements, there are a few best practices we recommend you follow.

Profiling all the tables in your data platform might not be the most optimized approach. Profiled tables give an indication of the structure of the table, which is most useful for tables where this information is valuable (e.g., tables used by analysts or data scientists, etc.).

When setting up a profiler workflow, you have the possibility to filter out/in certain databases, schemas, or tables. Using this feature will greatly help you narrow down which table you want to profile.

On a table asset, you have the possibility to add a sample percentage/rows and a partitioning logic. Doing so will significantly reduce the amount of data scanned and the computing power required to perform the different operations.

For sampling, you can set a sampling percentage at the workflow level.

By default, the profiler will compute all the metrics against all the columns. This behavior can be fine-tuned only to include or exclude specific columns and specific metrics.

For example, excluding id columns will reduce the number of columns against which the metrics are computed.

If you have a large number of tables you would like to profile, setting up multiple workflows will help distribute the load. It is important though to monitor your instance CPU, and memory as having a large amount of workflow running simultaneously will require an adapted amount of resources.