Data Quality Overview
Learn how you can use OpenMetadata to define Data Quality tests and measure your data reliability.

Requirements

OpenMetadata (version 0.10 or later)

You must have a running deployment of OpenMetadata to use this guide. OpenMetadata includes the following services:
  • OpenMetadata server supporting the metadata APIs and user interface
  • Elasticsearch for metadata search and discovery
  • MySQL as the backing store for all metadata
  • Airflow for metadata ingestion workflows

Python (version 3.8.0 or later)

Please use the following command to check the version of Python you have.
python3 --version

Building Trust

OpenMetadata aims to be the place where all users share and collaborate around data. One of the main benefits of ingesting metadata into OpenMetadata is making assets discoverable.
However, we need to ask ourselves: what happens after a user stumbles upon our assets? We can help other teams use the data by adding proper descriptions, up-to-date information, and even examples of how to extract information properly.
What is imperative, though, is to build trust. For example, users might find a Table that looks useful for their use case, but how can they be sure it correctly follows its SLAs? What issues has this source undergone in the past? Data Quality & tests play a significant role in making any asset trustworthy. Showing the Entity information together with its reliability will help our users think, "This is safe to use".
This section will show you how to configure and run Data Profiling and Quality pipelines with the supported tests.

Data Profiling

Workflows

The Ingestion Framework currently supports two types of workflows:
  • Ingestion: Captures metadata from the sources and updates the Entities' instances. This is a lightweight process that can be scheduled for fast feedback on metadata changes in our sources. This workflow handles the metadata ingestion as well as the usage and lineage information from the sources, when available.
  • Profiling: Extracts metrics from SQL sources and sets up and runs Data Quality tests. It requires previous executions of the Ingestion pipeline. This is a more time-consuming workflow that runs metrics and compares their results against the configured tests of both Tables and Columns.
Note that you can set source.config.data_profiler_enabled to "true" or "false" in the ingestion pipelines to also run the profiler during the metadata ingestion, as sketched below. This, however, does not support Quality Tests.
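As a rough sketch, this is where that flag lives in the source section of the ingestion workflow JSON. The source type and connection values below are placeholders; use the options documented for your own connector:

"source": {
    "type": "mysql",
    "config": {
        "host_port": "<host:port>",
        "service_name": "<Service Name>",
        "data_profiler_enabled": "true"
    }
}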

Profiling Overview

Requirements

The source layer of the Profiling workflow is the OpenMetadata API. Based on the source configuration, this process lists the tables to be profiled.

Description

The steps of the Profiling pipeline are the following:
  1. First, use the source configuration to create a connection.
  2. Next, iterate over the selected tables and schemas that the Ingestion has previously recorded in OpenMetadata.
  3. Run a default set of metrics on all the table's columns. (We will add more customization in future releases.)
  4. Finally, compare the metrics' results against the configured Data Quality tests.
Note that all the results, both from the Profiling and from the test executions, are published to the OpenMetadata API. This allows users to review the evolution of the data and its reliability directly in the UI.
You can take a look at the supported metrics and tests here.

How to Add Tests

Tests are part of the Table Entity. We can add new tests to a Table from the UI or directly use the JSON configuration of the workflows.
Note that in order to add tests and run the Profiler workflow, the metadata should have already been ingested.

Add Tests in the UI

To create a new test, we can go to the Table page under the Data Quality tab:
Data Quality Tab in the Table Page
Clicking on Add Test will offer us two options: Table Test or Column Test. A Table Test runs on metrics from the whole table, such as the number of rows or columns, while Column Tests are specific to each column's values.

Add Table Tests

Adding a Table Test will show us the following view:
Add a Table Test
  • Test Type: Specifies the test we want to configure.
  • Description: Explains why the test is necessary and what scenarios we want to validate.
  • Value: Different tests will show different fields here. For example, tableColumnCountToEqual requires us to specify the number of columns we expect. Other tests take values such as min and max, while others require no value at all, such as tests validating that a column contains no nulls.
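For instance, the tableColumnCountToEqual case mentioned above would carry its expected column count in the test case configuration, roughly as follows. The columnCount key is an assumption based on the test schemas and may differ between releases, so double-check the supported tests reference:

"testCase": {
    "config": {
        "columnCount": 5
    },
    "tableTestType": "tableColumnCountToEqual"
}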

Add Column Tests

Adding a Column Test will have a similar view:
Add Column Test
The Column Test form is similar to the Table Test one. The only difference is the Column Name field, where we select the column the test will target.
You can review the supported tests here. We will keep expanding the support for new tests in the upcoming releases.
Once tests are added, we will be able to see them in the Data Quality tab:
Freshly created tests
Note how the tests are grouped into Table and Column tests. All tests from the same column are also grouped together. From this view, we can both edit and delete the tests if needed.
In the global Table information at the top, we will also be able to see how many Table Tests have been configured.

Add Tests with the JSON Config

In the connectors documentation for each source, we showcase how to run the Profiler Workflow using the Airflow SDK or the metadata CLI. When configuring the JSON configuration for the workflow, we can add tests as well.
Any tests added to the JSON configuration will also be reflected in the Data Quality tab. The same JSON configuration works with both the Airflow SDK and the metadata CLI.
You can find further information on how to prepare the JSON configuration for each of the sources. However, adding any number of tests is a matter of updating the processor configuration as follows:
"processor": {
    "type": "orm-profiler",
    "config": {
        "test_suite": {
            "name": "<Test Suite name>",
            "tests": [
                {
                    "table": "<Table FQN>",
                    "table_tests": [
                        {
                            "testCase": {
                                "config": {
                                    "value": 100
                                },
                                "tableTestType": "tableRowCountToEqual"
                            }
                        }
                    ],
                    "column_tests": [
                        {
                            "columnName": "<Column Name>",
                            "testCase": {
                                "config": {
                                    "minValue": 0,
                                    "maxValue": 99
                                },
                                "columnTestType": "columnValuesToBeBetween"
                            }
                        }
                    ]
                }
            ]
        }
    }
},
tests is a list of test definitions that will be applied to the table identified by its FQN. For each table, one can then define a list of table_tests and column_tests. Review the supported tests and their definitions to learn how to configure the different cases here.

How to Run Tests

Both the Profiler and Tests are executed in the Profiler Workflow. All the results will be available through the UI in the Profiler and Data Quality tabs.
Tests results in the Data Quality tab
To learn how to prepare and run the Profiler Workflow for a given source, you can take a look at the documentation for that specific connector.
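As a reference, here is a minimal Python sketch of such a run, assuming the openmetadata-ingestion package is installed and that workflow.json holds the full workflow configuration (source, processor, and sink); the exact module paths may vary between releases:

import json

from metadata.orm_profiler.api.workflow import ProfilerWorkflow

# Load the same JSON configuration used for the Airflow SDK or the CLI
with open("workflow.json") as f:
    workflow_config = json.load(f)

# Create and run the Profiler Workflow, then surface any failures
workflow = ProfilerWorkflow.create(workflow_config)
workflow.execute()
workflow.raise_from_status()
workflow.print_status()
workflow.stop()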

Where are the Tests stored?

Once you create a Test definition for a Table or any of its Columns, that Test becomes part of the Table Entity. This means that it does not matter where you create the tests from (JSON configuration vs. UI): once a test gets registered in OpenMetadata, it will always be executed as part of the Profiler Workflow.
You can check what tests an Entity has configured in the Data Quality tab of the UI, or by using the API:
from metadata.generated.schema.entity.data.table import Table
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.ingestion.ometa.openmetadata_rest import MetadataServerConfig

server_config = MetadataServerConfig(api_endpoint="http://localhost:8585/api")
metadata = OpenMetadata(server_config)

table = metadata.get_by_name(entity=Table, fqdn="FQDN", fields=["tests"])
You can then check table.tableTests, or column.columnTests for each Column, to get the test information.
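As a small illustration, iterating over the registered tests could look as follows. The attribute access below assumes the field names from the generated 0.10 Python models (tableTests on the Table, columnTests on each Column); adjust it to the models in your installed version:

# Table-level tests attached to the Table Entity (assumed model fields)
if table.tableTests:
    for table_test in table.tableTests:
        print(table_test.testCase.tableTestType)

# Column-level tests attached to each Column (assumed model fields)
for column in table.columns:
    if column.columnTests:
        for column_test in column.columnTests:
            print(column.name, column_test.testCase.columnTestType)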