
Data Quality

Learn how you can use OpenMetadata to define Data Quality tests and measure your data reliability.

You must have a running deployment of OpenMetadata to use this guide. OpenMetadata includes the following services:

  • OpenMetadata server supporting the metadata APIs and user interface
  • Elasticsearch for metadata search and discovery
  • MySQL as the backing store for all metadata
  • Airflow for metadata ingestion workflows

To deploy OpenMetadata, check out the deployment guide.


OpenMetadata is where all users share and collaborate around data. It is where you make your assets discoverable; with data quality you make these assets trustable.

This section will show you how to configure and run Data Quality pipelines with the OpenMetadata built-in tests.

Test Suites are logical containers that allow you to group related Test Cases from different tables.

Test Definitions are generic definition elements specific to a test, such as:

  • test name
  • column name
  • data type

Test Cases specify a Test Definition. They define the condition a test must meet to be successful (e.g. max=n, etc.). One Test Definition can be linked to multiple Test Cases.

Test Cases are the actual tests that will be run against your entity. This is where you will define the execution time and logic of these tests. Note: you will need to make sure you have the right permissions in OpenMetadata to create a test.

Navigate to the entity you want to add a test to (we currently support quality tests only for database entities). Go to the Profiler & Data Quality tab. From there, click the Add Test button in the upper right corner and select the type of test you want to implement.

Write your first test

Select the type of test you want to run and set the parameters (if any) for your test case. If you have selected a column test, you will need to select which column you want to execute your test against. Give it a name and then submit it.

Note: if you have a profiler workflow running, you will be able to visualize some context around your column or table data.

Write your first test

If it is the first test you are creating for this entity, you'll need to set an execution time. Click on the Add Ingestion button and select a schedule. Note that the time is shown in UTC.

Write your first test

Test Suites are logical containers that allow you to group related Test Cases from different tables. Note: you will need to make sure you have the right permissions in OpenMetadata to create a test suite.

From the vertical navigation bar, click on Quality and navigate to the By Test Suites tab. From there, click on the Add Test Suite button in the top right corner.

Write your first test

On the next page, enter the name and description (optional) of your test suite.

Create test suite

On the next page, you will be able to add existing test cases from different entities to your test suite, grouping them together in one place.

Note: Test Case name needs to be unique across the whole platform. A warning message will show if your Test Case name is not unique.

Create test case

When creating a YAML config for a test workflow, the source configuration is very simple.

The only sections you need to modify here are the serviceName (this name needs to be unique) and entityFullyQualifiedName (the entity for which we'll be executing tests against) keys.
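As a sketch, the source section could look like the following (the service name and fully qualified name are illustrative placeholders; substitute your own):

```yaml
source:
  type: TestSuite
  serviceName: my_test_suite_service          # must be unique across services
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: sample_mysql.default.shop.customers  # entity the tests run against
```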

Once you have defined your source configuration, you'll need to define the processor configuration.

The processor type should be set to "orm-test-runner". For accepted test definition names and parameter value names refer to the tests page.

Note that while you can define tests directly in this YAML configuration, running the workflow will execute ALL THE TESTS present in the table, regardless of what you are defining in the YAML.

This makes it easy for any user to contribute tests via the UI, while maintaining the test execution external.

You can keep your YAML config as simple as follows if the table already has tests.
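For instance, a minimal processor section (assuming the table already has tests defined via the UI) could be as small as:

```yaml
processor:
  type: "orm-test-runner"
  config: {}
```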

  • forceUpdate: if a test case already exists (based on the test case name) for the entity, this sets the strategy to follow when running the test (i.e. whether or not to update its parameters)
  • testCases: list of test cases to add to the referenced entity. Note that we will execute all the tests present in the table. Each entry contains:
      • name: test case name
      • testDefinitionName: test definition to use
      • columnName: only applies to column tests; the name of the column to run the test against
      • parameterValues: parameter values of the test
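Putting these keys together, a processor section that defines test cases directly in the YAML might look like the sketch below (test case names, the column name, and parameter values are illustrative; check the tests page for accepted test definition names):

```yaml
processor:
  type: "orm-test-runner"
  config:
    forceUpdate: false                          # do not overwrite existing test parameters
    testCases:
      - name: first_name_values_not_null       # must be unique across the platform
        testDefinitionName: columnValuesToBeNotNull
        columnName: first_name                  # columnName only applies to column tests
      - name: table_row_count_between
        testDefinitionName: tableRowCountToBeBetween
        parameterValues:
          - name: minValue
            value: 10
          - name: maxValue
            value: 100
```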

The sink and workflowConfig will have the same settings as the ingestion and profiler workflow.
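A sketch of those sections, assuming a local deployment (the hostPort and token are placeholders to adapt to your setup):

```yaml
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "{bot_jwt_token}"   # replace with your ingestion bot token
```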

To run the tests from the CLI execute the following command
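Assuming your workflow configuration is saved at a path of your choosing, the invocation looks like:

```
metadata test -c /path/to/workflow_config.yaml
```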

From the home page, click on the Quality menu item in the vertical navigation. This will bring you to the Quality page, where you'll be able to see your test cases grouped by:

  • entity
  • test suite
  • test cases

If you want to look at your tests grouped by Test Suites, navigate to the By Test Suites tab. This will bring you to the Test Suite page where you can select a specific Test Suite.

Test suite home page

From there you can select a Test Suite and visualize the results associated with this specific Test Suite.

Test suite results page

Navigate to your table and click on the Profiler & Data Quality tab. From there, you'll be able to see test results at the table or column level.

In the top panel, click on the white background Data Quality button. This will bring you to a summary of all your quality tests at the table level.

Test suite results table

In v1.1.0 we introduced the ability for users to flag the resolution status of failed test cases. When a test case fails, it will automatically be marked as New, indicating that a new failure has happened.

Test suite results table

The next step is for a user to mark the new failure as Ack (acknowledged), signifying that someone is looking into the test failure resolution. When hovering over the resolution status, users will be able to see the time (UTC) and the user who acknowledged the failure.

Test suite results table

Then the user is able to mark a test as Resolved. We made it mandatory for users to 1) select a reason and 2) add a comment when resolving a failed test so that knowledge is maintained inside the platform.

Test suite results table

Test suite results table

Test suite results table

Test suite results table