OpenMetadata Documentation

DataFrame Validation

The DataFrameValidator class enables you to validate pandas DataFrames directly within your ETL workflows, before data reaches its destination. This allows you to catch data quality issues early, preventing bad data from contaminating your data warehouse or analytics systems.

DataFrame validation is ideal for:

  • Validating transformed data before loading to destinations
  • Processing large datasets in chunks with memory efficiency
  • Short-circuiting ETL pipelines on validation failures
  • Providing immediate feedback during data transformations
  • Publishing validation results back to OpenMetadata

Add test definitions to validate your DataFrame:
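Since the original code sample is not shown here, the sketch below illustrates the idea with plain pandas and hypothetical test-definition names and shapes — the real SDK's definition objects and identifiers may differ:

```python
import pandas as pd

# Hypothetical test definitions (names and structure are illustrative,
# not the SDK's real objects).
test_definitions = [
    {"test": "columnValuesToNotBeNull", "column": "customer_id"},
    {"test": "columnValuesToBeBetween", "column": "amount",
     "min": 0, "max": 10_000},
]

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [250.0, 9_999.0, 15.5],
})

def check(df: pd.DataFrame, definition: dict) -> bool:
    """Evaluate one definition against the DataFrame (sketch only)."""
    col = df[definition["column"]]
    if definition["test"] == "columnValuesToNotBeNull":
        return bool(col.notna().all())
    if definition["test"] == "columnValuesToBeBetween":
        return bool(col.between(definition["min"], definition["max"]).all())
    raise ValueError(f"unknown test {definition['test']}")

results = {d["test"]: check(df, d) for d in test_definitions}
```

Each definition names a test type, a target column, and any bounds; the validator evaluates them all against the DataFrame.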

Here's a complete example of validating transformed data in an ETL pipeline:
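The original example is not reproduced here, so the following is a minimal end-to-end sketch of the pattern — extract, transform, validate, then load — with the validation step written as a plain-pandas stand-in (the real `DataFrameValidator` API may differ):

```python
import pandas as pd

class ValidationError(Exception):
    """Raised to short-circuit the pipeline on a failed check."""

def extract() -> pd.DataFrame:
    # Stand-in for reading from a source system.
    return pd.DataFrame({"order_id": [1, 2, 3], "total": [10.0, -5.0, 7.5]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop refunds (negative totals) before loading.
    return df[df["total"] >= 0].reset_index(drop=True)

def validate(df: pd.DataFrame) -> None:
    if df["order_id"].isna().any():
        raise ValidationError("order_id contains nulls")
    if (df["total"] < 0).any():
        raise ValidationError("total contains negative values")

loaded = []  # stand-in destination

def load(df: pd.DataFrame) -> None:
    loaded.append(df)

df = transform(extract())
validate(df)  # raises ValidationError on bad data, halting the run
load(df)
```

Because `validate()` raises before `load()` runs, bad data never reaches the destination.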

Instead of defining tests in code, load tests that are configured in OpenMetadata:
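As a sketch of that workflow, the snippet below fetches test cases by table fully qualified name and runs them locally. `fetch_test_cases` is a hypothetical stand-in for the SDK call that retrieves tests configured in OpenMetadata; the real client API may differ:

```python
import pandas as pd

def fetch_test_cases(table_fqn: str) -> list:
    # Hypothetical: a real pipeline would query the OpenMetadata server
    # for the test cases attached to this table.
    return [
        {"test": "columnValuesToNotBeNull", "column": "email"},
    ]

def run_test(df: pd.DataFrame, case: dict) -> bool:
    if case["test"] == "columnValuesToNotBeNull":
        return bool(df[case["column"]].notna().all())
    raise ValueError(f"unsupported test {case['test']}")

df = pd.DataFrame({"email": ["a@x.io", "b@x.io"]})
cases = fetch_test_cases("postgres.sales.public.customers")
passed = all(run_test(df, c) for c in cases)
```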

This approach enables:

  • Separation of concerns: Data stewards define quality criteria in the UI; engineers execute them in code
  • Dynamic test updates: Test criteria changes don't require code deployments
  • Consistency: Same tests used for table validation and DataFrame validation

For large datasets that don't fit in memory, validate data in chunks:
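A minimal sketch of the chunked pattern, using a plain generator in place of the SDK's chunk handling: stream fixed-size slices and stop at the first failing chunk so only a bounded amount of data is ever in memory.

```python
import pandas as pd

def chunks(df: pd.DataFrame, size: int):
    """Yield fixed-size slices of the DataFrame."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

df = pd.DataFrame({"qty": range(10)})  # stand-in for a large dataset

checked = 0
for chunk in chunks(df, size=4):
    # Column-level check that is safe to run per chunk.
    if (chunk["qty"] < 0).any():
        raise ValueError("negative qty found; aborting load")
    checked += len(chunk)
```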

The run() method provides a cleaner approach with automatic chunk handling:
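The `run()` call below is a hypothetical stand-in mirroring the described behavior — it iterates chunks internally and short-circuits on the first failure; the real `DataFrameValidator.run()` signature may differ:

```python
import pandas as pd

class SketchValidator:
    """Illustrative validator; not the SDK class."""

    def __init__(self, checks):
        self.checks = checks  # list of callables: DataFrame -> bool

    def run(self, chunk_iter) -> bool:
        for chunk in chunk_iter:
            if not all(check(chunk) for check in self.checks):
                return False  # short-circuit on first failing chunk
        return True

df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]})
chunk_iter = (df.iloc[i:i + 2] for i in range(0, len(df), 2))

validator = SketchValidator(checks=[lambda c: bool((c["price"] > 0).all())])
ok = validator.run(chunk_iter)
```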

Use a context manager to ensure atomic transactions:
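The sketch below shows the transactional idea with a stdlib context manager: chunks are staged inside the `with` block and committed only if every chunk passed, so a failure leaves the destination untouched. The `transaction` helper here is illustrative, not the SDK's API:

```python
import pandas as pd
from contextlib import contextmanager

committed = []  # stand-in destination

@contextmanager
def transaction():
    staged = []
    yield staged              # chunks are staged inside the with-block
    committed.extend(staged)  # reached only if no exception was raised

df = pd.DataFrame({"amount": [5, 10, 15, 20]})

with transaction() as staged:
    for start in range(0, len(df), 2):
        chunk = df.iloc[start:start + 2]
        if (chunk["amount"] < 0).any():
            raise ValueError("validation failed; nothing is committed")
        staged.append(chunk)
```

If any chunk fails, the exception propagates out of the `with` block before the commit line runs — all-or-nothing behavior.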

As of version 1.11.0.0 of the SDK, DataFrameValidator supports only one failure mode: short circuit.

Future versions will add modes to report failing rows or skip failing batches.

After validation, publish results back to OpenMetadata for tracking and alerting:
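As a sketch of the publishing step: build a result payload and hand it to the client. `publish_test_result` is a hypothetical stand-in for the SDK call that writes results to the OpenMetadata server; the real payload schema and client method may differ:

```python
from datetime import datetime, timezone

sink = []  # stand-in for the OpenMetadata server

def publish_test_result(payload: dict) -> None:
    # Hypothetical: real code would send this to OpenMetadata.
    sink.append(payload)

result = {
    "testCase": "columnValuesToNotBeNull",
    "status": "Success",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
publish_test_result(result)
```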

This enables:

  • Historical tracking of data quality trends
  • Alerting on validation failures
  • Visualization in OpenMetadata UI
  • Centralized data quality reporting

When using chunk-based validation, be aware of tests that require the full dataset.

Some tests analyze the entire dataset and may produce incorrect results when run on chunks:

  • TableRowCountToBeBetween: Counts rows in each chunk, not the full dataset
  • TableRowCountToEqual: Validates chunk size, not full dataset size
  • ColumnValuesSumToBeBetween: Sums values per chunk, not across all data
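
To see why, here is a plain-pandas illustration of the row-count case: each chunk reports its own count, so a bound that holds for the full dataset can fail for every individual chunk.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})
chunk_counts = [len(df.iloc[i:i + 4]) for i in range(0, len(df), 4)]

# A bound like "row count between 8 and 12" holds for the full dataset
# but fails for every individual chunk.
full_ok = 8 <= len(df) <= 12
chunk_ok = [8 <= n <= 12 for n in chunk_counts]
```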

The SDK will issue a warning when such tests are detected during chunk-based validation.

For datasets that don't fit in memory and require full-table tests:

  1. Use TestRunner to validate after loading
  2. Focus DataFrame validation on column-level tests that do not require aggregation
  3. Split validation into two phases:
    • During ETL: Validate column-level quality with DataFrameValidator
    • After loading: Validate table-level metrics with TestRunner

Example two-phase approach:
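Since the original example is not shown, the sketch below illustrates the two phases with plain pandas: per-chunk column checks during ETL, then a table-level row-count check after loading (phase 2 stands in for a TestRunner run against the loaded table):

```python
import pandas as pd

df = pd.DataFrame({"id": range(6), "value": [1, 2, 3, 4, 5, 6]})
warehouse = []  # stand-in destination

# Phase 1: column-level validation per chunk during ETL.
for start in range(0, len(df), 3):
    chunk = df.iloc[start:start + 3]
    if chunk["id"].isna().any():
        raise ValueError("phase 1 failed: id contains nulls")
    warehouse.append(chunk)

# Phase 2: table-level validation after loading (full row count).
loaded = pd.concat(warehouse, ignore_index=True)
row_count_ok = 5 <= len(loaded) <= 10
```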

  1. Validate before loading: Catch issues before contaminating your warehouse

  2. Use transactional chunk processing: Ensure atomic all-or-nothing behavior

  3. Leverage OpenMetadata tests: Let data stewards define quality criteria

  4. Publish results: Enable tracking and alerting

  5. Handle failures gracefully: Don't silently fail

  6. Use appropriate tests for chunks: Avoid full-table tests when processing chunks

Handle validation errors appropriately:
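A minimal error-handling sketch: log the failure and halt the pipeline rather than swallowing the error and loading bad data. `ValidationError` is illustrative, not an SDK class:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

class ValidationError(Exception):
    pass

def validate(df: pd.DataFrame) -> None:
    if df["amount"].isna().any():
        raise ValidationError("amount contains nulls")

df = pd.DataFrame({"amount": [1.0, None]})
halted = False
try:
    validate(df)
except ValidationError as exc:
    log.error("validation failed: %s", exc)
    halted = True  # in a real job: exit non-zero / re-raise to the orchestrator
```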