Chunk-Based Validation
For large datasets that don’t fit in memory, validate the data in chunks.
Method 1: Manual Chunk Validation
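A minimal sketch of manual chunk validation, assuming pandas is used to read the data. The function `validate_chunk` is a hypothetical stand-in for the SDK validator (the actual DataFrameValidator calls are not reproduced here), and the in-memory CSV stands in for a file too large to load at once.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once.
CSV_DATA = "id,email\n1,a@example.com\n2,b@example.com\n3,\n4,d@example.com\n"


def validate_chunk(chunk: pd.DataFrame) -> bool:
    """Hypothetical stand-in for the SDK validator: passes if no email is null."""
    return bool(chunk["email"].notna().all())


def validate_in_chunks(source, chunksize: int) -> list:
    """Validate each chunk separately and return the per-chunk outcomes."""
    outcomes = []
    # Passing chunksize makes read_csv yield DataFrames of at most that many rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        outcomes.append(validate_chunk(chunk))
    return outcomes


outcomes = validate_in_chunks(io.StringIO(CSV_DATA), chunksize=2)
print(outcomes)
```

Only one chunk is held in memory at a time, so the peak memory use depends on `chunksize` rather than on the size of the file.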
Method 2: Using the run() Method
The run() method provides a cleaner approach with automatic chunk handling:
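The idea of delegating the chunk loop to a single call can be sketched as follows. This is not the SDK's `run()` signature: the helper `run_chunks` and its `validate`, `on_success`, and `on_failure` parameters are hypothetical names chosen for illustration, and chunks are modelled as plain lists of row dicts. The sketch stops at the first failing chunk, in the spirit of a short-circuit failure mode.

```python
from typing import Callable, Iterable, List

Chunk = List[dict]  # illustration only: a chunk is a list of row dicts


def run_chunks(
    chunks: Iterable[Chunk],
    validate: Callable[[Chunk], bool],
    on_success: Callable[[Chunk], None],
    on_failure: Callable[[Chunk], None],
) -> bool:
    """Hypothetical helper: validate chunks in order, stopping at the first failure."""
    for chunk in chunks:
        if validate(chunk):
            on_success(chunk)
        else:
            on_failure(chunk)
            return False  # short circuit: later chunks are never processed
    return True


loaded: List[dict] = []
rejected: List[dict] = []
chunks = [[{"id": 1}], [{"id": None}], [{"id": 3}]]

ok = run_chunks(
    chunks,
    validate=lambda chunk: all(row["id"] is not None for row in chunk),
    on_success=loaded.extend,
    on_failure=rejected.extend,
)
print(ok, loaded, rejected)
```

The caller supplies only the callbacks, so the iteration and stop-on-failure logic live in one place.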
Transaction-Safe Chunk Processing
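One way to get all-or-nothing loading is to wrap the whole chunk loop in a database transaction. The sketch below uses Python's standard `sqlite3` module, whose connection object acts as a context manager that commits on success and rolls back if an exception escapes the block. The table name `users`, the helper `load_chunks_atomically`, and the `is_valid` check are assumptions for illustration, not part of the SDK.

```python
import sqlite3


def load_chunks_atomically(conn, chunks, is_valid) -> bool:
    """Insert every chunk in one transaction; roll everything back on a failure."""
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            for chunk in chunks:
                if not is_valid(chunk):
                    raise ValueError("chunk failed validation")
                conn.executemany(
                    "INSERT INTO users (id, email) VALUES (?, ?)", chunk
                )
    except ValueError:
        return False
    return True


def no_null_emails(chunk) -> bool:
    """Hypothetical validation: every row must have an email."""
    return all(email is not None for _, email in chunk)


def count_rows(conn) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

good = [(1, "a@example.com"), (2, "b@example.com")]
bad = [(3, None)]

failed_ok = load_chunks_atomically(conn, [good, bad], no_null_emails)
rows_after_failure = count_rows(conn)  # the first chunk was rolled back too

passed_ok = load_chunks_atomically(conn, [good], no_null_emails)
rows_after_success = count_rows(conn)
print(failed_ok, rows_after_failure, passed_ok, rows_after_success)
```

Because the first chunk is inserted inside the same transaction as the failing one, it is discarded along with it and the target table is left untouched.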
Use a context manager so that chunk loads are atomic: either every chunk is committed or none is.
Failure Modes
As of version 1.11.0.0 of the SDK, DataFrameValidator supports only one failure mode: short circuit.
Working with Validation Results
Accessing Test Results
Merging Results from Multiple Chunks
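Conceptually, merging means combining the per-chunk outcomes of the same test into one overall outcome. The sketch below illustrates that idea with a hypothetical `ChunkResult` dataclass and a `merge_results` helper. Neither is an SDK class or function, and the fields `passed_rows` and `failed_rows` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChunkResult:
    """Hypothetical per-chunk outcome for a single test (not an SDK class)."""

    test_name: str
    passed_rows: int
    failed_rows: int

    @property
    def success(self) -> bool:
        return self.failed_rows == 0


def merge_results(results: Iterable[ChunkResult]) -> ChunkResult:
    """Combine per-chunk outcomes of one test into a single overall outcome."""
    items = list(results)
    if not items:
        raise ValueError("nothing to merge")
    if len({item.test_name for item in items}) != 1:
        raise ValueError("can only merge results of the same test")
    return ChunkResult(
        test_name=items[0].test_name,
        passed_rows=sum(item.passed_rows for item in items),
        failed_rows=sum(item.failed_rows for item in items),
    )


merged = merge_results(
    [
        ChunkResult("email_not_null", passed_rows=2, failed_rows=0),
        ChunkResult("email_not_null", passed_rows=1, failed_rows=1),
    ]
)
print(merged, merged.success)
```

Summing row counts works for row-level checks like this one; it would not be a valid way to merge tests that depend on the whole table.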
Publishing Results to OpenMetadata
After validation, publish results back to OpenMetadata for tracking and alerting. Publishing enables:
- Historical tracking of data quality trends
- Alerting on validation failures
- Visualization in OpenMetadata UI
- Centralized data quality reporting
Important Considerations for Chunk-Based Validation
When using chunk-based validation, be aware of tests that require the full dataset.
Tests That Require Full Table
Some tests analyze the entire dataset and may produce incorrect results when run on chunks:
- TableRowCountToBeBetween: Counts rows in each chunk, not the full dataset
- TableRowCountToEqual: Validates chunk size, not full dataset size
- ColumnValuesSumToBeBetween: Sums values per chunk, not across all data
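The effect can be reproduced with plain pandas, without the SDK. In this sketch the helper `row_count_between` is a hypothetical re-implementation of the idea behind a row-count test, not the SDK's test itself. A six-row table satisfies a "between 5 and 10 rows" check, while each two-row chunk fails it, and the per-chunk sums differ from the full-table sum.

```python
import pandas as pd

df = pd.DataFrame({"amount": [10, 20, 30, 40, 50, 60]})

# Split the table into chunks of two rows each.
chunks = [df.iloc[start:start + 2] for start in range(0, len(df), 2)]


def row_count_between(frame: pd.DataFrame, min_rows: int, max_rows: int) -> bool:
    """Hypothetical row-count check: passes if the frame's length is in range."""
    return min_rows <= len(frame) <= max_rows


# The same check gives different answers on the full table and on the chunks.
full_table_ok = row_count_between(df, 5, 10)
per_chunk_ok = [row_count_between(chunk, 5, 10) for chunk in chunks]

# Sums behave the same way: each chunk sees only its own slice of the data.
total_sum = int(df["amount"].sum())
chunk_sums = [int(chunk["amount"].sum()) for chunk in chunks]

print(full_table_ok, per_chunk_ok)
print(total_sum, chunk_sums)
```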
Recommended Approach
For datasets that don’t fit in memory and require full-table tests:
- Use TestRunner to validate after loading
- Focus DataFrame validation on column-level tests that do not require aggregation
- Split validation into two phases:
  - During ETL: Validate column-level quality with DataFrameValidator
  - After loading: Validate table-level metrics with TestRunner
Best Practices
- Validate before loading: Catch issues before contaminating your warehouse
- Use transactional chunk processing: Ensure atomic all-or-nothing behavior
- Leverage OpenMetadata tests: Let data stewards define quality criteria
- Publish results: Enable tracking and alerting
- Handle failures gracefully: Don’t silently fail
- Use appropriate tests for chunks: Avoid full-table tests when processing chunks
Error Handling
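A minimal sketch of failing loudly rather than silently, using only the standard library. The exception `ChunkValidationError` and the helpers `process_chunks` and `run_pipeline` are hypothetical names. The SDK's own exception types are not shown here, so this illustrates the pattern rather than the library's API.

```python
import logging

logger = logging.getLogger("chunk_validation")


class ChunkValidationError(Exception):
    """Hypothetical error raised when a chunk fails validation."""

    def __init__(self, chunk_index: int, reason: str):
        super().__init__(f"chunk {chunk_index}: {reason}")
        self.chunk_index = chunk_index


def process_chunks(chunks, is_valid, load) -> None:
    """Load chunks in order, raising as soon as one fails validation."""
    for index, chunk in enumerate(chunks):
        if not is_valid(chunk):
            raise ChunkValidationError(index, "validation tests failed")
        load(chunk)


def run_pipeline(chunks, is_valid, load) -> int:
    """Return a process-style exit code: 0 on success, 1 on validation failure."""
    try:
        process_chunks(chunks, is_valid, load)
    except ChunkValidationError as exc:
        # Record the failure and signal it to the caller instead of hiding it.
        logger.error("Stopping load: %s", exc)
        return 1
    return 0


loaded = []
exit_code = run_pipeline(
    chunks=[[1, 2], [3, None], [5, 6]],
    is_valid=lambda chunk: all(value is not None for value in chunk),
    load=loaded.append,
)
print(exit_code, loaded)
```

Returning a non-zero exit code lets a scheduler or orchestrator mark the run as failed, which is what makes the failure visible downstream.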
Handle validation errors explicitly rather than letting them pass silently.
Next Steps
- Review the Test Definitions Reference for all available tests
- Learn about TestRunner for validating tables after loading
- Explore Advanced Usage patterns and configurations