Sampler Module
The sampler module lives at ingestion/src/metadata/sampler/. It provides a unified interface for sampling data from tables across different database backends. Both the profiler and data quality modules depend on the sampler to get a representative subset of rows for computing metrics and running tests.
Directory Layout
How It Fits Together
The sampler sits between the data source and the profiler/data quality modules.
Core Abstraction
SamplerInterface (sampler_interface.py) is the abstract base class all samplers extend.
Key Methods
| Method | Purpose |
|---|---|
| create(...) | Factory method — creates a sampler with all dependencies |
| get_dataset() | Returns the sampled dataset (CTE for SQL, DataFrame for pandas) |
| fetch_sample_data(columns) | Fetches actual row data as TableData |
| generate_sample_data() | Full pipeline: fetch → truncate → optionally upload |
| get_columns() | Returns columns (respecting include/exclude filters) |
| raw_dataset | Abstract property — the unsampled table/DataFrame |
| close() | Cleans up connections |
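A minimal sketch of this contract, assuming hypothetical signatures (the real class takes connection and configuration dependencies not shown here, and SAMPLE_LIMIT is an illustrative stand-in for the truncation step):

```python
from abc import ABC, abstractmethod

# Hedged sketch of the SamplerInterface contract summarized in the table above.
# Method names follow the table; signatures and SAMPLE_LIMIT are assumptions.
class SamplerInterface(ABC):
    SAMPLE_LIMIT = 50  # illustrative truncation bound

    @classmethod
    def create(cls, *args, **kwargs):
        """Factory method: build a sampler with all dependencies wired in."""
        return cls(*args, **kwargs)

    @property
    @abstractmethod
    def raw_dataset(self):
        """The unsampled table (SQL) or DataFrame (pandas)."""

    @abstractmethod
    def get_dataset(self):
        """Return the sampled dataset (CTE for SQL, DataFrame for pandas)."""

    @abstractmethod
    def fetch_sample_data(self, columns=None):
        """Fetch actual row data."""

    def generate_sample_data(self):
        """Template method: fetch, then truncate (upload step omitted here)."""
        data = self.fetch_sample_data()
        return data[:self.SAMPLE_LIMIT]

    def close(self):
        """Clean up connections (no-op by default in this sketch)."""
```

Concrete samplers fill in the abstract pieces while generate_sample_data() stays shared — the template-method pattern noted in the design-patterns table below.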
Constructor Parameters
Data Models
Defined in models.py.
Sampling Strategies
Percentage-Based Sampling
The default strategy. Takes X% of rows from the table, with randomization enabled by default.
Row-Count-Based Sampling
Takes exactly N rows, with randomization.
Database-Specific Overrides
Each database sampler can override set_tablesample() to use the database’s native sampling:
| Database | Sampling Method | Notes |
|---|---|---|
| PostgreSQL | BERNOULLI or SYSTEM | Configurable via samplingMethodType |
| BigQuery | SYSTEM (X PERCENT) | No TABLESAMPLE for views |
| Snowflake | BERNOULLI, SYSTEM, or ROW(N ROWS) | Row-based sampling supported |
| SQL Server / Azure SQL | X PERCENT or X ROWS | No TABLESAMPLE for views |
| Databricks | CTE-based (no native TABLESAMPLE) | Array column slicing to prevent OOM |
| Trino | CTE-based | NaN filtering for float columns |
| TimescaleDB | PostgreSQL BERNOULLI/SYSTEM | Restricts to uncompressed chunks only |
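As a rough illustration of what these overrides produce, here is a hypothetical plain-string clause builder; it is only a sketch of the per-dialect syntax, since the real set_tablesample() overrides operate on SQLAlchemy table expressions rather than strings:

```python
# Illustrative per-dialect sampling clauses matching the table above.
# The function name and signature are assumptions for this sketch.
def tablesample_clause(dialect, percent, method="BERNOULLI"):
    if dialect == "postgresql":
        return f"TABLESAMPLE {method} ({percent})"
    if dialect == "bigquery":
        return f"TABLESAMPLE SYSTEM ({percent} PERCENT)"
    if dialect == "snowflake":
        return f"TABLESAMPLE {method} ({percent})"
    if dialect == "mssql":
        return f"TABLESAMPLE ({percent} PERCENT)"
    # Databricks and Trino use CTE-based sampling: no native clause
    return ""
```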
Partition Handling
partition.py detects and configures partition filtering so the sampler reads only the relevant partitions.
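As a rough illustration of the idea (the column/values shape below is an assumption for this sketch, not the real partition.py model):

```python
# Hypothetical sketch: build a WHERE predicate that limits the sampled
# scan to selected partition values, so irrelevant partitions are skipped.
def partition_filter(partition_col, values):
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{partition_col} IN ({quoted})"
```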
Configuration Resolution
config.py provides hierarchical config lookup (table → schema → database → default).
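The resolution order can be sketched as a first-match walk from most to least specific; the dict-based scopes here are an assumption, since the real config.py helpers work on OpenMetadata configuration objects:

```python
# Hedged sketch of hierarchical resolution: table -> schema -> database -> default.
def resolve_sampling_config(table_cfg, schema_cfg, database_cfg, default_cfg, key):
    """Return the first non-None value, walking from most to least specific."""
    for scope in (table_cfg, schema_cfg, database_cfg, default_cfg):
        value = (scope or {}).get(key)
        if value is not None:
            return value
    return None
```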
Notable Database-Specific Behavior
TimescaleDB — Compressed Chunk Awareness
TimescaleDB compresses older data into compressed chunks; decompressing them during profiling would be extremely expensive. The TimescaleDB sampler:
- Queries TimescaleDB metadata to find the boundary between compressed and uncompressed chunks
- Adds a WHERE time_col >= uncompressed_boundary filter to the sample query
- Samples only from uncompressed (recent) data
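The composition step can be sketched as follows; the function name and the way the boundary is supplied are assumptions, since the real sampler derives the boundary from TimescaleDB's chunk metadata:

```python
# Hypothetical sketch: restrict the sample query to rows newer than the
# last compressed chunk, so profiling never triggers decompression.
def with_uncompressed_filter(base_query, time_col, boundary):
    return f"{base_query} WHERE {time_col} >= '{boundary}'"
```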
Databricks — Array Column Slicing
Large array columns can cause OOM errors. The Databricks sampler:
- Detects CustomArray-typed columns
- Replaces them with slice(col, 1, N) in the SELECT to limit array elements
- Converts numpy arrays back to Python lists in the results
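The SELECT rewrite can be sketched like this; the column-metadata shape is an assumption, and only the slice() expression (Databricks' 1-indexed array slicing function) comes from the description above:

```python
# Hypothetical sketch: array columns are selected through slice(col, 1, N)
# so only the first N elements are fetched, avoiding OOM on huge arrays.
def build_select(columns, array_columns, max_elements=100):
    parts = []
    for col in columns:
        if col in array_columns:
            parts.append(f"slice({col}, 1, {max_elements}) AS {col}")
        else:
            parts.append(col)
    return "SELECT " + ", ".join(parts)
```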
BigQuery — Struct Column Handling
BigQuery struct columns (nested fields like address.city) require special handling:
- Detects struct columns via _handle_struct_columns()
- Builds queries that properly reference nested fields
- Handles project ID correction from the entity database name
Trino — NaN Filtering
Trino float columns can contain NaN values that break downstream processing. The Trino sampler:
- Identifies float columns via the FLOAT_SET registry
- Wraps float columns in CASE WHEN IS_NAN(col) THEN NULL ELSE col END
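A sketch of the wrapping step; FLOAT_SET is stubbed as a plain set of type names here (an assumption), and only the CASE WHEN expression comes from the description above:

```python
# Hypothetical sketch: float columns get a NaN guard so NaN becomes NULL
# before reaching downstream metric computation.
FLOAT_SET = {"real", "double"}

def select_expression(col, col_type):
    if col_type.lower() in FLOAT_SET:
        return f"CASE WHEN IS_NAN({col}) THEN NULL ELSE {col} END AS {col}"
    return col
```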
Pandas / Datalake Sampler
For non-SQL sources, DatalakeSampler works with DataFrames:
- Partitioned DataFrames via get_partitioned_df()
- Custom queries via get_sampled_query_dataframe()
- Chunked processing via DataFrame iterators
- NaN value filtering
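The chunked approach can be sketched in pure Python (lists stand in for DataFrame chunks here, an assumption made to keep the sketch dependency-free): drawing the same fraction from each chunk keeps the sample representative without loading the whole file.

```python
import random

# Hypothetical sketch of chunked sampling: take percent% of each chunk as
# it streams by, instead of materializing the full dataset first.
def sample_chunks(chunks, percent, seed=None):
    rng = random.Random(seed)
    sampled = []
    for chunk in chunks:
        n = round(len(chunk) * percent / 100)
        sampled.extend(rng.sample(chunk, n))
    return sampled
```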
NoSQL Sampler
For NoSQL databases (MongoDB, DynamoDB), NoSQLSampler:
- Uses NoSQLAdaptor to abstract database-specific operations
- Converts percentage to row count: num_rows * (profileSample / 100)
- Calls client.scan(limit=N) for sampling
- Transposes the list-of-dicts format into columnar TableData
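The two conversions can be sketched as follows; the columnar dict shape and the handling of missing keys are assumptions made for illustration, standing in for the real TableData model:

```python
import math

# Hypothetical sketch of the NoSQL flow: turn the configured percentage
# into a scan limit, then transpose scanned documents into columnar form.
def scan_limit(num_rows, profile_sample):
    return math.floor(num_rows * (profile_sample / 100))

def to_columnar(documents):
    """Transpose a list of dicts into {column: [values...]}."""
    columns = sorted({key for doc in documents for key in doc})
    return {col: [doc.get(col) for doc in documents] for col in columns}
```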
Adding a Database-Specific Sampler
Create the sampler file
Create sampler/sqlalchemy/{dialect}/sampler.py. Extend SQASampler and override set_tablesample().
Register the sampler
The sampler is discovered via import_sampler_class(), which resolves the class dynamically based on the database service type. Ensure your module path follows the convention.
Key Design Patterns
| Pattern | Where | Why |
|---|---|---|
| Factory | SamplerInterface.create() | Unified creation with config resolution |
| Strategy | set_tablesample() overrides per database | Each database has its own sampling syntax |
| Template Method | generate_sample_data() | Shared flow (fetch → truncate → upload) with subclass hooks |
| Hierarchical Config | config.py helpers | Table → schema → database → default resolution |
| CTE-based Sampling | SQASampler.get_dataset() | Clean separation of partition filtering, sampling, and query |
Key Files Quick Reference
| What you want to do | Start here |
|---|---|
| Understand the base interface | sampler_interface.py |
| See sampling config models | models.py |
| Understand config resolution | config.py |
| See partition handling | partition.py |
| Read the base SQL sampler | sqlalchemy/sampler.py |
| See a database-specific sampler | sqlalchemy/postgres/sampler.py (simplest) |
| See complex database handling | sqlalchemy/bigquery/sampler.py or sqlalchemy/timescale/sampler.py |
| Understand DataFrame sampling | pandas/sampler.py |
| See NoSQL sampling | nosql/sampler.py |
| See how profiler uses the sampler | profiler/source/database/base/profiler_source.py |
| See how DQ uses the sampler | data_quality/runner/base_test_suite_source.py |