# What is bdq?
BDQ stands for Big Data Quality: a set of tools and functions that help you assert the quality of the datasets you process or ingest every day. The library leverages the power of Spark, so your quality checks run at the scale offered by Spark and Databricks.

Over time the library has evolved into a DAG executor for arbitrary Python/PySpark functions, which takes scaling execution with dependency tracking to the next level.
## How to install?
On Databricks run `%pip install bdq==x.y.z`. Make sure you pin a version number you are comfortable with to ensure API stability.

This package is currently in an EXPERIMENTAL stage and newer releases might change APIs (function names, parameters, etc.).
## Supported Spark/Databricks versions

Development and testing have been performed on the 12.2 LTS Databricks runtime. The 10.4 LTS Databricks runtime should also work, with the exception of `SparkPipeline.step_spark_for_each_batch` and `SparkPipeline.spark_metric`, which require the 12.2 LTS runtime.
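
If you want to guard the 12.2-only features at runtime, one option (not part of bdq) is to inspect the `DATABRICKS_RUNTIME_VERSION` environment variable that Databricks sets on its clusters; a minimal sketch:

```python
import os

# minimal sketch, not part of bdq: Databricks sets DATABRICKS_RUNTIME_VERSION
# on its clusters, e.g. "12.2" or "10.4"
runtime = os.environ.get("DATABRICKS_RUNTIME_VERSION", "0.0")
major, minor = (int(x) for x in runtime.split(".")[:2])

if (major, minor) < (12, 2):
    print("SparkPipeline.step_spark_for_each_batch and SparkPipeline.spark_metric "
          "require the 12.2 LTS runtime")
```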
## Verbose output
bdq uses the Python `logging` module, so for the best logging experience configure a logger. Example setup:

```python
import logging
import sys

# py4j is very chatty, no need to deal with it unless it's critical
logging.getLogger("py4j").setLevel(logging.CRITICAL)

# minimal setup (assumed): send log records to stdout at INFO level
logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
```
See the examples below for a short listing of the major functionality, and the `tests` folder for detailed examples. The tests are meant to double as real use cases, so for now they also serve as documentation.

with optional primary key conflict resolution, for cases where multiple records are candidates for the latest record but have different attributes and there is no way to determine which one is the latest.
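
As an illustration of that idea only (a generic PySpark sketch, not bdq's API; the `pk` and `updated_at` column names, the sample data, and the Databricks-provided `spark` session are assumptions):

```python
from pyspark.sql import Window, functions as F

# made-up sample: key 1 has two candidates sharing the newest timestamp
df = spark.createDataFrame(
    [(1, "2024-01-02", "a"), (1, "2024-01-02", "b"), (2, "2024-01-01", "c")],
    "pk int, updated_at string, value string",
)

latest_first = Window.partitionBy("pk").orderBy(F.col("updated_at").desc())
same_ts = Window.partitionBy("pk", "updated_at")

latest = (
    df.withColumn("_rn", F.row_number().over(latest_first))
      .withColumn("_candidates", F.count("*").over(same_ts))
      .filter("_rn = 1")
)

conflicts = latest.filter("_candidates > 1")   # these keys need a conflict resolution rule
resolved = latest.filter("_candidates = 1").drop("_rn", "_candidates")
```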
Given a list of possible columns, it constructs all possible combinations of composite primary keys and validates them concurrently to determine whether a given set of columns is a valid primary key. It uses the minimum possible number of queries against Spark by skipping validation paths that build on already validated primary key combinations.
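
For intuition only (a sequential sketch, not bdq's implementation; the library runs these checks concurrently): a column combination is a valid primary key when its distinct count equals the row count, and any combination containing an already validated key can be skipped.

```python
from itertools import combinations

def find_composite_keys(df, candidate_columns):
    # sketch only: a combination is a valid primary key if its distinct count
    # equals the total row count; combinations containing an already valid key
    # are skipped, saving queries against Spark
    total = df.count()
    valid = []
    for size in range(1, len(candidate_columns) + 1):
        for combo in combinations(candidate_columns, size):
            if any(set(key).issubset(combo) for key in valid):
                continue
            if df.select(*combo).distinct().count() == total:
                valid.append(combo)
    return valid
```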
Pipeline steps can be rerun like any ordinary function:
```python
# to rerun a given step, just execute it as if it were a pure function
# the return value is always the list of dataframes that the given @ppn.step returns
# note: the spark view 'raw_data_single_source' will be updated when this function finishes (as per the definition in @ppn.step above)
raw_data_single_source()
```
### Spark UI Stage descriptions
When running code with PySpark, the Spark UI gets very crowded. `SparkUILogger` can be used as a context manager or a decorator to assign human-readable names to Spark stages.
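
A hypothetical usage sketch (the import path, the constructor argument, and the example jobs are assumptions; check the `tests` folder for the actual API):

```python
from bdq import SparkUILogger  # assumed import path

# as a context manager: stages triggered inside the block carry this description
with SparkUILogger("count raw events"):
    spark.range(1_000_000).count()

# as a decorator: stages triggered by the function carry the given description
@SparkUILogger("aggregate daily metrics")
def aggregate_daily_metrics():
    return spark.range(1_000_000).selectExpr("id % 7 as day").groupBy("day").count()

aggregate_daily_metrics()
```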