#291: Project improvements after 0.3.0 release #300

Merged
merged 11 commits into from
Nov 22, 2024
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
* @benedeki @lsulak @TebaleloS @Zejnilovic @dk1844 @salamonpavel
* @benedeki @lsulak @Zejnilovic @dk1844 @salamonpavel @ABLL526
20 changes: 13 additions & 7 deletions .github/workflows/build.yml
@@ -30,19 +30,20 @@ jobs:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
fetch-depth: 1
persist-credentials: false
- uses: coursier/cache-action@v5

- name: Setup Scala
uses: olafurpg/setup-scala@v14
with:
java-version: "[email protected]"

- name: Build and run unit tests
run: sbt "project model" test doc "project reader" test doc "project agent_spark3" test doc
- name: Build and run tests
run: sbt testAll

- name: Build and run integration tests
run: sbt "project model" testIT "project reader" testIT "project agent_spark3" testIT
- name: Generate documentation
run: sbt doc

test-database-and-server:
name: Test Database and Server
@@ -64,7 +65,8 @@ jobs:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
fetch-depth: 1
persist-credentials: false
- uses: coursier/cache-action@v5

- name: Setup Scala
@@ -73,10 +75,14 @@ jobs:
java-version: "[email protected]"

- name: Build and run unit tests
run: sbt "project database" test doc "project server" test doc
run: sbt "project database" test "project server" test

- name: Prepare testing database
run: sbt flywayMigrate

- name: Build and run integration tests
run: sbt "project database" testIT "project server" testIT

- name: Generate documentation
run: sbt "project database" doc "project server" doc

3 changes: 2 additions & 1 deletion .github/workflows/format_check.yml
@@ -27,8 +27,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
persist-credentials: false
fetch-depth: 0
ref: ${{ github.event.pull_request.head.ref }}

2 changes: 2 additions & 0 deletions .github/workflows/jacoco_report.yml
@@ -50,6 +50,8 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
persist-credentials: false
- name: Setup Scala
uses: olafurpg/setup-scala@v14
with:
4 changes: 3 additions & 1 deletion .github/workflows/license_check.yml
@@ -27,7 +27,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
persist-credentials: false
- name: Setup Scala
uses: olafurpg/setup-scala@v10
with:
94 changes: 0 additions & 94 deletions .github/workflows/pr_release_note_comment_check.yml

This file was deleted.

44 changes: 44 additions & 0 deletions .github/workflows/release-notes-presence-check.yml
@@ -0,0 +1,44 @@
#
# Copyright 2021 ABSA Group Limited
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

name: Release Notes Presence Check

on:
pull_request:
types: [opened, synchronize, reopened, edited, labeled, unlabeled]
branches: [ master ]

env:
SKIP_LABEL: 'no RN'
RLS_NOTES_TAG_REGEX: 'Release Notes:'

jobs:
release-notes-presence-check:
name: Release Notes Presence Check
runs-on: ubuntu-latest

steps:
- uses: actions/[email protected]
with:
python-version: '3.11'

- name: Check presence of release notes in PR description
uses: AbsaOSS/[email protected]
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
github-repository: ${{ github.repository }}
pr-number: ${{ github.event.number }}
11 changes: 10 additions & 1 deletion .github/workflows/release_draft.yml
@@ -28,6 +28,7 @@ jobs:
steps:
- uses: actions/checkout@v4
with:
persist-credentials: false
fetch-depth: 0
# the following step is disabled because it doesn't order the version tags correctly
# - name: Validate format of received tag
@@ -104,6 +105,7 @@ jobs:
steps:
- uses: actions/checkout@v4
with:
persist-credentials: false
fetch-depth: 0
ref: refs/tags/${{ github.event.inputs.tagName }}

@@ -119,10 +121,17 @@ jobs:
with:
tag-name: ${{ github.event.inputs.tagName }}
chapters: '[
{"title": "No entry 🚫", "label": "duplicate"},
{"title": "No entry 🚫", "label": "invalid"},
{"title": "No entry 🚫", "label": "wontfix"},
{"title": "No entry 🚫", "label": "no RN"},
{"title": "Breaking Changes 💥", "label": "breaking-change"},
{"title": "New Features 🎉", "label": "enhancement"},
{"title": "New Features 🎉", "label": "feature"},
{"title": "Bugfixes 🛠", "label": "bug"}
{"title": "Bugfixes 🛠", "label": "bug"},
{"title": "Infrastructure ⚙️", "label": "infrastructure"},
{"title": "Silent-live 🤫", "label": "silent-live"},
{"title": "Documentation 📜", "label": "documentation"}
]'
duplicity-scope: 'service'
duplicity-icon: '🔁'
2 changes: 2 additions & 0 deletions .github/workflows/release_publish.yml
@@ -27,6 +27,7 @@ jobs:
- name: Checkout code
uses: actions/checkout@v4
with:
persist-credentials: false
fetch-depth: 0
- uses: coursier/cache-action@v5

@@ -51,6 +52,7 @@ jobs:
- name: Checkout code
uses: actions/checkout@v4
with:
persist-credentials: false
fetch-depth: 0
- uses: coursier/cache-action@v5

4 changes: 3 additions & 1 deletion .github/workflows/test_filenames_check.yml
@@ -27,7 +27,9 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
persist-credentials: false

- name: Filename Inspector
id: scan-test-files
75 changes: 75 additions & 0 deletions README.md
@@ -1,6 +1,19 @@
# Atum Service

[![Build](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml/badge.svg)](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml)
[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://GitHub.com/Naereen/StrapDown.js/graphs/commit-activity)
Review comment (Collaborator): beautiful!


| Atum Server | Atum Agent | Atum Model | Atum Reader |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [![GitHub release](https://img.shields.io/github/release/AbsaOSS/atum-service.svg)](https://GitHub.com/AbsaOSS/atum-service/releases/) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-agent-spark3_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-agent&namespace=za.co.absa.atum-service) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-model_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-model&namespace=za.co.absa.atum-service) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa.atum-service/atum-reader_2.13/badge.svg)](https://central.sonatype.com/search?q=atum-reader&namespace=za.co.absa.atum-service) |




- [Atum Service](#atum-service)
- [Motivation](#motivation)
- [Features](#features)
- [Modules](#modules)
- [Agent `agent/`](#agent-agent)
- [Reader `reader/`](#reader-reader)
@@ -15,6 +28,9 @@
- [Measurement](#measurement)
- [Checkpoint](#checkpoint)
- [Data Flow](#data-flow)
- [Usage](#usage)
- [Atum Agent routines](#atum-agent-routines)
- [Control measurement types](#control-measurement-types)
- [How to generate Code coverage report](#how-to-generate-code-coverage-report)
- [How to Run in IntelliJ](#how-to-run-in-intellij)
- [How to Run Tests](#how-to-run-tests)
@@ -41,6 +57,39 @@ functions and are stored on a single central place, in a relational database. Co
checkpoints is not only helpful for complying with strict regulatory frameworks, but also helps during development
and debugging of your Spark-based data processing.

## Motivation

A company's Big Data strategy usually includes data gathering and ingestion processes.
These define how data from the different systems operating inside the company
are gathered and stored for further analysis and reporting. An ingestion process can involve
various transformations, such as:
* Converting between data formats (XML, CSV, etc.)
* Data type casting, for example converting XML strings to numeric values
* Joining reference tables, for example enriching existing
data with additional information available through dictionary mappings

Together these constitute a common ETL (Extract, Transform and Load) process.

During such transformations, data can sometimes get corrupted (e.g. during casting), and records can
be added or lost. For instance, *outer joining* a table that holds duplicate keys can cause the number of records
to explode, and *inner joining* a table that has no matching keys for some records will lose those records.
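
The two join effects can be illustrated with a minimal sketch in plain Python (no Spark involved). The tables, column names and helper functions below are hypothetical and exist only for this illustration:

```python
# Three input records; every one of them should survive the ingestion.
records = [
    {"id": 1, "country_code": "CZ"},
    {"id": 2, "country_code": "ZA"},
    {"id": 3, "country_code": "XX"},
]

# Hypothetical reference table holding a duplicate key ("CZ" appears twice).
ref_with_duplicates = [
    {"country_code": "CZ", "country": "Czechia"},
    {"country_code": "CZ", "country": "Czech Republic"},
    {"country_code": "ZA", "country": "South Africa"},
    {"country_code": "XX", "country": "Unknown"},
]

# Hypothetical reference table with no entry for "XX".
ref_with_gaps = [
    {"country_code": "CZ", "country": "Czechia"},
    {"country_code": "ZA", "country": "South Africa"},
]


def inner_join(left, right, key):
    """Keep only the left rows that have at least one match on `key`."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[key] == right_row[key]
    ]


def left_outer_join(left, right, key):
    """Keep every left row; emit one output row per matching right row."""
    joined = []
    for left_row in left:
        matches = [r for r in right if r[key] == left_row[key]]
        if matches:
            joined.extend({**left_row, **r} for r in matches)
        else:
            joined.append({**left_row, "country": None})
    return joined


exploded = left_outer_join(records, ref_with_duplicates, "country_code")
lost = inner_join(records, ref_with_gaps, "country_code")

# The outer join yields more rows than the input, the inner join fewer.
print(len(records), len(exploded), len(lost))
```

A simple record count taken before and after each join is enough to reveal both problems in this case.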

In regulated industries it is crucial to ensure data integrity and accuracy. For instance, in the banking industry
the BCBS set of regulations requires analysis and reporting to be based on data accuracy and integrity principles.
Thus it is critical at the ingestion stage to preserve the accuracy and integrity of the data gathered from a
source system.

The purpose of Atum is to provide a means of ensuring that no critical fields have been modified during processing and
that no records have been added or lost. To do this, the library provides the ability to calculate *control numbers* of
explicitly specified columns using a selection of aggregate functions. We call the set of such measurements at a given
time a *checkpoint*, and each value (the result of a function computation) a *control measurement*. Checkpoints can
be calculated at any point between Spark transformations and actions, as well as at the start of the process or after its end.
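
As a conceptual sketch only (plain Python, not the Atum Agent API), a checkpoint can be thought of as a named set of control measurements that is compared across processing stages. The function names, field names and sample data are assumptions made for this illustration:

```python
def checkpoint(name, rows, column):
    """Compute a small set of control measurements for `rows` at one point in time."""
    return {
        "name": name,
        "record_count": len(rows),
        "sum_of_column": sum(row[column] for row in rows),
    }


def measurements_match(first, second):
    """Compare two checkpoints on their measurements, ignoring the checkpoint name."""
    return all(first[key] == second[key] for key in first if key != "name")


batch = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": -40},
    {"id": 3, "amount": 15},
]
source = checkpoint("source", batch, "amount")

# A transformation that adds a column but leaves the critical column untouched.
enriched = [{**row, "currency": "ZAR"} for row in batch]
after_enrichment = checkpoint("after enrichment", enriched, "amount")

# A faulty transformation that silently drops a record.
filtered = [row for row in enriched if row["amount"] > 0]
after_filter = checkpoint("after filter", filtered, "amount")

print(measurements_match(source, after_enrichment))  # the critical column is intact
print(measurements_match(source, after_filter))      # a record was lost
```

Comparing the checkpoint taken at the source with the ones taken after each step points to the step where the data changed.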

We assume the data for ETL are processed in a series of batch jobs. Let's call each data set for a given batch
job a *batch*. All checkpoints are calculated for a specific batch.

## Features

TBD

## Modules

@@ -157,6 +206,32 @@ The journey of a dataset throughout various data transformations and pipelines.
even if it involves multiple applications or ETL pipelines.


## Usage

### Atum Agent routines

TBD

### Control measurement types

A control measurement of one or more columns is the result of an aggregation function executed over the dataset. How it
is calculated depends on the column's data type, on business requirements and on the function used. This table
lists all currently supported measurement types (also called measures):

| Type | Description |
|------------------------------------|:--------------------------------------------------------------|
| AtumMeasure.RecordCount | Calculates the number of rows in the dataset |
| AtumMeasure.DistinctRecordCount | Calculates COUNT(DISTINCT()) of the specified column |
| AtumMeasure.SumOfValuesOfColumn | Calculates SUM() of the specified column |
| AtumMeasure.AbsSumOfValuesOfColumn | Calculates SUM(ABS()) of the specified column |
| AtumMeasure.SumOfHashesOfColumn | Calculates SUM(CRC32()) of the specified column |
| Measure.UnknownMeasure | Custom measure where the data are provided by the application |

[//]: # (| controlType.aggregatedTruncTotal | Calculates SUM(TRUNC()) of the specified column |)

[//]: # (| controlType.absAggregatedTruncTotal | Calculates SUM(TRUNC(ABS())) of the specified column |)
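
To make the aggregations in the table concrete, here is a rough sketch in plain Python of what each one computes over a single column. It is not the Atum implementation: the function name and dictionary keys are hypothetical, and the way values are cast to bytes before hashing is an assumption that may differ from what Atum and Spark actually do.

```python
import zlib


def control_measurements(values):
    """Compute illustrative equivalents of the measures over one column of values."""
    # Assumption: each value is hashed via the CRC32 of its string form.
    hashes = [zlib.crc32(str(value).encode("utf-8")) for value in values]
    return {
        "record_count": len(values),                 # cf. RecordCount
        "distinct_count": len(set(values)),          # cf. DistinctRecordCount
        "sum": sum(values),                          # cf. SumOfValuesOfColumn
        "abs_sum": sum(abs(v) for v in values),      # cf. AbsSumOfValuesOfColumn
        "sum_of_hashes": sum(hashes),                # cf. SumOfHashesOfColumn
    }


amounts = [10, -5, 10, 3]
measures = control_measurements(amounts)
print(measures)
```

Because all of these are sums or counts, the result does not depend on row order, which suits datasets that get repartitioned or reshuffled during processing.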


## How to generate Code coverage report
```sbt
sbt jacoco
@@ -40,7 +40,7 @@ class AgentServerCompatibilityTests extends DBTestSuite {
.add(StructField("columnForSum", DoubleType))

// Need to add service & pg run in CI
test("Agent should be compatible with server") {
ignore("Agent should be compatible with server") {

val expectedMeasurement = JsonBString(
"""{"mainValue": {"value": "4", "valueType": "Long"}, "supportValues": {}}""".stripMargin
Expand Down