Merged
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
-* @tmikula-dev @OlivieFranklova
+* @petr-pokorny-absa @lsulak @tmikula-dev
23 changes: 23 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,23 @@
Tool for exact comparison of two Parquet/CSV datasets, detecting row and column-level differences.
Monorepo: `bigfiles/` (Scala+Spark, files not fitting RAM) and `smallfiles/` (Python, files fitting RAM).

## Scala (bigfiles/)
- Scala 2.12.20 default, 2.11.12 cross-compiled via `sbt +`. Spark 3.5.3/2.4.7, Hadoop 3.3.5/2.6.5, Java 8.
- SBT 1.10.2. All sbt commands run from `bigfiles/`. Entry point: `za.co.absa.DatasetComparison`.
- `sbt test` — unit + integration (local Spark `local[*]`). `sbt jacoco` — coverage. `sbt assembly` — fat JAR.
- JaCoCo online mode: `sbt-jacoco` + `jacoco-method-filter-sbt`, rules in `bigfiles/jmf-rules.txt`, aliases in `bigfiles/.sbtrc`.
- scalafmt: dialect scala211 (cross-compat), maxColumn 120. `assemblyMergeStrategy` discards META-INF.
- No runtime services — pure Spark batch job.

## Python (smallfiles/)
- Python 3.13. Entry point: `smallfiles/main.py`. Deps pinned in `smallfiles/requirements.txt`.

## Quality gates
- Scala: JaCoCo overall >= 67% (goal: >= 80%), changed files >= 80%. PR comments via `MoranaApps/jacoco-report`.
- Python: pytest >= 80%, pylint >= 9.5, black formatting, mypy type checking.
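
The pylint gate above can be illustrated with a minimal Python sketch. It assumes pylint prints its usual `rated at X/10` summary line; the function names `parse_pylint_score` and `passes_gate` are hypothetical and do not exist in this repository.

```python
import re

# Threshold taken from the quality gate listed above.
PYLINT_THRESHOLD = 9.5


def parse_pylint_score(report: str) -> float:
    """Extract the current score from pylint's 'rated at X/10' summary line."""
    match = re.search(r"rated at (-?\d+(?:\.\d+)?)/10", report)
    if match is None:
        raise ValueError("no 'rated at' line found in pylint output")
    return float(match.group(1))


def passes_gate(report: str, threshold: float = PYLINT_THRESHOLD) -> bool:
    """Return True when the score meets the threshold (a score equal to it passes)."""
    return parse_pylint_score(report) >= threshold
```

A score exactly at the threshold passes here, which matches a gate worded as `>= 9.5`.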

## Conventions
- Apache 2.0 license headers on all source files.
- Organization: `za.co.absa`. Git versioning via `sbt-git`.
- GH Actions: pinned commit SHAs for all third-party actions.
- `bigfiles/project/` — sbt build definitions only, excluded from CI change detection.
183 changes: 123 additions & 60 deletions .github/workflows/ci_python.yml
@@ -17,105 +17,168 @@ name: CI Python

on:
pull_request:
types: [ opened, synchronize, reopened ]
push:
branches: [ master ]
workflow_dispatch:

concurrency:
group: static-python-check-${{ github.ref }}
cancel-in-progress: true

permissions:
contents: read
security-events: write

jobs:
test-smallfiles:
name: Test Small files
detect:
name: Python Changes Detection
runs-on: ubuntu-latest
outputs:
python_changed: ${{ steps.changes.outputs.python_changed }}
steps:
- name: Checkout repository
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8
with:
persist-credentials: false
fetch-depth: 0

- name: Check if Python files changed
id: changes
shell: bash
env:
GH_TOKEN: ${{ github.token }}
run: |
set -euo pipefail

if [[ "${{ github.event_name }}" == "pull_request" ]]; then
CHANGED_FILES=$(gh api \
"repos/${{ github.repository }}/pulls/${{ github.event.pull_request.number }}/files" \
--jq '.[].filename | select(endswith(".py") or . == "smallfiles/requirements.txt")')
else
CHANGED_FILES=$(git diff --name-only "${{ github.sha }}~1" "${{ github.sha }}" -- '*.py' 'smallfiles/requirements.txt')
fi

if [[ -n "$CHANGED_FILES" ]]; then
echo "python_changed=true" >> "$GITHUB_OUTPUT"
else
echo "python_changed=false" >> "$GITHUB_OUTPUT"
fi

pylint-analysis:
name: Pylint Static Code Analysis
needs: detect
if: needs.detect.outputs.python_changed == 'true'
runs-on: ubuntu-latest
defaults:
run:
working-directory: smallfiles
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Checkout repository
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8
with:
persist-credentials: false
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548
with:
python-version: '3.11'
python-version: '3.13'
cache: 'pip'

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install coverage pytest

- name: Run tests
run: coverage run -m pytest test/
run: pip install -r smallfiles/requirements.txt

- name: Show coverage
run: coverage report -m --omit=".*.ipynb"

- name: Create coverage file
if: github.event_name == 'pull_request'
run: coverage xml
- name: Analyze code with Pylint
id: analyze-code
run: |
pylint_score=$(pylint $(git ls-files '*.py') | grep 'rated at' | awk '{print $7}' | cut -d'/' -f1)
echo "PYLINT_SCORE=$pylint_score" >> "$GITHUB_ENV"

- name: Get Cover
if: github.event_name == 'pull_request'
uses: orgoro/coverage@v3.1
- name: Check Pylint score
run: |
if (( $(echo "$PYLINT_SCORE < 9.5" | bc -l) )); then
echo "Failure: Pylint score is below 9.5 (project score: $PYLINT_SCORE)."
exit 1
else
echo "Success: Pylint score is at least 9.5 (project score: $PYLINT_SCORE)."
fi

black-check:
name: Black Format Check
needs: detect
if: needs.detect.outputs.python_changed == 'true'
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8
with:
coverageFile: smallfiles/coverage.xml
token: ${{ secrets.GITHUB_TOKEN }}
thresholdAll: 0.7
thresholdNew: 0.9
persist-credentials: false
fetch-depth: 0

- uses: actions/upload-artifact@v4
if: github.event_name == 'pull_request'
- name: Set up Python
uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548
with:
name: coverage
path: coverage.xml
retention-days: 1
python-version: '3.13'
cache: 'pip'

- name: Install dependencies
run: pip install -r smallfiles/requirements.txt

python-format-check:
name: Python Format Check
- name: Check code format with Black
id: check-format
run: black --check $(git ls-files '*.py')

mypy-check:
name: Mypy Type Check
needs: detect
if: needs.detect.outputs.python_changed == 'true'
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8
with:
persist-credentials: false
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548
with:
python-version: '3.11'
python-version: '3.13'
cache: 'pip'

- name: Install dependencies
run: |
pip install black

- name: Check code format with Black
run: |
black --check $(git ls-files '*.py')

run: pip install -r smallfiles/requirements.txt

- name: Check types with Mypy
id: check-types
run: mypy .

python-static-analysis:
name: Python Static Analysis
unit-tests:
name: Pytest Tests with Coverage
needs: detect
if: needs.detect.outputs.python_changed == 'true'
runs-on: ubuntu-latest
defaults:
run:
working-directory: smallfiles
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@8e8c483db84b4bee98b60c0593521ed34d9990e8
with:
persist-credentials: false
fetch-depth: 0

- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@83679a892e2d95755f2dac6acb0bfd1e9ac5d548
with:
python-version: '3.11'
python-version: '3.13'
cache: 'pip'

- name: Install dependencies
run: |
pip install -r requirements.txt
pip install pylint
- name: Install Python dependencies
run: pip install -r smallfiles/requirements.txt

- name: Analysing the code with pylint
run: |
pylint $(git ls-files '*.py')
- name: Check code coverage with Pytest
run: pytest --cov=smallfiles -v smallfiles/test/ --cov-fail-under=80

noop:
name: No Operation
needs: detect
if: needs.detect.outputs.python_changed != 'true'
runs-on: ubuntu-latest
steps:
- run: echo "No changes in the *.py files — passing."
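
The file filter used by the `detect` job can be sketched in Python to make its behavior easier to reason about. This is only an illustration of the `jq` expression `select(endswith(".py") or . == "smallfiles/requirements.txt")`; the helper names `python_relevant` and `python_changed` are hypothetical, and the workflow itself does this in bash.

```python
from typing import Iterable, List

# The only non-.py path that triggers the Python jobs in the workflow.
REQUIREMENTS_PATH = "smallfiles/requirements.txt"


def python_relevant(changed_files: Iterable[str]) -> List[str]:
    """Keep paths ending in .py, plus the pinned requirements file."""
    return [
        path
        for path in changed_files
        if path.endswith(".py") or path == REQUIREMENTS_PATH
    ]


def python_changed(changed_files: Iterable[str]) -> str:
    """Return the string value the job would write to its python_changed output."""
    return "true" if python_relevant(changed_files) else "false"


if __name__ == "__main__":
    example = ["bigfiles/build.sbt", "smallfiles/main.py", "README.md"]
    print(python_changed(example))
```

Because the downstream jobs compare the output against the string `'true'`, the sketch returns strings rather than booleans; any value other than `'true'` routes to the `noop` job.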