#2007: PG data model and functions for Schema #2008

Open · wants to merge 53 commits into develop

Conversation

@benedeki benedeki commented Jan 11, 2022

  • database and database users creation scripts
  • global_id function
  • basic tables for entities
  • tables to store the schema entities
  • basic function to work with schema entities
  • first table for statistics

Closes #2007

This is a first suggestion of how the Postgres DB for Enceladus could look, but also an example of how I would like us to model our database code. Of course the names and/or logic are not beyond question - quite the opposite. Please comment, disagree, ask!

Deployment:
The scripts are to be run by a super-user, particularly the first ones.

  1. First execute the scripts in the root.
  2. Then follow by folders.
  3. Each folder represents one DB schema - it contains the file _schema.ddl, which creates the schema.
  4. The rest are the tables and functions - tables have the extension .ddl, functions .sql.
  5. .sql files are idempotent and can be redeployed repeatedly.
  6. .ddl files are not idempotent, but usually contain a commented-out DROP command which allows redeployment, at the cost of potential data loss.

The end goal is to have .sql files for functions and .json files describing the tables. A deployment tool would then generate the "migration" to ensure the tables are up to the description.
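The convention above can be sketched as follows (hypothetical file contents - names are illustrative, not taken from this PR):

```sql
-- example_schema/_schema.ddl  (.ddl => not idempotent; rerunning fails unless the DROP is uncommented)
-- DROP SCHEMA IF EXISTS example_schema CASCADE;  -- commented out to prevent accidental data loss
CREATE SCHEMA example_schema;

-- example_schema/example_table.ddl  (.ddl => table definition, not idempotent)
-- DROP TABLE IF EXISTS example_schema.example_table;
CREATE TABLE example_schema.example_table (
    id      BIGINT PRIMARY KEY,
    payload JSONB  NOT NULL
);

-- example_schema/get_payload.sql  (.sql => function, idempotent, safe to redeploy repeatedly)
CREATE OR REPLACE FUNCTION example_schema.get_payload(i_id BIGINT)
RETURNS JSONB AS $$
    SELECT payload FROM example_schema.example_table WHERE id = i_id;
$$ LANGUAGE sql;
```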

P.S.
Tables in their current format are missing descriptions. Eventually they should have them too.

benedeki and others added 30 commits March 13, 2021 23:54
* 1422 and 1423 Remove HDFS and Oozie from Menas

* #1422 Fix HDFS location validation

* #1424 Add Menas Dockerfile

* #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.

* #1416 - enceladus on S3:

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt

* #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output)

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt

* ref issue = 1416

* related test cases ignored (issue reference added)

* PR updates

* Merge spline 0.5.3 into aws-poc

* Update spline to 0.5.4 for AWS PoC

* #1503 Remove HDFS url Validation

This is a temporary solution. We currently experiment with
many forms of URLs, and having a regex there now slows us down.

* New dockerfile - smaller image

* s3 persistence (atum, sdk fs usage, ...) (#1526)

#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir(hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive

* Feature/1556 file access PoC using Hadoop FS API (#1586)

* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)

* 1554 Tomcat with TLS in Docker container (#1585)

* #1554 Tomcat with TLS container

* #1554 Added envoy config + enabling running unencrypted container

* #1499 Add authentication to /lineage + update spline to 0.5.5

* #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)

* #1612 Separation start

* #1612 Updated DAO for spark-jobs

* #1612 Fixed spline integration and schema, removed redundant code

* #1612 Fixed tests, removed unused dependency

* #1612 Added back dependency

* WIP fixing merge issues

* * Merge compiles
* Tests pass
* Depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)

* #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo

* #1612 Cookies + updated lineage

* * put back HDFS browser
* put back Oozie
* downgraded Spline

* * AWS SDK Exclusion

* #1612 Included HDFSFolder + missing Oozie parts

* * New ATUM version

* * Adding missing files

* #1612 menas-web on nginx container and passing API_URL

* #1612 Working TLS on nginx, resources not included in code

* 1622: Merge of aws-poc to develop branch
* Addressed issues identified by reviewers

* * comments improvement

* 1434 Add new way of serving properties to Docker

* #1612 Building using ui5 + reused /api route

* #1612 Project version

* #713 Add favicon

* #1612 Merges

* #1612 pom parent version

* #1648 Fix war deployment + adding back spline to menas

* #1612 other fixes

* #1612 added pom package.json version sync

* #1612 newline

* #1612 fix version sync + cleaning dist

* 1648 merge to develop

* 1648 merge fix

* 1648 Fixes schema upload

* 1648 Fixes schema registry request

* 1648 pom version

* 1612 add docker build

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #1648 Updating menas-web to 3.0

* #1612 Updated npm project versions + mvn plugin

* #1612 license_check.yml

* #1612 licence check fix

Co-authored-by: Saša Zejnilović <[email protected]>
Co-authored-by: Daniel Kavan <[email protected]>
Co-authored-by: Jan Scherbaum <[email protected]>
Co-authored-by: David Benedeki <[email protected]>

merging develop into develop-ver3.0 (via  feature/merging-develop-ver.30 branch)
KafkaErrorSenderPluginSuite test fix for SourcePhase.{Standardization, Conformance} capitalization.
* #1769 rename menas to rest-api

* #1769 README update and other feedback
* #1733 final version 0.6.0 of spline spark agent used
* #1733 cherrypicked version: commons 0.0.27, atum 3.5.0, spline agent 0.6.0-snapshot
* #1733 PR touchup: maven.shade.version version revert to 3.2.1
* #1770 Rename menas-web to menas
Spline 0.6 integration
…elate to rest_api, not menas: `/3rdParty/**` should not be added anymore as it is menas-related)
* Adding back Menas module, that somehow got omitted.
…lop-ver-3

Merging Release 2.23.0 into develop-ver-3
@benedeki benedeki force-pushed the feature/2007-pg-data-model-and-functions-for-schema branch from 224552d to e8fa5cf on January 12, 2022 02:38
@dk1844 (Contributor) previously approved these changes Jan 12, 2022 and left a comment:

I am by far no expert in SQL (and even less in PG SQL), so I am just mentioning some things I have found a bit unusual (just read the code)

Comment on lines 16 to 18
DROP FUNCTION IF EXISTS dataset_schema.add_schema(TEXT, INTEGER, TEXT, JSONB, TEXT);

CREATE OR REPLACE FUNCTION dataset_schema.add_schema(
Contributor:

Why dropping the function when you later offer to replace it anyway? (wouldn't either DROP+CREATE or just CREATE or REPLACE suffice?)

Collaborator (author):

You are right. After the DROP, CREATE would be enough. I can change that for sure.
(My template is set up with CREATE OR REPLACE, but unfortunately that cannot handle the OUT parameters changing - I think it used to be able to.)
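For context: PostgreSQL's CREATE OR REPLACE FUNCTION cannot change the return type of an existing function, and OUT parameters are part of the return type - which is what the preceding DROP works around. A minimal sketch (the function name `demo` is hypothetical):

```sql
-- First deployment: a function with a single OUT parameter
CREATE OR REPLACE FUNCTION demo(OUT status INTEGER) AS $$
BEGIN
    status := 200;
END;
$$ LANGUAGE plpgsql;

-- Adding an OUT parameter changes the return type, so a plain
-- CREATE OR REPLACE of the new signature fails with:
--   ERROR:  cannot change return type of existing function
--   HINT:  Use DROP FUNCTION demo() first.

-- Hence the DROP + CREATE pattern used in these scripts:
DROP FUNCTION IF EXISTS demo();
CREATE FUNCTION demo(OUT status INTEGER, OUT status_text TEXT) AS $$
BEGIN
    status := 200;
    status_text := 'OK';
END;
$$ LANGUAGE plpgsql;
```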

Comment on lines 49 to 52
-- 403 - Schema already exists
-- 404 - Schema version wrong
-- 405 - Schema has been disabled
-- 406 - Schema is locked
Contributor:

This looks like usage compatible with HTTP/REST status codes, but quite non-standard. We should not be inventing status codes, I think. How about:

--      201     - OK
--      409     - (conflict) Schema already exists
--      404     - (not found) Schema version wrong
--      403     - (forbidden) Schema has been disabled
--      403     - (forbidden) Schema is locked

By that I mean:

  • 404 is clear - incorrect path = not found
  • 409 Conflict is typical for out-of-sync edits
  • for the rest, 403 - forbidden (i.e. the request is valid, but the state of the entity does not allow it to continue). Here, the difference would have to be documented by the status_text.

If the above is unsatisfactory, perhaps 423 Locked from WebDAV could be misused, it is pretty close:

--      409     - (conflict) Schema already exists
--      404     - (not found) Schema version wrong
--      403     - (forbidden) Schema has been disabled
--      423     - (locked) Schema is locked (maybe)

A relevant info source: https://www.restapitutorial.com/httpstatuscodes.html

Collaborator (author):

I am most happy to change the status codes. I didn't put too much effort into their choice. But I would avoid giving the same code to different (failure) reasons where only the text distinguishes them - the first suggestion. The second sounds good, I will implement it.

-- Status codes:
-- 200 - OK
-- 404 - Schema does not exist
-- 405 - Schema of the given version does not exist
Contributor:

My vote here is to use the standard 404 code and just differentiate by the status_text.

Suggested change
-- 405 - Schema of the given version does not exist
-- 404 - Schema of the given version does not exist

Collaborator (author):

I am happy to use better code, but not the same one, 404.

Contributor:

I understand that you want to distinguish the cases here, but I still believe that 404 is appropriate for both.
The code describes that the requested entity is not found, however defined. So,
get_schema(existingSchema, notExistingVersion) does not exist the same way as get_schema(bogusSchema, regardlessVersion).

Btw. just schema existence, regardless of the version (something that 405 should signify), can be checked by get_schema(existingSchema, NULL) if I read the code correctly.

405 is Method Not Allowed (by which HTTP methods like GET, POST, ... are meant), so it does not seem to be fitting. It would be nice to find an alternative status code with a proper meaning, but they don't just give out any 404B's :)
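The convention being converged on could look roughly like this inside such a function (a sketch - the function name and IN flags are hypothetical; only the code/text mapping follows the discussion):

```sql
CREATE OR REPLACE FUNCTION dataset_schema.demo_status(
    IN  i_exists        BOOLEAN,
    IN  i_wrong_version BOOLEAN,
    IN  i_disabled      BOOLEAN,
    IN  i_locked        BOOLEAN,
    OUT status          INTEGER,
    OUT status_text     TEXT
) AS $$
BEGIN
    IF i_exists THEN
        status := 409; status_text := 'Schema already exists';    -- conflict
    ELSIF i_wrong_version THEN
        status := 404; status_text := 'Schema version wrong';     -- not found
    ELSIF i_disabled THEN
        status := 403; status_text := 'Schema has been disabled'; -- forbidden
    ELSIF i_locked THEN
        status := 423; status_text := 'Schema is locked';         -- WebDAV Locked
    ELSE
        status := 201; status_text := 'OK';
    END IF;
END;
$$ LANGUAGE plpgsql;
```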

database/src/main/public/global_id.sql (outdated; resolved)
*/


--DROP SEQUENCE IF EXISTS public.global_id_seq
Contributor:

Maybe uncomment?

Collaborator (author):

Same as in the case of the tables - to prevent data loss and potential ID conflicts.
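The pattern being discussed is roughly the following (a sketch under the assumption that global_id() simply wraps a shared sequence; the actual file may differ):

```sql
-- The DROP stays commented out on purpose: recreating the sequence would
-- restart numbering and risk conflicts with already-issued IDs.
-- DROP SEQUENCE IF EXISTS public.global_id_seq;
CREATE SEQUENCE public.global_id_seq;

-- Idempotent function wrapper: every caller draws IDs from the one
-- database-wide sequence, so IDs are unique across all entity tables.
CREATE OR REPLACE FUNCTION public.global_id()
RETURNS BIGINT AS $$
    SELECT nextval('public.global_id_seq');
$$ LANGUAGE sql;
```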

database/src/main/public/global_id.sql (outdated; resolved)
* limitations under the License.
*/

-- DROP TABLE dataset_schema.heads;
Contributor:

Suggested change
-- DROP TABLE dataset_schema.heads;
DROP TABLE IF EXISTS dataset_schema.heads;

And the same elsewhere, perhaps?

Collaborator (author):

IF EXISTS, sure.
But it's commented out on purpose, so an accidental run won't result in data loss - just an error from trying to create an already existing table.

* added disabled flag in return of list_schemas.sql

CREATE TABLE stats.jobs_configurations
(
schema_count INTEGER NOT NULL,
Contributor:

This was supposed to be the landing page statistics? If yes, could you include runs and properties? Also, I'm not sure about the naming.

Contributor:

I think some procedures to compute the other fields, like today's run statistics and the total number of missing recommended and mandatory properties, could be implemented.

Collaborator (author):

Adding runs - sure, the table can be enhanced easily. But actually I rather expect runs to be in a separate table. It's updated far more frequently, and to avoid blocking and undesired queuing of the requests a slightly different approach might be needed.

Collaborator (author):

Renaming though is a very good comment - what about ENTITIES? (I am going to use that; feel free to suggest others. 😉)

IN i_include_disabled BOOLEAN DEFAULT FALSE,
OUT schema_name TEXT,
OUT schema_latest_version INTEGER,
OUT disabled BOOLEAN
Contributor:

Add locked here as well?

Collaborator (author):

We can enhance it if such data are needed. (Adding new columns is backward compatible - an older app can query such a function without loss of functionality.) Right now it seems unnecessary.
disabled was added when I realized it could be a little confusing to pass the input parameter as true and then not know which entries are which.

@Zejnilovic (Contributor) left a comment:

Not done, just need a restart.

database/src/main/02_databases.ddl (resolved)
database/src/main/dataset_schema/add_schema.sql (outdated; resolved)
database/src/main/dataset_schema/add_schema.sql (outdated; resolved)
database/src/main/dataset_schema/add_schema.sql (outdated; resolved)
INTO status, status_text, id_schema, schema_name, schema_version,
schema_description, fields, created_by, created_when,
updated_by, updated_when, locked_by, locked_when,
disabled_by, disabled_when;
Contributor:

Suggested change
disabled_by, disabled_when;
disabled_by, disabled_when;

Comment on lines 73 to 76
ELSIF _disabled THEN
status := 405;
status_text := 'Schema has been disabled';
RETURN ;
Contributor:

Our current Menas re-enables the disabled version. We should come up with a stance on that

Collaborator (author):

Un/locking and en/disabling would be two new functions, only changing the values in the heads table. If we want to keep the history, I would think of having a journal table.

database/src/main/dataset_schema/add_schema.sql (outdated; resolved)
database/src/main/dataset_schema/get_schema.sql (outdated; resolved)
SELECT dsh.schema_name, dsh.schema_latest_version, dsh.disabled_when IS NOT NULL
FROM dataset_schema.heads dsh
WHERE i_include_disabled OR dsh.disabled_when IS NULL
ORDER BY schema_name; --TODO Include order by?
Contributor:

Current implementation is ordered by version

Collaborator (author):

This is a list of schemas. It doesn't make sense to have it ordered by the schema version when only the latest version number is returned. But it reminds me that a function returning the history (all versions) of a schema will probably be needed.

Contributor:

🤦‍♂️ Sorry, name, it is ordered by name.

Contributor:

This is a list of schemas. It doesn't make sense to have it ordered by the schema version when only the latest version number is returned. But it reminds me that a function returning the history (all versions) of a schema will probably be needed.

I presume these are not all the functions. If they are supposed to be, I will supply the list I came up with.

database/src/main/public/global_id.sql (outdated; resolved)
* stats table renamed
* dataset_schema.schemas -> dataset_schema.versions
sonarcloud bot commented Jan 14, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

Adrian-Olosutean and others added 2 commits March 9, 2022 22:05
* New class `TZNormalizedSparkTestBase` used over all tests instead of `SparkTestBase`
@benedeki benedeki marked this pull request as draft March 28, 2022 10:13
benedeki and others added 3 commits April 10, 2022 23:29
* rename of schema files
#2050: Unifying project space with GitHub defaults
* docs -> documentation
* Milestones are obsolete
* separated DB setup
* `..._when` renamed to `..._at`
@benedeki benedeki added the work in progress Work on this item is not yet finished (mainly intended for PRs) label Apr 19, 2022
* shorter function names
@benedeki benedeki removed the work in progress Work on this item is not yet finished (mainly intended for PRs) label Apr 19, 2022
@benedeki benedeki requested a review from AB006J9 April 19, 2022 21:27
@benedeki benedeki marked this pull request as ready for review April 19, 2022 21:29
benedeki and others added 4 commits April 19, 2022 23:38
* reordering of OUT parameters in `get`
* #2034: PG data model and functions for Mapping tables
* the tables

* Rework for more common naming

* * adjust to inherited table

* * fixes

* * mapping table: path->table_path

* * refactored to use connection of entities and versions via id

* * white spaces

* * add calls _add fix

* * check on entities

* #2035: PG data model and functions for Dataset (#2054)

* #2035: PG data model and functions for Dataset
* First commit

* * Added check on entity type

* * get function
* dataset fields

* * get comment fixes

* * refactored to use id in connection of entities and versions
* add function
sonarcloud bot commented Apr 26, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No Coverage information
No Duplication information

Base automatically changed from develop-ver-3.0 to develop June 3, 2022 13:54
6 participants