#2007: PG data model and functions for Schema #2008
base: develop
Conversation
* 1422 and 1423 Remove HDFS and Oozie from Menas
* #1422 Fix HDFS location validation
* #1424 Add Menas Dockerfile
* #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.
* #1416 - enceladus on S3: all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking). Add menasfargate into hosts: `sudo nano /etc/hosts`, paste `20.0.63.69 menasfargate`, save & exit (ctrl+O, ctrl+X). Running standardization works via: `spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt`
* #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output): all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking). Add menasfargate into hosts as above. Running conformance works via: `spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt`
* ref issue = 1416
* related test cases ignored (issue reference added)
* PR updates
* Merge spline 0.5.3 into aws-poc
* Update spline to 0.5.4 for AWS PoC
* #1503 Remove HDFS url Validation. This is a temporary solution: we currently experiment with many forms of URLs, and having a regex there now slows us down.
* New dockerfile - smaller image
* s3 persistence (atum, sdk fs usage, ...) (#1526)
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut. TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replaced by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir (hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive
* Feature/1556 file access PoC using Hadoop FS API (#1586)
* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)
* 1554 Tomcat with TLS in Docker container (#1585)
* #1554 Tomcat with TLS container
* #1554 Added envoy config + enabling running unencrypted container
* #1499 Add authentication to /lineage + update spline to 0.5.5
* #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)
* #1612 Separation start
* #1612 Updated DAO for spark-jobs
* #1612 Fixed spline integration and schema, removed redundant code
* #1612 Fixed tests, removed unused dependency
* #1612 Added back dependency
* WIP fixing merge issues
* Merge compiles; tests pass; depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)
* #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo
* #1612 Cookies + updated lineage
* Put back HDFS browser, put back Oozie, downgraded Spline
* AWS SDK Exclusion
* #1612 Included HDFSFolder + missing Oozie parts
* New ATUM version
* Adding missing files
* #1612 menas-web on nginx container and passing API_URL
* #1612 Working TLS on nginx, resources not included in code
* 1622: Merge of aws-poc to develop branch
* Addressed issues identified by reviewers
* comments improvement
* 1434 Add new way of serving properties to Docker
* #1612 Building using ui5 + reused /api route
* #1612 Project version
* #713 Add favicon
* #1612 Merges
* #1612 pom parent version
* #1648 Fix war deployment + adding back spline to menas
* #1612 other fixes
* #1612 added pom package.json version sync
* #1612 newline
* #1612 fix version sync + cleaning dist
* 1648 merge to develop
* 1648 merge fix
* 1648 Fixes schema upload
* 1648 Fixes schema registry request
* 1648 pom version
* 1612 add docker build
* #601 Swagger 2 PoC
* #1648 Updating menas-web to 3.0
* #1612 Updated npm project versions + mvn plugin
* #1612 license_check.yml
* #1612 licence check fix

Co-authored-by: Saša Zejnilović <[email protected]>
Co-authored-by: Daniel Kavan <[email protected]>
Co-authored-by: Jan Scherbaum <[email protected]>
Co-authored-by: David Benedeki <[email protected]>
# Conflicts:
#	data-model/src/main/scala/META-INF/MANIFEST.MF
#	menas-web/ui/components/dataset/conformanceRule/MappingConformanceRule/targetSchemaFieldSelector.fragment.xml
#	menas/ui/index.html
…velop-ver.30
# Conflicts:
#	menas/src/main/scala/za/co/absa/enceladus/menas/MvcConfig.scala
merging develop into develop-ver3.0 (via feature/merging-develop-ver.30 branch)
KafkaErrorSenderPluginSuite test fix for SourcePhase.{Standardization, Conformance} capitalization.
* #601 Swagger api docs
…XMLHack removed, the test holds. (#1783)
* #1770 Rename menas-web to menas
# Conflicts:
#	dao/pom.xml
#	data-model/pom.xml
#	examples/pom.xml
#	menas/pom.xml
#	migrations-cli/pom.xml
#	migrations/pom.xml
#	plugins-api/pom.xml
#	plugins-builtin/pom.xml
#	pom.xml
#	spark-jobs/pom.xml
#	utils/pom.xml
Spline 0.6 integration
# Conflicts:
#	menas/pom.xml
#	menas/ui/index.html
#	pom.xml
#	rest-api/src/main/scala/za/co/absa/enceladus/rest_api/WebSecurityConfig.scala
#	rest-api/src/main/scala/za/co/absa/enceladus/rest_api/controllers/DatasetController.scala
# Conflicts:
#	menas/pom.xml
…elate to rest_api, not menas: `/3rdParty/**` should not be added anymore as it is menas-related)
* Adding back Menas module, that somehow got omitted.
…lop-ver-3 Merging Release 2.23.0 into develop ver 3
Force-push: 224552d to e8fa5cf
I am by no means an expert in SQL (and even less in PG SQL), so I am just mentioning some things I found a bit unusual (I only read the code).
```sql
DROP FUNCTION IF EXISTS dataset_schema.add_schema(TEXT, INTEGER, TEXT, JSONB, TEXT);

CREATE OR REPLACE FUNCTION dataset_schema.add_schema(
```
Why drop the function when you later replace it anyway? (Wouldn't either `DROP` + `CREATE`, or just `CREATE OR REPLACE`, suffice?)
You are right. After the `DROP`, `CREATE` would be enough. I can change that for sure. (My template is set up with `CREATE OR REPLACE`, but unfortunately that cannot handle a change of the `OUT` parameters - I think it used to be able to.)
```sql
-- 403 - Schema already exists
-- 404 - Schema version wrong
-- 405 - Schema has been disabled
-- 406 - Schema is locked
```
This looks like HTTP/REST-status-code-compatible use, but the usage is quite non-standard. We should not be inventing status codes, I think. How about:

```sql
-- 201 - OK
-- 409 - (conflict) Schema already exists
-- 404 - (not found) Schema version wrong
-- 403 - (forbidden) Schema has been disabled
-- 403 - (forbidden) Schema is locked
```

By that I mean:

- 404 is clear - incorrect path = not found
- 409 Conflict is typical for out-of-sync edits
- for the rest, 403 - forbidden (i.e. the request is valid, but the state of the entity does not allow it to continue). Here, the difference would have to be documented by the `status_text`.

If the above is unsatisfactory, perhaps 423 Locked from WebDAV use could be misused; it is pretty close:

```sql
-- 409 - (conflict) Schema already exists
-- 404 - (not found) Schema version wrong
-- 403 - (forbidden) Schema has been disabled
-- 423 - (locked) Schema is locked (maybe)
```

A relevant info source: https://www.restapitutorial.com/httpstatuscodes.html
I am most happy to change the status codes; I didn't put too much effort into their choice. But I would avoid giving the same code to different (failure) reasons where only the text distinguishes them - i.e. the first suggestion. The second sounds good, I will implement it.
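As a sketch of how the agreed-upon second suggestion could look in the function body (a hypothetical helper, just to illustrate the mapping; the flag names are assumptions):

```sql
CREATE OR REPLACE FUNCTION dataset_schema._map_state_to_status(
    IN  i_exists    BOOLEAN,
    IN  i_disabled  BOOLEAN,
    IN  i_locked    BOOLEAN,
    OUT status      INTEGER,
    OUT status_text TEXT
) AS $$
BEGIN
    IF i_exists THEN
        status := 409; status_text := 'Schema already exists';    -- conflict
    ELSIF i_disabled THEN
        status := 403; status_text := 'Schema has been disabled'; -- forbidden
    ELSIF i_locked THEN
        status := 423; status_text := 'Schema is locked';         -- locked (WebDAV)
    ELSE
        status := 201; status_text := 'OK';                       -- created
    END IF;
END;
$$ LANGUAGE plpgsql;
```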
```sql
-- Status codes:
-- 200 - OK
-- 404 - Schema does not exist
-- 405 - Schema of the given version does not exist
```
My vote here is to use the standard 404 code and just differentiate by the `status_text`:

```diff
- -- 405 - Schema of the given version does not exist
+ -- 404 - Schema of the given version does not exist
```
I am happy to use a better code, but not the same one, 404.
I understand that you want to distinguish the cases here, but I still believe that 404 is appropriate for both. The code says the requested entity was not found, however it is defined. So `get_schema(existingSchema, notExistingVersion)` does not exist the same way as `get_schema(bogusSchema, regardlessVersion)` does not.

Btw. just schema existence, regardless of the version (something that 405 should signify), can be checked by `get_schema(existingSchema, NULL)`, if I read the code correctly.

405 is Method Not Allowed (meaning the HTTP methods like GET, POST, ...), so it does not seem fitting. It would be nice to find an alternative status code with a proper meaning, but they don't just give out `404B`'s :)
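If that reading is right, the existence check would be just this (illustrative call; the schema name is made up, the signature is the one from this PR):

```sql
-- status = 200 when the schema exists in any version, 404 when it does not
SELECT status, status_text
FROM dataset_schema.get_schema('SomeExampleSchema', NULL);
```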
```sql
*/

--DROP SEQUENCE IF EXISTS public.global_id_seq
```
Maybe uncomment?
Same as in the case of tables - to prevent data loss and a potential ID conflict.
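If a rerunnable script is still wanted, a possible middle ground is a sketch like this (note `CREATE SEQUENCE IF NOT EXISTS` requires PostgreSQL 9.5+):

```sql
-- Keep the destructive command commented out and create only when missing,
-- so a rerun neither fails nor resets the sequence counter.
--DROP SEQUENCE IF EXISTS public.global_id_seq;
CREATE SEQUENCE IF NOT EXISTS public.global_id_seq;
```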
```sql
 * limitations under the License.
 */

-- DROP TABLE dataset_schema.heads;
```
```diff
- -- DROP TABLE dataset_schema.heads;
+ DROP TABLE IF EXISTS dataset_schema.heads;
```
And the same elsewhere, perhaps?
`IF EXISTS` - sure. But it's commented out on purpose, so an accidental run won't result in data loss, just an error from trying to create an already existing table.
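In other words, the deployed file follows this pattern (sketched here; the columns are illustrative, the real DDL of `heads` may differ):

```sql
-- Intentionally commented out: an accidental rerun then fails on CREATE
-- with "relation already exists" instead of silently dropping the data.
-- DROP TABLE IF EXISTS dataset_schema.heads;
CREATE TABLE dataset_schema.heads
(
    schema_name           TEXT      NOT NULL,
    schema_latest_version INTEGER   NOT NULL,
    disabled_when         TIMESTAMP
);
```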
* added disabled flag in return of list_schemas.sql
```sql
CREATE TABLE stats.jobs_configurations
(
    schema_count INTEGER NOT NULL,
```
This was supposed to be the landing page statistics? If yes, could you include runs and properties? Also, I am not sure about the naming.
I think some procedures could be implemented to compute the other fields, like today's run statistics or the total number of missing recommended and mandatory properties.
Adding runs - sure, the table can be enhanced easily. But I actually rather expect runs to be in a separate table. It's updated far more frequently, and to avoid blocking and undesired queuing of the requests, a little different approach might be needed.
Renaming though is a very good comment - what about `ENTITIES`? (I am going to use that, feel free to suggest another. 😉)
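A hypothetical sketch of the renamed table (only `schema_count` appears in the diff above; the other columns are assumptions, and a later commit in this PR reworks the stats to be row-based rather than column-based):

```sql
CREATE TABLE stats.entities
(
    schema_count        INTEGER NOT NULL,
    dataset_count       INTEGER NOT NULL,
    mapping_table_count INTEGER NOT NULL
);
```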
```sql
IN i_include_disabled BOOLEAN DEFAULT FALSE,
OUT schema_name TEXT,
OUT schema_latest_version INTEGER,
OUT disabled BOOLEAN
```
Add locked here as well?
We can enhance it if such data are needed. (Adding new columns is backward compatible - an older app can query such a function without loss of functionality.) Right now it seems unnecessary. `disabled` was added when I realized it could be a little confusing to add the input parameter `true` and not know which entries are which.
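The backward compatibility works because a caller names only the columns it knows; e.g. an older app keeps working against the extended function (illustrative query, using the function name as it stands at this point of the review):

```sql
-- Still valid after new OUT columns are added to the function
SELECT schema_name, schema_latest_version
FROM dataset_schema.list_schemas(i_include_disabled := TRUE);
```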
Not done, just needs a restart.
```sql
INTO status, status_text, id_schema, schema_name, schema_version,
    schema_description, fields, created_by, created_when,
    updated_by, updated_when, locked_by, locked_when,
    disabled_by, disabled_when;
```
A whitespace-only suggestion (the two lines differ only in trailing whitespace):

```diff
-    disabled_by, disabled_when; 
+    disabled_by, disabled_when;
```
```sql
ELSIF _disabled THEN
    status := 405;
    status_text := 'Schema has been disabled';
    RETURN;
```
Our current Menas re-enables the disabled version. We should come up with a stance on that.
Un/locking and en/disabling would be two new functions. They would only change the values in the `heads` table. If we want to keep the history, I would think of having a journal table.
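A sketch of the journal idea, should the history be wanted (table and column names are assumptions, not part of this PR):

```sql
-- Un/locking and en/disabling would flip the flags in heads and append here:
CREATE TABLE dataset_schema.heads_journal
(
    schema_name  TEXT        NOT NULL,
    action       TEXT        NOT NULL, -- 'lock' | 'unlock' | 'disable' | 'enable'
    changed_by   TEXT        NOT NULL,
    changed_when TIMESTAMPTZ NOT NULL DEFAULT now()
);
```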
```sql
SELECT dsh.schema_name, dsh.schema_latest_version, dsh.disabled_when IS NOT NULL
FROM dataset_schema.heads dsh
WHERE i_include_disabled OR dsh.disabled_when IS NULL
ORDER BY schema_name; --TODO Include order by?
```
The current implementation is ordered by version.
This is a list of schemas. It doesn't make sense to have it ordered by the schema version when only the latest version number is returned. But it reminds me that a function returning the history (all versions) of a schema will probably be needed.
🤦♂️ Sorry - name, it is ordered by name.
> This is a list of schemas. It doesn't make sense to have it ordered by the schema version when only the latest version number is returned. But it reminds me that a function returning the history (all versions) of a schema will probably be needed.

I presume these are not all the functions. If they are supposed to be, I will supply the list I come up with.
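For the history function mentioned above, a sketch of what it might look like (the name `get_schema_history` and the `versions` table layout are assumptions; a later commit renames `dataset_schema.schemas` to `dataset_schema.versions`):

```sql
CREATE OR REPLACE FUNCTION dataset_schema.get_schema_history(
    IN  i_schema_name  TEXT,
    OUT schema_version INTEGER,
    OUT created_by     TEXT,
    OUT created_when   TIMESTAMP
) RETURNS SETOF record AS $$
    SELECT dsv.schema_version, dsv.created_by, dsv.created_when
    FROM dataset_schema.versions dsv
    WHERE dsv.schema_name = i_schema_name
    ORDER BY dsv.schema_version;
$$ LANGUAGE sql;
```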
* stats table renamed
* dataset_schema.schemas -> dataset_schema.versions
Kudos, SonarCloud Quality Gate passed!
* rename of schema files
* shorter function names
…nctions-for-schema
* reordering of OUT parameters in `get`
…ttps://github.com/AbsaOSS/enceladus into feature/2007-pg-data-model-and-functions-for-schema
* stats are row based not column based
* #2034: PG data model and functions for Mapping tables
  * the tables
  * Rework for more common naming
  * adjust to inherited table
  * fixes
  * mapping table: path -> table_path
  * refactored to use connection of entities and versions via id
  * white spaces
  * add calls _add fix
  * check on entities
* #2035: PG data model and functions for Dataset (#2054)
  * First commit
  * Added check on entity type
  * get function
  * dataset fields
  * get comment fixes
  * refactored to use id in connection of entities and versions
  * add function
Kudos, SonarCloud Quality Gate passed!
Closes #2007
This is a first suggestion of how the Postgres DB for Enceladus should look, but also a picture of how I would like us to model our database code. Of course the names and/or logic are not beyond question - the opposite is true. Please comment, disagree, ask!
Deployment:

* The scripts are to be run by a super-user, particularly the first ones: `_schema.ddl` (which is to create the schema), the `.ddl` files, and `functions.sql`.
* `.sql` files are idempotent and can be redeployed repeatedly.
* `.ddl` files are not idempotent, but usually contain a commented-out `DROP` command which allows redeployment, at the price of a potential data loss.

The end goal is to have `.sql` files for the functions and `.json` files describing the tables. A deployment tool would then generate the "migration" to ensure the tables are up to the description.

P.S. Tables in their current format are missing descriptions. Eventually they should have them too.
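Assuming the file layout described above, a redeployment could then look like this psql script (file names other than `_schema.ddl` and `functions.sql` are illustrative):

```sql
-- Run as a super-user; \i executes a file, so order matters.
\i _schema.ddl    -- first deployment only: creates the schema (not idempotent)
\i heads.ddl      -- not idempotent; its DROP is commented out on purpose
\i functions.sql  -- idempotent, safe to rerun on every deployment
```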