Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insert should reject repetitive inserts with same key #2

Open
tzolov opened this issue Mar 22, 2017 · 1 comment
Open

Insert should reject repetitive inserts with same key #2

tzolov opened this issue Mar 22, 2017 · 1 comment

Comments

@tzolov
Copy link
Owner

tzolov commented Mar 22, 2017

Existing JournalledInsertRule implementation is designed for high throughput and minimal overhead.
Rule converts INSERT on virtual table into INSERT on the target journal table. In practice the performance of the SQL-Rewriter INSERT statement should match this of a standard HAWQ Insert operation.

The disadvantage of current design is that it will not prevent repetitive INSERTS on of rows with same-key.
This can be resolved by Insert rule implementation that uses INSERT/SELECT semantics similar to this of the Update and Delete rules. The disadvantage of this approach is the increased performance overhead.

As a trade off i can keep both rules and allow the User to configure the one or the other depends on the use case

@davidje13
Copy link
Collaborator

Note that this issue only affects use-cases where the primary key is being given an explicit value (rather than relying on an autoincrementing index, which of course prevents colliding IDs). Also any systems which generate UUIDs should be OK (ignoring type-4-collisions at least).

Also note that this issue is different depending on the choice of version column:

  • TIMESTAMP will allow duplicate items with the same primary key
  • BIGINT will block duplicate items (unless old data is archived away), but will also block INSERT-DELETE-INSERT (i.e. once an ID has been taken, it will not be available again even after deleting)

If a safer implementation is added, I would suggest: (this reasoning is for TIMESTAMP; I haven't given BIGINT much thought)

  • Add a configuration option (maybe journalled_insert_mode or similar) with 3 values; something like: "fast", "balanced" and "safe". Set to "balanced" by default.
  • When an INSERT is detected:
    • If "fast", always use existing fast method
    • If "safe", always use new safe method
    • If "balanced", use new safe method if one of the primary key columns has been given an explicit value, and fast method otherwise

Note that "balanced" is likely to be best for most use cases, but will not catch cases such as:

  • auto-incrementing ID adds "4", "5"
  • manual query adds "7" (safe insert used; deemed OK)
  • auto-incrementing ID adds "6", "7" (fast insert used; oops)

Which is what "safe" would exist for.


For creating the new functionality, there are some existing tests for the insert rule in insertIntegrationTest.java; look for and remove the Assume.assumeTrue lines to enable them (currently one of the tests is enabled only for TIMESTAMP, and the other only for BIGINT, due to the different behaviour described above)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants