insert_or_update can have problems with multiple "issue"s #1427

Open
@melange396

Description

insert_or_update() (which is really both insert_or_update_batch() and insert_or_update_bulk() right now; they should be consolidated into one in #989) can have problems when given a set of rows that includes more than one issue. Multiple issue values are not problematic in themselves, but the desired result may not be achieved when a single batch contains multiple rows for the same source/signal/time/geo that carry different issues (for example, one row with issue=5 and another with issue=6 for the same key).

The problem stems from the fact that the SQL in fix_is_latest_issue_sql naively assumes only one issue will be present for each (source, signal, geo_type, geo_value, time_type, time_value). Note that it presumes most inserted rows will be a latest issue, which is why the is_latest_issue column defaults to 1 in insert_into_loader_sql. Note also that "overwrites" happen not only when load.issue > latest.issue but ALSO when the two are equal, which lets us update/patch previously-written values that may have been incorrect (the SQL that moves the load table data to the latest and history tables then has an ON DUPLICATE clause that preserves the unique key epimetric_id for the same unique composite key tuple).
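To make the failure mode concrete, the fix-up performed by fix_is_latest_issue_sql has roughly the shape sketched below. This is a hedged illustration, not the actual statement from the codebase; the table names epimetric_load and epimetric_latest and the exact column list are assumptions:

```sql
-- Illustrative sketch only; table names (epimetric_load, epimetric_latest) are assumed.
-- Load rows arrive with is_latest_issue defaulted to 1; this clears the flag whenever
-- the latest table already holds a strictly newer issue for the same composite key.
UPDATE epimetric_load l
JOIN epimetric_latest v
  USING (source, `signal`, geo_type, geo_value, time_type, time_value)
SET l.is_latest_issue = 0
WHERE l.issue < v.issue;
-- Nothing here compares load rows against each other, so two load rows that share a
-- composite key but differ in issue can both keep is_latest_issue = 1.
```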

In practice, the problematic behavior does not occur, because the only place insert_or_update() is called (csv_to_database.py:upload_archive()) hands it rows returned by CsvImporter.load_csv(), which only uses a single issue value at a time. However, we might write something in the future that does not account for this expected invariant, so we should take steps to resolve it now.

Non-compliant calls to insert_or_update() can still happen in test code via calls to _insert_rows(). This GitHub search provides a list of them.

Possible ways to address this:

  • add a constraint to the load table to enforce that no multiple-issue collisions happen
    • can be done just by removing issue from its UNIQUE INDEX
    • doesn't actually solve the problem, but does prevent us from suffering the ill effects
  • fix the fix_is_latest_issue_sql calculation to work with multiple issues
    • perhaps by incorporating a GROUP BY
    • perhaps by inserting a SQL statement before it that does its own GROUP BY and sets is_latest_issue=0 for non-maximal issue rows (see the sketch after this list)
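For the second option, such a pre-pass could look roughly like the sketch below. Table and column names (epimetric_load, plus the composite-key columns listed above) are assumptions, and this illustrates the idea rather than proposing a final statement:

```sql
-- Hypothetical pre-pass over the load table only (epimetric_load is an assumed name):
-- for any composite key that appears with more than one issue, clear is_latest_issue
-- on every row except the one carrying the maximal issue.
UPDATE epimetric_load l
JOIN (
  SELECT source, `signal`, geo_type, geo_value, time_type, time_value,
         MAX(issue) AS max_issue
  FROM epimetric_load
  GROUP BY source, `signal`, geo_type, geo_value, time_type, time_value
) m USING (source, `signal`, geo_type, geo_value, time_type, time_value)
SET l.is_latest_issue = 0
WHERE l.issue < m.max_issue;
```

After a pre-pass like this, at most one row per composite key in the load table still carries is_latest_issue=1, restoring the single-issue assumption that fix_is_latest_issue_sql currently relies on.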

Unit/integration testing for these conditions should be relatively easy, and some work on this appears to have already been done in #925.
