Remove collection_file_item table #324

duncandewhurst · 2021-02-18T01:48:21Z

I see many of these errors in recent uk_contracts_finder collections (examples):

['(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "unique_collection_file_identifiers"\nDETAIL:  Key (collection_id, filename)=(1914, page-2386.json) already exists.\n [SQL: \'INSERT INTO collection_file (collection_id, filename, url) VALUES (%(collection_id)s, %(filename)s, %(url)s) RETURNING collection_file.id\'] [parameters: {\'collection_id\': 1914, \'filename\': \'page-2386.json\', \'url\': \'https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&page=2386\'}] (Background on this error at: http://sqlalche.me/e/gkpj)']

I think this is telling me that the data already exists in the database. What (if anything) should helpdesk analysts do about these errors?

cc @odscrachel

The text was updated successfully, but these errors were encountered:

jpmckinney · 2021-02-18T02:31:34Z

What’s the error rate per total files? I think it’s just a race condition, which the new version of Kingfisher Process won’t have.

duncandewhurst · 2021-02-18T19:50:14Z

collection_file has ~2,500 entries and collection_file_item_errors has ~400 entries for the above error. Although, I'm not clear on the meaning of the number column in collection_file_item_erorrs, which sums to ~2,800.

jpmckinney · 2021-02-18T20:26:41Z

ContractsFinder returns a list of 100 release packages on each page (each containing a single release). The URL for the page is stored in collection_file, and then each package causes the addition of a collection_file_item, numbered from 1 to 100.

In the past, for ContractsFinder, Kingfisher Collect sent the original file as-is, and Kingfisher Process had to parse the array. For good reasons, Kingfisher Collect now sends the file items (the individual packages).

When Kingfisher Process receives a file item, it creates the file if it doesn't exist yet. If it receives multiple file items at once for the same file (which is occurring now), it might attempt to create the same file multiple times - causing an error.

This sort of issue occurs in a few places in the current version of Kingfisher Process, for which this level of concurrency hadn't been considered. It's an old issue that's occurring more frequently now (other spiders did send file items earlier).

For collection 1914, there are 2529 collection_file rows, 245,129 collection_file_item rows, and 244,726 release rows. So, you're missing 403 releases due to this issue. Is that a problem for your use case? You currently have 99.8% of the data.

In a few weeks, the new Kingfisher Process should be in place, which doesn't have this issue.

As further background, the idea behind collection_file and collection_file_item is to maintain provenance information in cases where a file contains multiple OCDS documents.

However, this could just as well be modelled as a collection_file table with a number column, which is 0 if the file itself is an OCDS document, and 1 and upwards if the file contains multiple OCDS documents.

If no one cares about collection_file_item (or even knows about or understands it), then removing it could simplify the Kingfisher Process architecture.

(Or, with the same outcome, we can remove collection_file, and have collection_file_item relate directly to collection.)

cc @jakubkrafka for awareness on this last point.

duncandewhurst · 2021-02-18T21:50:32Z

Thanks!

When Kingfisher Process receives a file item, it creates the file if it doesn't exist yet.

By 'creates a file' you mean an entry in the collection_file table, right?

For collection 1914, there are 2529 collection_file rows, 245,129 collection_file_item rows, and 244,726 release rows. So, you're missing 403 releases due to this issue. Is that a problem for your use case? You currently have 99.8% of the data.

It's not a critical problem, but I'm working on some standard error-checking queries to include in all analyst notebooks, so good to know that errors in collection_file_item_errors mean that Kingfisher Process failed to load some data and that, in this case at least, it's not due to an issue with the publisher's data.

jpmckinney · 2021-02-18T21:56:21Z

By 'creates a file' you mean an entry in the collection_file table, right?

Yes

jpmckinney · 2024-04-15T15:05:37Z

Noting that Process doesn't receive the number for a FileItem, and the number is presently always 0.

The number remains affixed to the filename, though it could be hard to disambiguate from the basename.

The number was originally useful when Process v1 received a full file (e.g. line-delimited JSON), which it then broke into FileItems. The number could then be used to index into the full file that was written to disk. Collect now guarantees individual packages and does the splitting on its side. So, there is no need for a number for indexing.

So, I've edited the issue title to not add number to collection_file.

jpmckinney · 2024-04-15T15:11:16Z

If we remove collection_file_item:

Change release, record and compiled_release to refer to the collection_file directly instead of through collection_file_item
Remove code in file_worker and compiler that creates collection_file_item rows (an essentially useless row, as it just links to the collection_file and sets number to 0)
Simplify collection_file_item__collection_file lookups
tests: Edit collection_file_item.collection_file code in test_compile_release.py
tests: Update complete_db.json

If we remove collection_file:

Move filename and url from collection_file onto collection_file_item
Change processing_step to refer to a collection_file_item
Change collection_file_item to refer to a collection
Change a lot more code, since collection_file_id is used in messages

Removing collection_file_item is simpler.

jpmckinney · 2024-11-02T03:11:34Z

PRs:

jpmckinney · 2024-11-02T05:31:13Z

Closed by 886159f

jpmckinney added the bug Something isn't working label Feb 18, 2021

jpmckinney added this to the V1 milestone Feb 18, 2021

duncandewhurst mentioned this issue Feb 18, 2021

Update items description #325

Merged

jpmckinney changed the title ~~psycopg2.errors.UniqueViolation - action for helpdesk analysts?~~ Remove collection_file_item table, and add number to collection_file table Aug 19, 2021

jpmckinney modified the milestones: V1, V3 Aug 19, 2021

jpmckinney added framework and removed bug Something isn't working labels Aug 19, 2021

jpmckinney added database Changes to the database (adding indices, renaming columns) and removed framework labels Jun 8, 2022

jpmckinney modified the milestones: V3, Priority Jun 8, 2022

jpmckinney added the refactor label Jul 10, 2023

jpmckinney modified the milestones: Database changes, Priority Apr 12, 2024

jpmckinney changed the title ~~Remove collection_file_item table, and add number to collection_file table~~ Remove collection_file_item table Apr 15, 2024

jpmckinney changed the title ~~Remove collection_file_item table~~ Remove either collection_file or collection_file_item table Apr 15, 2024

jpmckinney changed the title ~~Remove either collection_file or collection_file_item table~~ Remove either collection_file_item table Apr 15, 2024

jpmckinney changed the title ~~Remove either collection_file_item table~~ Remove collection_file_item table Apr 15, 2024

jpmckinney closed this as completed Nov 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove collection_file_item table #324

Remove collection_file_item table #324

duncandewhurst commented Feb 18, 2021 •

edited

Loading

jpmckinney commented Feb 18, 2021

duncandewhurst commented Feb 18, 2021

jpmckinney commented Feb 18, 2021 •

edited

Loading

duncandewhurst commented Feb 18, 2021

jpmckinney commented Feb 18, 2021

jpmckinney commented Apr 15, 2024

jpmckinney commented Apr 15, 2024 •

edited

Loading

jpmckinney commented Nov 2, 2024

jpmckinney commented Nov 2, 2024

Remove collection_file_item table #324

Remove collection_file_item table #324

Comments

duncandewhurst commented Feb 18, 2021 • edited Loading

jpmckinney commented Feb 18, 2021

duncandewhurst commented Feb 18, 2021

jpmckinney commented Feb 18, 2021 • edited Loading

duncandewhurst commented Feb 18, 2021

jpmckinney commented Feb 18, 2021

jpmckinney commented Apr 15, 2024

jpmckinney commented Apr 15, 2024 • edited Loading

jpmckinney commented Nov 2, 2024

jpmckinney commented Nov 2, 2024

duncandewhurst commented Feb 18, 2021 •

edited

Loading

jpmckinney commented Feb 18, 2021 •

edited

Loading

jpmckinney commented Apr 15, 2024 •

edited

Loading