Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove collection_file_item table #324

Closed
duncandewhurst opened this issue Feb 18, 2021 · 9 comments
Closed

Remove collection_file_item table #324

duncandewhurst opened this issue Feb 18, 2021 · 9 comments
Labels
database Changes to the database (adding indices, renaming columns) refactor
Milestone

Comments

@duncandewhurst
Copy link

duncandewhurst commented Feb 18, 2021

I see many of these errors in recent uk_contracts_finder collections (examples):

['(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "unique_collection_file_identifiers"\nDETAIL:  Key (collection_id, filename)=(1914, page-2386.json) already exists.\n [SQL: \'INSERT INTO collection_file (collection_id, filename, url) VALUES (%(collection_id)s, %(filename)s, %(url)s) RETURNING collection_file.id\'] [parameters: {\'collection_id\': 1914, \'filename\': \'page-2386.json\', \'url\': \'https://www.contractsfinder.service.gov.uk/Published/Notices/OCDS/Search?order=desc&page=2386\'}] (Background on this error at: http://sqlalche.me/e/gkpj)']

I think this is telling me that the data already exists in the database. What (if anything) should helpdesk analysts do about these errors?

cc @odscrachel

@jpmckinney
Copy link
Member

What’s the error rate per total files? I think it’s just a race condition, which the new version of Kingfisher Process won’t have.

@duncandewhurst
Copy link
Author

collection_file has ~2,500 entries and collection_file_item_errors has ~400 entries for the above error. Although, I'm not clear on the meaning of the number column in collection_file_item_erorrs, which sums to ~2,800.

@jpmckinney
Copy link
Member

jpmckinney commented Feb 18, 2021

ContractsFinder returns a list of 100 release packages on each page (each containing a single release). The URL for the page is stored in collection_file, and then each package causes the addition of a collection_file_item, numbered from 1 to 100.

In the past, for ContractsFinder, Kingfisher Collect sent the original file as-is, and Kingfisher Process had to parse the array. For good reasons, Kingfisher Collect now sends the file items (the individual packages).

When Kingfisher Process receives a file item, it creates the file if it doesn't exist yet. If it receives multiple file items at once for the same file (which is occurring now), it might attempt to create the same file multiple times - causing an error.

This sort of issue occurs in a few places in the current version of Kingfisher Process, for which this level of concurrency hadn't been considered. It's an old issue that's occurring more frequently now (other spiders did send file items earlier).

For collection 1914, there are 2529 collection_file rows, 245,129 collection_file_item rows, and 244,726 release rows. So, you're missing 403 releases due to this issue. Is that a problem for your use case? You currently have 99.8% of the data.

In a few weeks, the new Kingfisher Process should be in place, which doesn't have this issue.


As further background, the idea behind collection_file and collection_file_item is to maintain provenance information in cases where a file contains multiple OCDS documents.

However, this could just as well be modelled as a collection_file table with a number column, which is 0 if the file itself is an OCDS document, and 1 and upwards if the file contains multiple OCDS documents.

If no one cares about collection_file_item (or even knows about or understands it), then removing it could simplify the Kingfisher Process architecture.

(Or, with the same outcome, we can remove collection_file, and have collection_file_item relate directly to collection.)

cc @jakubkrafka for awareness on this last point.

@jpmckinney jpmckinney added the bug Something isn't working label Feb 18, 2021
@jpmckinney jpmckinney added this to the V1 milestone Feb 18, 2021
@duncandewhurst
Copy link
Author

Thanks!

When Kingfisher Process receives a file item, it creates the file if it doesn't exist yet.

By 'creates a file' you mean an entry in the collection_file table, right?

For collection 1914, there are 2529 collection_file rows, 245,129 collection_file_item rows, and 244,726 release rows. So, you're missing 403 releases due to this issue. Is that a problem for your use case? You currently have 99.8% of the data.

It's not a critical problem, but I'm working on some standard error-checking queries to include in all analyst notebooks, so good to know that errors in collection_file_item_errors mean that Kingfisher Process failed to load some data and that, in this case at least, it's not due to an issue with the publisher's data.

@jpmckinney
Copy link
Member

By 'creates a file' you mean an entry in the collection_file table, right?

Yes

@jpmckinney jpmckinney changed the title psycopg2.errors.UniqueViolation - action for helpdesk analysts? Remove collection_file_item table, and add number to collection_file table Aug 19, 2021
@jpmckinney jpmckinney modified the milestones: V1, V3 Aug 19, 2021
@jpmckinney jpmckinney added framework and removed bug Something isn't working labels Aug 19, 2021
@jpmckinney jpmckinney added database Changes to the database (adding indices, renaming columns) and removed framework labels Jun 8, 2022
@jpmckinney jpmckinney modified the milestones: V3, Priority Jun 8, 2022
@jpmckinney jpmckinney modified the milestones: Database changes, Priority Apr 12, 2024
@jpmckinney
Copy link
Member

Noting that Process doesn't receive the number for a FileItem, and the number is presently always 0.

The number remains affixed to the filename, though it could be hard to disambiguate from the basename.

The number was originally useful when Process v1 received a full file (e.g. line-delimited JSON), which it then broke into FileItems. The number could then be used to index into the full file that was written to disk. Collect now guarantees individual packages and does the splitting on its side. So, there is no need for a number for indexing.

So, I've edited the issue title to not add number to collection_file.

@jpmckinney jpmckinney changed the title Remove collection_file_item table, and add number to collection_file table Remove collection_file_item table Apr 15, 2024
@jpmckinney jpmckinney changed the title Remove collection_file_item table Remove either collection_file or collection_file_item table Apr 15, 2024
@jpmckinney
Copy link
Member

jpmckinney commented Apr 15, 2024

If we remove collection_file_item:

  • Change release, record and compiled_release to refer to the collection_file directly instead of through collection_file_item
  • Remove code in file_worker and compiler that creates collection_file_item rows (an essentially useless row, as it just links to the collection_file and sets number to 0)
  • Simplify collection_file_item__collection_file lookups
  • tests: Edit collection_file_item.collection_file code in test_compile_release.py
  • tests: Update complete_db.json

If we remove collection_file:

  • Move filename and url from collection_file onto collection_file_item
  • Change processing_step to refer to a collection_file_item
  • Change collection_file_item to refer to a collection
  • Change a lot more code, since collection_file_id is used in messages

Removing collection_file_item is simpler.

@jpmckinney jpmckinney changed the title Remove either collection_file or collection_file_item table Remove either collection_file_item table Apr 15, 2024
@jpmckinney jpmckinney changed the title Remove either collection_file_item table Remove collection_file_item table Apr 15, 2024
@jpmckinney
Copy link
Member

Closed by 886159f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
database Changes to the database (adding indices, renaming columns) refactor
Projects
None yet
Development

No branches or pull requests

2 participants