Remove collection_file_item table #324
Comments
What’s the error rate per total files? I think it’s just a race condition, which the new version of Kingfisher Process won’t have.
ContractsFinder returns a list of 100 release packages on each page (each containing a single release). The URL for the page is stored in

In the past, for ContractsFinder, Kingfisher Collect sent the original file as-is, and Kingfisher Process had to parse the array. For good reasons, Kingfisher Collect now sends the file items (the individual packages).

When Kingfisher Process receives a file item, it creates the file if it doesn't exist yet. If it receives multiple file items at once for the same file (which is occurring now), it might attempt to create the same file multiple times, causing an error.

This sort of issue occurs in a few places in the current version of Kingfisher Process, for which this level of concurrency hadn't been considered. It's an old issue that's occurring more frequently now (other spiders did send file items earlier).

For collection 1914, there are 2,529 collection_file rows, 245,129 collection_file_item rows, and 244,726 release rows. So, you're missing 403 releases due to this issue. Is that a problem for your use case? You currently have 99.8% of the data. In a few weeks, the new Kingfisher Process should be in place, which doesn't have this issue.

As further background, the idea behind However, this could just as well be modelled as a If no one cares about (Or, with the same outcome, we can remove

cc @jakubkrafka for awareness on this last point.
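To illustrate the race described above, here is a minimal sketch using SQLite from the Python standard library. It is not Kingfisher Process code: the table layout, the unique constraint on (collection_id, filename), and the function names are all assumptions made for the example. The first half replays the "check, then insert" pattern with two workers interleaved by hand; the second half shows one common way to make the creation idempotent.

```python
import sqlite3


def make_db():
    # Hypothetical, simplified table; the real schema is not shown in this thread.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collection_file ("
        " id INTEGER PRIMARY KEY,"
        " collection_id INTEGER NOT NULL,"
        " filename TEXT NOT NULL,"
        " UNIQUE (collection_id, filename))"
    )
    return conn


def file_exists(conn, collection_id, filename):
    row = conn.execute(
        "SELECT 1 FROM collection_file WHERE collection_id = ? AND filename = ?",
        (collection_id, filename),
    ).fetchone()
    return row is not None


def insert_file(conn, collection_id, filename):
    conn.execute(
        "INSERT INTO collection_file (collection_id, filename) VALUES (?, ?)",
        (collection_id, filename),
    )


def simulate_race(conn, collection_id, filename):
    """Replay two workers that both check before either one inserts."""
    worker_a_sees_file = file_exists(conn, collection_id, filename)
    worker_b_sees_file = file_exists(conn, collection_id, filename)
    errors = 0
    for sees_file in (worker_a_sees_file, worker_b_sees_file):
        if not sees_file:
            try:
                insert_file(conn, collection_id, filename)
            except sqlite3.IntegrityError:
                # The second insert violates the unique constraint.
                errors += 1
    return errors


def get_or_create_file(conn, collection_id, filename):
    """Idempotent alternative: let the database ignore the duplicate insert."""
    conn.execute(
        "INSERT OR IGNORE INTO collection_file (collection_id, filename) VALUES (?, ?)",
        (collection_id, filename),
    )
    return conn.execute(
        "SELECT id FROM collection_file WHERE collection_id = ? AND filename = ?",
        (collection_id, filename),
    ).fetchone()[0]
```

In this sketch, simulate_race reports one failed insert, which corresponds to one file item being rejected, while calling get_or_create_file twice for the same file returns the same row id both times.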
Thanks!
By 'creates a file' you mean an entry in the
It's not a critical problem, but I'm working on some standard error-checking queries to include in all analyst notebooks, so good to know that errors in
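As a rough sketch of the kind of check a notebook could run, the figures quoted earlier for collection 1914 can be reproduced from the two row counts. The function name is hypothetical, and the calculation assumes that every file item should yield exactly one release, as is the case for ContractsFinder packages.

```python
def completeness(file_item_rows, release_rows):
    """Return (missing releases, percentage of releases loaded)."""
    missing = file_item_rows - release_rows
    percent = round(100 * release_rows / file_item_rows, 1)
    return missing, percent


# Row counts quoted in this thread for collection 1914.
missing, percent = completeness(245_129, 244_726)
print(missing, percent)  # 403 99.8
```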
Yes
Noting that Process doesn't receive the

The number remains affixed to the

The number was originally useful when Process v1 received a full file (e.g. line-delimited JSON), which it then broke into FileItems. The number could then be used to index into the full file that was written to disk. Collect now guarantees individual packages and does the splitting on its side. So, there is no need for a number for indexing.

So, I've edited the issue title to not add
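For readers unfamiliar with the old behaviour, the following sketch shows how a per-item number can index back into a line-delimited JSON file. It is an illustration only, not Process v1 code: the function names are hypothetical, and zero-based numbering is an assumption.

```python
import io
import json


def split_into_items(stream):
    """Yield (number, package) pairs, one per non-empty line of the stream."""
    for number, line in enumerate(stream):
        line = line.strip()
        if line:
            yield number, json.loads(line)


def read_item(stream, number):
    """Use the stored number to find the same package in the full file again."""
    for current, line in enumerate(stream):
        if current == number:
            return json.loads(line)
    raise IndexError(number)
```

Once the sender splits the file and delivers each package separately, there is no full file on disk to index into, which is why the number loses its purpose.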
If we remove collection_file_item:
If we remove collection_file:
Removing collection_file_item is simpler.
Closed by 886159f
I see many of these errors in recent uk_contracts_finder collections (examples):

I think this is telling me that the data already exists in the database. What (if anything) should helpdesk analysts do about these errors?
cc @odscrachel