Hotfix/2025 02 24 4067 point in time scroll #2459

Open: wants to merge 6 commits into base: master

richard-jones
Contributor

@richard-jones richard-jones commented Mar 10, 2025


PIT iteration and anon export

Uses search_after (and optionally PIT searching) to improve full index iteration performance, and introduces some performance enhancements to the anon_export process

This PR...

  • has scripts to run
  • has migrations to run
  • adds new infrastructure
  • changes the CI pipeline
  • affects the public site
  • affects the editorial area
  • affects the publisher area
  • affects the monitoring

Developer Checklist

Developers should review and confirm each of these items before requesting review

  • Code meets acceptance criteria from issue
  • Unit tests are written and all pass
  • User Test Scripts (if required) are written and have been run through
  • Project's coding standards are met
    • No deprecated methods are used
    • No magic strings/numbers - all strings are in constants or messages files
    • ES queries are wrapped in a Query object rather than inlined in the code
    • Where possible our common library functions have been used (e.g. dates manipulated via dates)
    • Cleaned up commented out code, etc
    • Urls are constructed with url_for not hard-coded
  • Code documentation and related non-code documentation has all been updated
  • Migration has been created and tested
  • There is a recent merge from develop

Reviewer Checklist

Reviewers should review and confirm each of these items before approval
If there are multiple reviewers, this section should be duplicated for each reviewer

  • Code meets acceptance criteria from issue
  • Unit tests are written and all pass
  • User Test Scripts (if required) are written and have been run through
  • Project's coding standards are met
    • No deprecated methods are used
    • No magic strings/numbers - all strings are in constants or messages files
    • ES queries are wrapped in a Query object rather than inlined in the code
    • Where possible our common library functions have been used (e.g. dates manipulated via dates)
    • Cleaned up commented out code, etc
    • Urls are constructed with url_for not hard-coded
  • Code documentation and related non-code documentation has all been updated
  • Migration has been created and tested
  • There is a recent merge from develop

Testing

This needs to be deployed to test, and then anon_export needs to be run: first from the command line to confirm behaviour, and then on a schedule via the background jobs. If both are successful, the next step is to confirm that the data can be re-imported from the export. To do that, export to the local machine, and then re-import from the local machine.

To export to the local machine, ensure the following setting:

STORE_IMPL = "portality.store.StoreLocal"

To import from the local machine, run the command with the following arguments:

python anon_import.py -s local [path to import config]

Deployment

Scripts

Once this has been deployed, the anon_export.py script should be run immediately to bring the anonymous data on S3 up to date.

New Infrastructure

By default this code DOES NOT require any infrastructure changes. It uses a search_after approach which will work on the current infrastructure, but does not absolutely guarantee the coherence of the output (probably good enough for testing).
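As a rough illustration of the search_after approach (a sketch only, not the code in this PR): each page is requested with a sort, and the sort values of the last hit are passed back as `search_after` for the next page. The function name `iterate_search_after` and the `search` callable are hypothetical; `search` is assumed to take an ES request body and return the usual response dict, so it could wrap whatever client call the DAO already makes.

```python
from typing import Any, Callable, Dict, Iterator, List


def iterate_search_after(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],  # hypothetical wrapper around the ES search call
    query: Dict[str, Any],
    sort: List[Any],
    page_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, one page at a time, using search_after."""
    search_after = None
    while True:
        # A sort is required: search_after resumes from the last hit's sort values
        body: Dict[str, Any] = {"query": query, "sort": sort, "size": page_size}
        if search_after is not None:
            body["search_after"] = search_after

        hits = search(body)["hits"]["hits"]
        if not hits:
            return  # an empty page means the index has been exhausted

        for hit in hits:
            yield hit

        # Each hit carries the sort values that place it in the result order
        search_after = hits[-1]["sort"]
```

The sort should end in a field that is unique per document, otherwise records with equal sort values can be skipped or repeated at page boundaries. Because each page is a fresh search, documents written or deleted during the iteration may or may not appear, which is the coherence caveat noted above.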

This code change also allows for the possibility of using PIT search, which is only available in the default distribution of ES 7.10.x; it WILL NOT WORK on OSS 7.10. It also WILL NOT WORK on OpenSearch of any version: although PIT is supported in OS 2.x, it is not known whether the ES client library will work with it, as the codebases have diverged.

Therefore, to use the PIT features we will need to upgrade our ES instance.
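For reference, a minimal sketch of what PIT iteration looks like once it is available (again, not the code in this PR). A point in time is opened, every page request carries the `pit` id instead of an index name, and the PIT is closed at the end. The names `iterate_with_pit`, `open_pit`, `search` and `close_pit` are hypothetical callables standing in for the corresponding client calls; the request and response shapes follow the ES 7.10 point in time API.

```python
from typing import Any, Callable, Dict, Iterator, List


def iterate_with_pit(
    open_pit: Callable[[str], str],                       # hypothetical: opens a PIT, returns its id
    search: Callable[[Dict[str, Any]], Dict[str, Any]],   # hypothetical: runs a search with the given body
    close_pit: Callable[[str], None],                     # hypothetical: closes the PIT with the given id
    query: Dict[str, Any],
    sort: List[Any],
    page_size: int = 1000,
    keep_alive: str = "1m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query` from a single frozen view of the index."""
    pit_id = open_pit(keep_alive)
    try:
        search_after = None
        while True:
            body: Dict[str, Any] = {
                "query": query,
                "sort": sort,
                "size": page_size,
                # Each request extends the PIT's lifetime by keep_alive
                "pit": {"id": pit_id, "keep_alive": keep_alive},
            }
            if search_after is not None:
                body["search_after"] = search_after

            response = search(body)
            # The id can change between requests; always carry the latest one forward
            pit_id = response.get("pit_id", pit_id)

            hits = response["hits"]["hits"]
            if not hits:
                return
            for hit in hits:
                yield hit
            search_after = hits[-1]["sort"]
    finally:
        # Release the server-side resources even if iteration stops early
        close_pit(pit_id)
```

Since every page reads from the same point in time, writes made during the export do not affect the output, which is what the default search_after mode cannot guarantee.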

Steven-Eardley and others added 4 commits February 24, 2025 17:45
* Stub out the point in time scroll in dao.py
* Replace the hashlib.sha256 function with a faster non-cryptographic hash in anon.py
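To illustrate the kind of swap the second commit describes: the commit message does not say which hash was chosen, so the sketch below uses `zlib.crc32` from the standard library purely as a stand-in. The function names `sha256_token` and `fast_token` are hypothetical and are not taken from anon.py.

```python
import hashlib
import zlib


def sha256_token(value: str) -> str:
    """Cryptographic digest of the value: 64 hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def fast_token(value: str) -> str:
    """Non-cryptographic stand-in (CRC-32): 8 hex characters, cheaper to compute.

    Assumption: the tokens only need to be stable and reasonably distinct,
    not resistant to deliberate reversal or collision attacks.
    """
    checksum = zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF
    return format(checksum, "08x")
```

A 32-bit checksum will collide far more often than SHA-256 on a large data set, so a wider non-cryptographic hash may be the better fit in practice; the trade-off being made is speed against collision and reversal resistance.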
Contributor

@Steven-Eardley Steven-Eardley left a comment

Looks OK. Minor comments.

2 participants