Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Outreachy Round 27] Refactor LiftWingApi and RevisionScoreImporter #5578

Conversation

gabina
Copy link
Member

@gabina gabina commented Jan 9, 2024

What this PR does

This PR is part of the "Improve how Wiki Education Dashboard counts references added" project (read issue #5547).

It makes some refactors on LiftWingApi and RevisionScoreImporter classes.

The RevisionScoreImporter is responsible for populating fields such as wp10, features, wp10_previous, features_previous, and deleted in the revisions. Currently, this involves using the LiftWing API to retrieve data. However, as part of the "Improve how Wiki Education Dashboard counts references added" project, we also need to use the new reference-counter API (see #5565).

The primary idea is to maintain the RevisionScoreImporter as the class responsible for populating all revision score fields, regardless of whether the values come from the LiftWing API or the Toolforge reference-counter API. This approach offers benefits, as there is valuable logic that should be shared in the population process, irrespective of the data source. For example, the get_parent_revisions method retrieves parent revision IDs, necessary for updating both wp10_previous (from the LiftWing API) and features_previous (from the Toolforge reference-counter API).

Before this PR, the importer contained a lot of logic specific to the articlequality model data and how to process it (such as calculating the wp10 value from the scores).

The purpose of this refactor is to move elements too specific to the LiftWing API into the existing LiftWingApi class. This way, RevisionScoreImporter becomes a class with the logic to populate fields from both LiftWingApi and ReferenceCounterApi.

Open questions and concerns

< anything you learned that you want to share, or questions you're wondering about related to this PR >

@gabina gabina marked this pull request as draft January 9, 2024 18:46
revision.wp10 = score.dig('wp10')
revision.features = score.dig('features')
# only modify the existing deleted value if revision was deleted
revision.deleted = true if score.dig('deleted')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec modified in c045872 was to be sure that revision.deleted is not updated if score.dig('deleted') isn't set to true. If score.dig('deleted') is nil (because score is the empty hash {}), then revision.deleted is not updated.

Comment on lines 30 to 38
# Given an array of revision ids, it returns an array with useful metrics for those
# revision ids.
# Format result example:
# { 'rev_id0' => { 'wp10' => 0.2915228958136511656e2, 'features' => features_value,
# 'deleted' => false, 'prediction' => 'Stub' }
# ...
# 'rev_idn' => { 'wp10' => 0.285936675221734978e2, 'features' => features_value,
# 'deleted' => false, 'prediction' => 'D' }
# }
Copy link
Member Author

@gabina gabina Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this as I don't think there is any other way to specify types in ruby
There is a mistake though: Given an array of revision ids, it returns a hash with useful metrics for those

# https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing
class LiftWingApi
class LiftWingApi # rubocop:disable Metrics/ClassLength
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to disable the rule because the class is too long.
One option could be to move the WEIGHTING constants outside the class? Other thing that would take more effort would be to split the LfitWingApi class in two different classes: one to get data form the api, and other one to process it (calculating the wp10 for example).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The better place to do this is .rubocop.yml so that we have a central list of all the too-big classes.

def get_single_revision_data(rev_id)
private

# Returns a has with wp10, features, deleted, and prediction, or empty hash if
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here: Returns a hash

body = { rev_id:, extended_output: true }.to_json
response = lift_wing_server.post(quality_query_url, body)
parsed_response = Oj.load(response.body)

return equivalent_ores_error_response(rev_id, parsed_response) if parsed_response.key? 'error'
# If the responses contains an error, do not try to calculate wp10 or features.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here: If the responses contain

@@ -57,22 +57,6 @@
end
end

it 'saves wp10 scores by article' do
VCR.use_cassette 'revision_scores/by_article' do
described_class.update_revision_scores_for_all_wikis
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all the specs calling update_revision_scores_for_all_wikis because removed the method.

@gabina gabina marked this pull request as ready for review January 11, 2024 21:00
@gabina gabina changed the title [WIP] [Outreachy Round 27] Refactor LiftWingApi and RevisionScoreImporter [Outreachy Round 27] Refactor LiftWingApi and RevisionScoreImporter Jan 11, 2024
@gabina gabina requested a review from ragesoss January 11, 2024 21:01
'error' => { 'message' => error_response['error'], 'type' => type }
} } } } }
end

# TODO: monitor production for errors, understand them, put benign ones here
TYPICAL_ERRORS = [].freeze

def log_error_batch(rev_ids)
return if @errors.empty?

log_error(@errors.first, update_service: @update_service,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that we always log the first error, even when there might be multiple distinct errors. I guess we work under the assumption that if several requests fail, it's likely for the same reason, which may hold true in most cases. However, I suggest resetting the @errors array in get_revision_data. Without this adjustment, if there are two batches of requests, each with a single request failing, we would log the first error twice and miss the second one entirely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done on cde5e1d

@weighting ||= WEIGHTING_BY_LANGUAGE[@wiki.language]
end

def weighted_mean_score(score)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this whole set of weighting code might make more sense as a separate file, since turning articlequality predictions into a single score is not a concept inherent to the LiftWing API.

Perhaps it can be extracted to a module that LiftWingApi includes (similar to ConstantUpdate and its included module BatchUpdateLogging).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea! Added the new module on d0dfb96, and modified the LiftWingApi class to include the module on 18c7a40.
Didn't add specific specs for the new module because it's only used in the LiftWingApi, and the LiftWingApi
specs already check that the weighted mean score makes sense.

@ragesoss
Copy link
Member

Overall this looks solid. I left a couple of suggestions; other than that, it looks ready to me. @Aminehassou would you like to have a look as well?

Copy link
Contributor

@Aminehassou Aminehassou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of the things Sage already mentioned, LGTM!

@@ -70,15 +75,24 @@
it 'handles timeout errors' do
stub_request(:any, 'https://api.wikimedia.org')
.to_raise(Errno::ETIMEDOUT)
expect_any_instance_of(described_class).to receive(:log_error).once
expect(lift_wing_api_class_en_wiki).to receive(:log_error).once
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to avoid using expect_any_instance_of here, as the spec then makes statements on subject0, which is a particular instance. I don't like making expectations to any instance and to a particular instance in the same spec, I think it's confusing.

@gabina gabina requested a review from ragesoss January 17, 2024 16:40
@ragesoss ragesoss merged commit b99cb4f into WikiEducationFoundation:master Jan 17, 2024
1 check passed
@gabina gabina deleted the 5547-outreachy-round-27-refactor-revision-score-importer-and-liftwing-api branch January 17, 2024 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants