feat(twitter): Enhanced Twitter Worker Selection Algorithm #591

Merged
merged 7 commits into from
Oct 15, 2024

teslashibe
Contributor

Description

This PR implements an improved worker selection algorithm for Twitter tasks in the Masa Oracle project. The goal is to balance two aims: prioritizing high-performing workers and distributing work fairly across eligible nodes.

Key Changes

1. Modified GetEligibleWorkers function (pkg/workers/worker_selection.go)

  • Now uses a specialized selection process for the Twitter category (pubsub.CategoryTwitter)
  • Maintains existing behavior for all other worker categories

2. New getTwitterWorkers function (pkg/workers/worker_selection.go)

  • Selects a larger pool of top-performing nodes
  • Introduces controlled randomness by shuffling the pool of top performers
  • Creates Worker objects from the shuffled pool, respecting the original limit

3. New calculatePoolSize function (pkg/workers/worker_selection.go)

  • Determines the optimal pool size as the maximum of:
    • 5 (minimum pool size)
    • Double the requested limit
    • 20% of total nodes
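
The pool-size rule above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the parameter names, the integer division used for the 20% term, and the absence of a cap at the total node count are all assumptions.

```go
package main

import "fmt"

// minPoolSize is the minimum pool size named in the description.
const minPoolSize = 5

// calculatePoolSize returns the largest of three candidates: the minimum
// pool size, double the requested limit, and 20% of the total node count.
// Assumption: the 20% term uses integer division (rounds down).
func calculatePoolSize(totalNodes, limit int) int {
	size := minPoolSize
	if doubled := limit * 2; doubled > size {
		size = doubled
	}
	if fifth := totalNodes / 5; fifth > size {
		size = fifth
	}
	return size
}

func main() {
	fmt.Println(calculatePoolSize(10, 1))  // minimum dominates
	fmt.Println(calculatePoolSize(30, 8))  // double the limit dominates
	fmt.Println(calculatePoolSize(100, 3)) // 20% of nodes dominates
}
```

Note that the result can exceed the number of nodes actually available (for example 10 nodes with a limit of 8), so a caller would need to clamp it before slicing.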

4. New SortNodesByTwitterReliability function (pkg/pubsub/node_event_tracker.go)

  • Sorts nodes based on their Twitter reliability using multiple criteria:
    1. More recent last returned tweet
    2. Higher number of returned tweets
    3. Longer time since last timeout
    4. Lower number of timeouts
    5. Earlier last not found time
    6. PeerId (for stable sorting when no performance data is available)

5. Updated NodeEventTracker.GetEligibleWorkerNodes (pkg/pubsub/node_event_tracker.go)

  • Now uses SortNodesByTwitterReliability for the Twitter category

6. Enhanced logging

  • Added informative log messages using logrus for debugging and monitoring

Implementation Details

Twitter Worker Selection Process

  1. Get eligible worker nodes from NodeTracker.GetEligibleWorkerNodes(category)
  2. For Twitter category:
    a. Calculate pool size using calculatePoolSize
    b. Select top performers based on the calculated pool size
    c. Shuffle the selected top performers
    d. Create Worker objects from the shuffled pool, respecting the original limit
  3. For other categories:
    • Return all eligible workers without modification
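
The selection steps above might look roughly like the sketch below. The `Node` type, the `categoryTwitter` constant, the `eligibleNodes` and `selectTwitterNodes` names, and the injectable shuffle function are hypothetical stand-ins chosen to keep the example self-contained and testable; the real code works with the project's own node and Worker types.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Node is a hypothetical stand-in for the project's node record.
type Node struct{ PeerID string }

// categoryTwitter is a hypothetical stand-in for pubsub.CategoryTwitter.
const categoryTwitter = "twitter"

// shuffleFunc matches the signature of rand.Shuffle so tests can inject
// a deterministic permutation.
type shuffleFunc func(n int, swap func(i, j int))

// poolSize applies the rule from the description: the maximum of 5,
// double the limit, and 20% of the nodes (integer division assumed).
func poolSize(totalNodes, limit int) int {
	size := 5
	if limit*2 > size {
		size = limit * 2
	}
	if totalNodes/5 > size {
		size = totalNodes / 5
	}
	return size
}

// selectTwitterNodes expects nodes sorted best-first. It keeps the top
// pool, shuffles that pool, and returns at most limit nodes from it.
func selectTwitterNodes(sorted []Node, limit int, shuffle shuffleFunc) []Node {
	if limit <= 0 || len(sorted) == 0 {
		return nil
	}
	size := poolSize(len(sorted), limit)
	if size > len(sorted) {
		size = len(sorted) // clamp to the nodes that exist
	}
	pool := make([]Node, size)
	copy(pool, sorted[:size]) // copy so the caller's slice is not reordered
	shuffle(len(pool), func(i, j int) { pool[i], pool[j] = pool[j], pool[i] })
	if limit < len(pool) {
		pool = pool[:limit]
	}
	return pool
}

// eligibleNodes uses the Twitter-specific path only for the Twitter
// category and returns every other category unchanged.
func eligibleNodes(category string, sorted []Node, limit int, shuffle shuffleFunc) []Node {
	if category == categoryTwitter {
		return selectTwitterNodes(sorted, limit, shuffle)
	}
	return sorted
}

// makeNodes builds n placeholder nodes named node-00, node-01, ...
func makeNodes(n int) []Node {
	nodes := make([]Node, n)
	for i := range nodes {
		nodes[i] = Node{PeerID: fmt.Sprintf("node-%02d", i)}
	}
	return nodes
}

func main() {
	picked := eligibleNodes(categoryTwitter, makeNodes(50), 3, rand.Shuffle)
	for _, n := range picked {
		fmt.Println(n.PeerID) // three random nodes from the top ten
	}
}
```

With 50 nodes and a limit of 3, the pool holds the top 10 nodes, so each request draws 3 workers at random from those 10 rather than always hitting the same top 3.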

Node Sorting for Twitter Reliability

The SortNodesByTwitterReliability function uses a multi-criteria approach to rank nodes:

  1. Prioritizes nodes with more recent last returned tweet
  2. Then by higher number of returned tweets
  3. Considers the time since last timeout (longer time is better)
  4. Then by lower number of timeouts
  5. Deprioritizes nodes with more recent last not found time
  6. Finally, sorts by PeerId for stability when no performance data is available
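
One way to express this ordering is a comparator that falls through the criteria in turn. The sketch below is illustrative only: the `NodeStats` type and its field names are assumptions, not the project's actual node data structure, and it treats a zero timestamp as "never happened".

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// NodeStats is a hypothetical record; field names are assumptions.
type NodeStats struct {
	PeerID            string
	LastReturnedTweet time.Time
	ReturnedTweets    int
	LastTimeout       time.Time
	TimeoutCount      int
	LastNotFound      time.Time
}

// at is a small helper that builds a timestamp from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// sortByTwitterReliability orders nodes best-first, applying each
// criterion only when all earlier ones tie.
func sortByTwitterReliability(nodes []NodeStats) {
	sort.SliceStable(nodes, func(i, j int) bool {
		a, b := nodes[i], nodes[j]
		switch {
		case !a.LastReturnedTweet.Equal(b.LastReturnedTweet):
			return a.LastReturnedTweet.After(b.LastReturnedTweet) // more recent tweet first
		case a.ReturnedTweets != b.ReturnedTweets:
			return a.ReturnedTweets > b.ReturnedTweets // more tweets first
		case !a.LastTimeout.Equal(b.LastTimeout):
			return a.LastTimeout.Before(b.LastTimeout) // older (or no) timeout first
		case a.TimeoutCount != b.TimeoutCount:
			return a.TimeoutCount < b.TimeoutCount // fewer timeouts first
		case !a.LastNotFound.Equal(b.LastNotFound):
			return a.LastNotFound.Before(b.LastNotFound) // older (or no) not-found first
		default:
			return a.PeerID < b.PeerID // deterministic order with no data
		}
	})
}

func main() {
	nodes := []NodeStats{
		{PeerID: "c"},
		{PeerID: "b", LastReturnedTweet: at(100), ReturnedTweets: 5},
		{PeerID: "a", LastReturnedTweet: at(200), ReturnedTweets: 1},
	}
	sortByTwitterReliability(nodes)
	for _, n := range nodes {
		fmt.Println(n.PeerID)
	}
}
```

Because a zero `time.Time` sorts before any real timestamp, a node that has never timed out ranks ahead of one that has, and a node that has never returned a tweet ranks last on the first criterion.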

Benefits

  • More efficient utilization of high-performing Twitter workers
  • Fairer distribution of work among eligible nodes
  • Improved overall system performance for Twitter-related tasks
  • Maintained existing functionality for non-Twitter worker categories

TODO

Testing

  • Added unit tests for the new implementation, including edge cases
  • Ensured all existing tests pass
  • Verified linting checks are successful

restevens402 and others added 7 commits October 10, 2024 08:30
Enhanced the worker manager to append specific error messages to a list for better debugging. Additionally, updated node data to track the last update time, improving data consistency and traceability.

- Remove Retry function and MaxRetries constant from config.go
- Update ScrapeFollowersForProfile, ScrapeTweetsProfile, and ScrapeTweetsByQuery
  to remove Retry wrapper
- Adjust error handling in each function to directly return errors
- Simplify code structure and reduce complexity
- Maintain rate limit handling functionality

- Prioritize nodes with more recent last returned tweets
- Maintain high importance for total returned tweet count
- Consider time since last timeout to allow recovery from temporary issues
- Deprioritize nodes with recent "not found" occurrences
- Remove NotFoundCount from sorting criteria

This change aims to better balance node performance and recent activity,
while allowing nodes to recover quickly from temporary issues like rate limiting.

- Modify GetEligibleWorkers to use a specialized selection for Twitter workers
- Introduce controlled randomness in Twitter worker selection
- Balance between prioritizing high-performing Twitter workers and fair distribution
- Maintain existing behavior for non-Twitter worker selection
- Preserve handling of local worker and respect original worker limit

This change enhances the worker selection algorithm for Twitter tasks to provide
a better balance between utilizing top-performing nodes and ensuring fair work
distribution. It introduces a dynamic pool size calculation and controlled
randomness for Twitter workers, while maintaining the existing round-robin
approach for other worker types.

@teslashibe teslashibe self-assigned this Oct 11, 2024
@teslashibe teslashibe marked this pull request as ready for review October 11, 2024 18:09
Contributor

@restevens402 restevens402 left a comment

The change looks great. Love the new twitter selection.

@teslashibe teslashibe merged commit 6f594e1 into main Oct 15, 2024
9 of 11 checks passed
@teslashibe teslashibe deleted the fix-node-data-updates branch October 15, 2024 15:48