Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge and deploy re-architecture branch #6191

Open
4 of 28 tasks
ragesoss opened this issue Feb 10, 2025 · 1 comment
Open
4 of 28 tasks

Merge and deploy re-architecture branch #6191

ragesoss opened this issue Feb 10, 2025 · 1 comment

Comments

@ragesoss
Copy link
Member

ragesoss commented Feb 10, 2025

We are nearing completion of the data re-architecture work: https://github.com/WikiEducationFoundation/WikiEduDashboard/tree/data-rearchitecture-for-dashboard

This is the largest change to the update system we've ever made, so we should do as much as we can to roll it out safely and reversibly.

Preparation for dashboard.wikiedu.org rollout

  • Test migrations against a recent copy of p&e database.
    NOTE: Modification to existing ArticlesCourses tab is the only migration to worry about. P&E database has 15.5 million ArticlesCourses rows.
  • Open PR to merge data-rearchitecture-for-dashboard
  • Sage reviews PR
  • Estimate database storage requirements for high-activity courses, old vs new system
  • Write message for users about changes and where to report problems

dashboard.wikiedu.org rollout

  • Add message about changes
  • add TIMESLICE_DURATION setting to 1 day (TIMESLICE_DURATION: '86400') in application.yml
  • Pause course updates, pause constant updates, wait for workers to complete current updates
  • Save a server snapshot
  • Clear the course update queues
  • Merge and deploy. (If migration will be slow, run migration manually before Capistrano deployment.)
  • manually add current term courses, ongoing Scholars & Scientists courses, to a queue, enable worker for that queue
  • wait for that queue to process all current courses
  • Monitor for problems for at least several days

outreachdashboard.wmflabs.org rollout

  • Add message about changes
  • add TIMESLICE_DURATION setting to 1 days (TIMESLICE_DURATION: '86400') in application.yml
  • Pause course updates, pause constant updates, wait for workers to complete current updates
  • Save a database dump
  • Clear the course update queues
  • Run migration manually, server-side.
  • Merge and deploy to web server
  • manually add current courses, ordered from least to most revisions, to appropriate queues
  • update sidekiq process servers, enable queue workers one at a time, monitoring for problems

After rollout

  • monitor database usage
  • monitor server resource utilization
  • monitor course update times. Reduce TIMESLICE_DURATION for courses when updates take a long time
  • monitor Sentry logs
  • begin cleanup of now-unused code
@gabina
Copy link
Member

gabina commented Feb 12, 2025

Estimate database storage requirements for high-activity courses, old vs new system (10-day timeslices)

There are currently two high-activity courses already ingested that I'm using to estimate database storage requirements.

1. Monthly Outstanding Editor Recognition

Data-rearchitecture link
ID: 10002
93.2K* articles, 100 editors, 151K edits

*The data-rearchitecture version has much more articles because it has duplicate articles courses records. This is because there were several processes ingesting the course simultaneously. However, it should not affect the storage for timeslices (even article_course_timeslices depends on article_id not on article_course_id).

Production link
ID: 27800
64.3K articles, 100 editors, 152K edits

Data storage

┌───────┬───────────────────────────────┬───────────────┬──────────────────────┬──────────────┐
│ Course│ Table Name                    │ Number of Rows│ Avg Row Length (B)   │ Total Space (MB) │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│ 10002 │ article course timeslice      │  78,154       │  97                  │  7.58        │
│ 10002 │ course wiki timeslice         │  222          │  443                 │  0.10        │
│ 10002 │ course user wiki timeslice    │  1,096        │  114                 │  0.12        │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│       │ TOTAL                         │               │                      │  7.80        │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│ 27800 │ revisions                     │ 152,000 (1)   │  1,040 (1)           │  158.08 (1)  │
├───────┴───────────────────────────────├───────────────├──────────────────────├──────────────┤
│ 27800 │ revisions                     │ 374,043 (2)   │  2,654 (2)           │  992.71 (2)  │
└───────┴───────────────────────────────┴───────────────┴──────────────────────┴──────────────┘

For course 10002 (data-rearchitecture instance)
Number of Rows: calculated using COUNT(*)
Avg Row Length: calculated via information_schema.tables

For course 27800 (production)
(1) The estimate is made on the basis of the following data:

Number of Rows: estimated based on total edits metric
Avg Row Length: calculated via information_schema.tables in wiki education db (I didn't have access to outreachdashboard db)

(2) The estimate is made on the basis of the following data:
Number of Rows: calculated using COUNT(*) in outreachdashboard db.
Avg Row Length: calculated via information_schema.tables in outreachdashboard db.

(2) is a more accurate option

Index storage

I'm unsure how to calculate the storage used by indexes for a specific course. My first naive estimation is to compute the index-to-data ratio for each table (i.e., the proportion of space occupied by indexes relative to data) and then multiply that ratio by the amount of data for the course. However, since the indexes use a B-Tree structure, I suspect the ratio doesn't scale exactly in that manner.

Timeslices tables have 5 indexes (beyond PRIMARY):

  • article_course_timeslice_by_article_course_start_and_end
  • article_course_timeslice_by_tracked
  • article_course_timeslice_by_updated_at
  • course_user_wiki_timeslice_by_course_user_wiki_start_and_end
  • course_wiki_timeslice_by_course_wiki_start_and_end

Query executed on information_schema.tables in data-reachitecture instance db

----------------------------+-----------+------------+----------+--------------+
|table_name                 |data_length|index_length|table_rows|avg_row_length|
+---------------------------+-----------+------------+----------+--------------+
|article_course_timeslices  |   25739264|    35061760|    263822|            97|
+---------------------------+-----------+------------+----------+--------------+
|course_wiki_timeslices     |     458752|       81920|      1250|           367|
+---------------------------+-----------+------------+----------+--------------+
|course_user_wiki_timeslices|     376832|      229376|      3308|           113|
----------------------------+-----------+------------+----------+--------------+

act index/data ratio: 35061760/25739264 = 1.36
cwt index/data ratio: 81920/458752 = 0.18
cuwt index/data ratio: 229376/376832 = 0.61

Index storage for course 10002 = ratio * data storage = (1.36 + 0.18 + 0.61) * 7.80 = 16.77 MB

Revisions table has 4 index (beyond PRIMARY)

  • index_revisions_on_article_id
  • index_revisions_on_article_id_and_date
  • index_revisions_on_user_id
  • index_revisions_on_wiki_id_and_mw_rev_id

(1) Query executed on information_schema.tables in the Wiki Education database

+------------+-------------+--------------+------------+----------------+
| table_name | data_length | index_length | table_rows | avg_row_length |
+------------+-------------+--------------+------------+----------------+
| revisions  |   570228736 |     81969152 |     547811 |           1040 |
+------------+-------------+--------------+------------+----------------+

index/data ratio: 81969152/570228736 = 0.14
Index storage for course 27800 with estimation (1) = ratio * data storage = 0.14 * 158.08 = 22.13 MB

(2) Query executed on information_schema.tables in the outreachdashboard database

+------------+-------------+--------------+------------+----------------+
| table_name | data_length | index_length | table_rows | avg_row_length |
+------------+-------------+--------------+------------+----------------+
| revisions  | 89261932544 |   2022522880 |   33626681 |           2654 |
+------------+-------------+--------------+------------+----------------+

index/data ratio: 2022522880/89261932544 = 0.02
Index storage for course 27800 with estimation (2) = ratio * data storage = 0.02 * 992.71 = 19.85 MB

(2) is a more accurate option
Index storage for course 27800 = ratio * data storage = 0.02 * 992.71 = 19.85 MB

Conclusion

Production:

Total storage for course 27800 with estimation (1) = 158.08 + 22.13 = 180.21 MB
Total storage for course 27800 with estimation (2) = 992.71 + 19.85 = 1012.56 MB

Data-rearchitecture instance:
Total storage for course 10002 = 7.80 + 22.13 = 29.93 MB

2. WikiConecta

Data-rearchitecture link
ID: 10000
98.8K articles, 49 editors, 162K edits

Production link
ID: 25877
98.2K articles, 51 editors, 160K edits

Data storage

┌───────┬───────────────────────────────┬───────────────┬──────────────────────┬──────────────┐
│ Course│ Table Name                    │ Number of Rows│ Avg Row Length (B)   │ Total Space (MB) │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│ 10000 │ article course timeslice      │  108,042      │  97                  │  10.48       │
│ 10000 │ course wiki timeslice         │  332          │  443                 │  0.15        │
│ 10000 │ course user wiki timeslice    │  925          │  114                 │  0.11        │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│       │ TOTAL                         │               │                      │  10.74       │
├───────┼───────────────────────────────┼───────────────┼──────────────────────┼──────────────┤
│ 25877 │ revisions                     │ 160,000 (1)   │  1,040 (1)           │  166.4  (1)  │
├───────┴───────────────────────────────├───────────────├──────────────────────├──────────────┤
│ 25877 │ revisions                     │ 160,503 (2)   │  2,654 (2)           │  425.97 (2)  │
└───────┴───────────────────────────────┴───────────────┴──────────────────────┴──────────────┘

For course 10000 (data-rearchitecture instance)
Number of Rows: calculated using COUNT(*)
Avg Row Length: calculated via information_schema.tables

For course 25877 (production)
(1) The estimate is made on the basis of the following data:

Number of Rows: estimated based on total edits metric
Avg Row Length: calculated via information_schema.tables in wiki education db (I didn't have access to outreachdashboard db)

(2) The estimate is made on the basis of the following data:
Number of Rows: calculated using COUNT(*) in outreachdashboard db.
Avg Row Length: calculated via information_schema.tables in outreachdashboard db.

(2) is a more accurate option

Index storage

I'm unsure how to calculate the storage used by indexes for a specific course. My first naive estimation is to compute the index-to-data ratio for each table (i.e., the proportion of space occupied by indexes relative to data) and then multiply that ratio by the amount of data for the course. However, since the indexes use a B-Tree structure, I suspect the ratio doesn't scale exactly in that manner.

Timeslices tables have 5 indexes (beyond PRIMARY):

  • article_course_timeslice_by_article_course_start_and_end
  • article_course_timeslice_by_tracked
  • article_course_timeslice_by_updated_at
  • course_user_wiki_timeslice_by_course_user_wiki_start_and_end
  • course_wiki_timeslice_by_course_wiki_start_and_end

Query executed on information_schema.tables in data-reachitecture instance db

----------------------------+-----------+------------+----------+--------------+
|table_name                 |data_length|index_length|table_rows|avg_row_length|
+---------------------------+-----------+------------+----------+--------------+
|article_course_timeslices  |   25739264|    35061760|    263822|            97|
+---------------------------+-----------+------------+----------+--------------+
|course_wiki_timeslices     |     458752|       81920|      1250|           367|
+---------------------------+-----------+------------+----------+--------------+
|course_user_wiki_timeslices|     376832|      229376|      3308|           113|
----------------------------+-----------+------------+----------+--------------+

act index/data ratio: 35061760/25739264 = 1.36
cwt index/data ratio: 81920/458752 = 0.18
cuwt index/data ratio: 229376/376832 = 0.61

Index storage for course 10000 = ratio * data storage = (1.36 + 0.18 + 0.61) * 10.74 = 23.1 MB

Revisions table has 4 index (beyond PRIMARY)

  • index_revisions_on_article_id
  • index_revisions_on_article_id_and_date
  • index_revisions_on_user_id
  • index_revisions_on_wiki_id_and_mw_rev_id

(1) Query executed on information_schema.tables in the Wiki Education database

+------------+-------------+--------------+------------+----------------+
| table_name | data_length | index_length | table_rows | avg_row_length |
+------------+-------------+--------------+------------+----------------+
| revisions  |   570228736 |     81969152 |     547811 |           1040 |
+------------+-------------+--------------+------------+----------------+

index/data ratio: 81969152/570228736 = 0.14
Index storage for course 25877 with estimation (1) = ratio * data storage = 0.14 * 166.4 = 23.3 MB

(2) Query executed on information_schema.tables in the outreachdashboard database

+------------+-------------+--------------+------------+----------------+
| table_name | data_length | index_length | table_rows | avg_row_length |
+------------+-------------+--------------+------------+----------------+
| revisions  | 89261932544 |   2022522880 |   33626681 |           2654 |
+------------+-------------+--------------+------------+----------------+

index/data ratio: 2022522880/89261932544 = 0.02
Index storage for course 25877 with estimation (2) = ratio * data storage = 0.02 * 425.97 = 8.52 MB

(2) is a more accurate option
Index storage for course 25877 = ratio * data storage = 0.02 * 425.97 = 8.52 MB

Conclusion

Production:

Total storage for course 25877 with estimation (1) = 166.4 + 23.3 = 189.7 MB
Total storage for course 25877 with estimation (2) = 425.97 + 8.52 = 434.49 MB

Data-rearchitecture instance:
Total storage for course 10002 = 10.74 + 23.1 = 33.84 MB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants