Merge and deploy re-architecture branch #6191
Comments
Estimate database storage requirements for high-activity courses, old vs new system (10-day timeslices)

There are currently two high-activity courses already ingested that I'm using to estimate database storage requirements.

1. Monthly Outstanding Editor Recognition

Data-rearchitecture link

*The data-rearchitecture version has many more articles because it has duplicate articles courses records. This is because several processes were ingesting the course simultaneously. However, it should not affect the storage for timeslices (even

Production link

Data storage

For course 10002 (data-rearchitecture instance)

For course 27800 (production)

Number of Rows: estimated based on total edits metric

(2) The estimate is made on the basis of the following data:

(2) is a more accurate option

Index storage

I'm unsure how to calculate the storage used by indexes for a specific course. My first naive estimation is to compute the index-to-data ratio for each table (i.e., the proportion of space occupied by indexes relative to data) and then multiply that ratio by the amount of data for the course. However, since the indexes use a B-Tree structure, I suspect the ratio doesn't scale exactly in that manner.

Timeslices tables have 5 indexes (beyond PRIMARY):
Query executed on
act index/data ratio: 35061760/25739264 = 1.36

Index storage for course 10002 = ratio * data storage = (1.36 + 0.18 + 0.61) * 7.80 = 16.77 MB

Revisions table has 4 indexes (beyond PRIMARY)
(1) Query executed on information_schema.tables in the Wiki Education database
index/data ratio: 81969152/570228736 = 0.14

(2) Query executed on information_schema.tables in the outreachdashboard database
index/data ratio: 2022522880/89261932544 = 0.02

(2) is a more accurate option

Conclusion

Production: Total storage for course 27800 with estimation (1) = 158.08 + 22.13 = 180.21 MB

Data-rearchitecture instance:

2. WikiConecta

Data-rearchitecture link

Production link

Data storage
For course 10000 (data-rearchitecture instance)

For course 25877 (production)

Number of Rows: estimated based on total edits metric

(2) The estimate is made on the basis of the following data:

(2) is a more accurate option

Index storage

I'm unsure how to calculate the storage used by indexes for a specific course. My first naive estimation is to compute the index-to-data ratio for each table (i.e., the proportion of space occupied by indexes relative to data) and then multiply that ratio by the amount of data for the course. However, since the indexes use a B-Tree structure, I suspect the ratio doesn't scale exactly in that manner.

Timeslices tables have 5 indexes (beyond PRIMARY):
Query executed on
act index/data ratio: 35061760/25739264 = 1.36

Index storage for course 10000 = ratio * data storage = (1.36 + 0.18 + 0.61) * 10.74 = 23.1 MB

Revisions table has 4 indexes (beyond PRIMARY)
(1) Query executed on information_schema.tables in the Wiki Education database
index/data ratio: 81969152/570228736 = 0.14

(2) Query executed on information_schema.tables in the outreachdashboard database
index/data ratio: 2022522880/89261932544 = 0.02

(2) is a more accurate option

Conclusion

Production: Total storage for course 25877 with estimation (1) = 166.4 + 23.3 = 189.7 MB

Data-rearchitecture instance:
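The naive index estimate used in the comment above (a table-wide index/data ratio multiplied by the course's data size) can be sketched in Python as follows. The byte counts, ratios, and per-course data sizes are copied from the comment; the function and variable names are hypothetical and are not part of the dashboard codebase. As the comment itself notes, B-Tree indexes may not scale linearly with data size, so this is only a rough approximation.

```python
def index_to_data_ratio(index_bytes: int, data_bytes: int) -> float:
    """Table-wide proportion of index storage relative to data storage."""
    return index_bytes / data_bytes


def estimate_index_mb(ratio: float, course_data_mb: float) -> float:
    """Naive estimate: assume a course's indexes scale with its data."""
    return ratio * course_data_mb


# Timeslices tables: the comment sums three per-table ratios and applies
# the sum to the course's timeslice data size. Only the first ratio's byte
# counts appear in the comment; 0.18 and 0.61 are taken as given.
act_ratio = round(index_to_data_ratio(35_061_760, 25_739_264), 2)
timeslice_ratio_sum = act_ratio + 0.18 + 0.61

timeslice_index_10002 = estimate_index_mb(timeslice_ratio_sum, 7.80)
timeslice_index_10000 = estimate_index_mb(timeslice_ratio_sum, 10.74)

# Revisions table: ratio (1) from the Wiki Education database,
# ratio (2) from the outreachdashboard database.
revisions_ratio_1 = round(index_to_data_ratio(81_969_152, 570_228_736), 2)
revisions_ratio_2 = round(index_to_data_ratio(2_022_522_880, 89_261_932_544), 2)

# Production totals with estimation (1): data storage plus estimated index storage.
total_27800 = 158.08 + estimate_index_mb(revisions_ratio_1, 158.08)
total_25877 = 166.4 + estimate_index_mb(revisions_ratio_1, 166.4)

print(f"Timeslice index storage, course 10002: {timeslice_index_10002:.2f} MB")
print(f"Timeslice index storage, course 10000: {timeslice_index_10000:.1f} MB")
print(f"Production total, course 27800: {total_27800:.2f} MB")
print(f"Production total, course 25877: {total_25877:.1f} MB")
```

Swapping `revisions_ratio_1` for `revisions_ratio_2` gives the totals under estimation (2), which the comment considers the more accurate option.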
We are nearing completion of the data re-architecture work: https://github.com/WikiEducationFoundation/WikiEduDashboard/tree/data-rearchitecture-for-dashboard
This is the largest change to the update system we've ever made, so we should do as much as we can to roll it out safely and reversibly.
Preparation for dashboard.wikiedu.org rollout
NOTE: Modification to the existing ArticlesCourses table is the only migration to worry about. The P&E database has 15.5 million ArticlesCourses rows.
dashboard.wikiedu.org rollout
TIMESLICE_DURATION setting to 1 day (TIMESLICE_DURATION: '86400') in application.yml

outreachdashboard.wmflabs.org rollout
TIMESLICE_DURATION setting to 1 day (TIMESLICE_DURATION: '86400') in application.yml

After rollout
TIMESLICE_DURATION for courses when updates take a long time