recalculate statistics using SQL #696
Conversation
    return cast(Stat | None, stat)

...

async def sql_recalculate_mode(player_id: int, mode: int) -> None:
While I agree this is objectively better from a performance and atomicity perspective, I feel that most developers will have more trouble maintaining this: it's relatively complex SQL, and I suspect procedural code (in Python) is a more common skillset for the devs who work on b.py (and maintainability is the #1 goal for b.py).
We could have a scaled-down version (which only calculates pp and acc), or split the query into smaller ones.
async def sql_recalculate_mode_pp(player_id: int, mode: int) -> None:
    sql = """
        WITH
        ordered_pp AS (
            SELECT
                s1.id scoreId,
                s1.userid,
                s1.mode,
                s1.pp,
                s1.acc
            FROM
                scores s1
                INNER JOIN maps m ON s1.map_md5 = m.md5
            WHERE
                s1.status = 2
                AND m.status IN (2, 3)
                AND s1.pp > 0
                AND s1.mode = :mode
                AND s1.userid = :user_id
            ORDER BY
                s1.pp DESC,
                s1.acc DESC,
                s1.id DESC
        ),
        bests AS (
            SELECT
                scoreId,
                pp,
                acc,
                ROW_NUMBER() OVER (ORDER BY pp DESC) AS global_rank
            FROM
                ordered_pp
        ),
        user_calc AS (
            SELECT
                COUNT(*) AS count,
                SUM(POW(0.95, global_rank - 1) * pp) AS weightedPP,
                (1 - POW(0.9994, COUNT(*))) * 416.6667 AS bnsPP,
                SUM(POW(0.95, global_rank - 1) * acc) / SUM(POW(0.95, global_rank - 1)) AS acc
            FROM
                bests
        ),
        calculated AS (
            SELECT
                *,
                weightedPP + bnsPP AS pp
            FROM
                user_calc
        )
        UPDATE stats s
        INNER JOIN calculated c ON 1
        SET
            s.pp = c.pp,
            s.acc = c.acc
        WHERE s.id = :user_id AND s.mode = :mode
    """
    _ = await app.state.services.database.execute(
        sql, {"user_id": player_id, "mode": mode}
    )

async def sql_recalculate_mode_statistics(player_id: int, mode: int) -> None:
    sql = """
        WITH
        concrete_stats AS (
            SELECT
                COUNT(*) AS plays,
                SUM(s2.score) AS total_score,
                SUM(IF(s2.status IN (2, 3), s2.score, 0)) AS ranked_score,
                SUM(s2.n300 + s2.n100 + s2.n50 + IF(s2.mode IN (1, 3, 5), s2.ngeki + s2.nkatu, 0)) AS total_hits,
                SUM(s2.time_elapsed) AS play_time,
                MAX(s2.max_combo) AS max_combo,
                SUM(s2.grade = "XH") AS xh_count,
                SUM(s2.grade = "X") AS x_count,
                SUM(s2.grade = "SH") AS sh_count,
                SUM(s2.grade = "S") AS s_count,
                SUM(s2.grade = "A") AS a_count
            FROM
                scores s2
                INNER JOIN maps m2 ON s2.map_md5 = m2.md5
            WHERE
                s2.mode = :mode
                AND s2.userid = :user_id
        )
        UPDATE stats s
        INNER JOIN concrete_stats cs ON 1
        SET
            s.tscore = cs.total_score,
            s.rscore = cs.ranked_score,
            s.plays = cs.plays,
            s.playtime = cs.play_time,
            s.max_combo = cs.max_combo,
            s.total_hits = cs.total_hits,
            s.xh_count = cs.xh_count,
            s.x_count = cs.x_count,
            s.sh_count = cs.sh_count,
            s.s_count = cs.s_count,
            s.a_count = cs.a_count
        WHERE s.id = :user_id AND s.mode = :mode
    """
    _ = await app.state.services.database.execute(
        sql, {"user_id": player_id, "mode": mode}
    )

If this is acceptable, we could strike a balance between procedural code & performance.
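For reference, the user_calc CTE above encodes the standard osu! performance weighting; written out from the SQL, it computes:

$$pp = \sum_{i=1}^{N} 0.95^{\,i-1} \cdot pp_i \;+\; 416.6667\left(1 - 0.9994^{N}\right)$$

$$acc = \frac{\sum_{i=1}^{N} 0.95^{\,i-1} \cdot acc_i}{\sum_{i=1}^{N} 0.95^{\,i-1}}$$

where i is a score's global_rank among the player's bests ordered by pp descending, and N is the number of bests.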
Our server ran into performance/memory issues recently. With this update, our score submission speed went from ~5s for a heavy user to ~600ms for a single mode, and a few orders of magnitude better when users play multiplayer (don't forget that we have a global submission thread lock). Not to mention that we also saved a bunch of I/O and memory.
This also comes with added benefits: for example, when a map changes status, rather than setting up complex triggers and handlers and taking care of every potential edge case, statistics will simply be corrected the next time the user submits a score.
my feedback is more that i don't think many devs are familiar with sql, and this uses more advanced features like window functions, CTEs, etc. -- i suspect having it as pure-python is more familiar for most
> This also comes with added benefits: for example, when a map changes status, rather than setting up complex triggers and handlers and taking care of every potential edge case, statistics will simply be corrected the next time the user submits a score.
i don't disagree w/ this, but it can be implemented in pure py as well
i think the most important issue in recalc to fix would be recalculation of score statuses when scores of a user shift around
> my feedback is more that i don't think many devs are familiar with sql, and this uses more advanced features like window functions, CTEs, etc. -- i suspect having it as pure-python is more familiar for most
> > This also comes with added benefits: for example, when a map changes status, rather than setting up complex triggers and handlers and taking care of every potential edge case, statistics will simply be corrected the next time the user submits a score.
> i don't disagree w/ this, but it can be implemented in pure py as well
imo it's too cluttered for something which can be implemented in like 20-50 lines of python, and yeah sql can be confusing at times
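For illustration, a minimal sketch of what that pure-Python version could look like, assuming the same databases-style fetch_all/execute API used elsewhere in this PR; the query mirrors the SQL above, but the function name and structure are illustrative, not bancho.py's actual implementation:

async def py_recalculate_mode_pp(player_id: int, mode: int) -> None:
    # fetch the player's best scores on ranked/approved maps, highest pp first
    rows = await app.state.services.database.fetch_all(
        "SELECT s.pp, s.acc FROM scores s "
        "INNER JOIN maps m ON s.map_md5 = m.md5 "
        "WHERE s.userid = :user_id AND s.mode = :mode "
        "AND s.status = 2 AND m.status IN (2, 3) AND s.pp > 0 "
        "ORDER BY s.pp DESC",
        {"user_id": player_id, "mode": mode},
    )
    # standard osu! weighting: each successive best counts 5% less
    weighted_pp = sum(row["pp"] * 0.95**i for i, row in enumerate(rows))
    bonus_pp = 416.6667 * (1 - 0.9994 ** len(rows))
    weight_sum = sum(0.95**i for i in range(len(rows)))
    acc = (
        sum(row["acc"] * 0.95**i for i, row in enumerate(rows)) / weight_sum
        if rows
        else 0.0
    )
    _ = await app.state.services.database.execute(
        "UPDATE stats SET pp = :pp, acc = :acc "
        "WHERE id = :user_id AND mode = :mode",
        {"pp": weighted_pp + bonus_pp, "acc": acc,
         "user_id": player_id, "mode": mode},
    )

This trades the single atomic statement for two round trips and loading all bests into memory, which is exactly the trade-off being discussed in this thread.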
> But the perf & scalability trade-off is worth the effort, I believe.

it's a good point, let me consider it & review more deeply soon
the issue with memory usage you're having here is because it loads all best scores for a given mode at once; this can be fixed with OFFSET

anyway, pagination is a good idea, as time-to-execute is less of a problem than resource utilization at large scale imo (from akatsuki experience) -- i'm not entirely remembering the technical considerations of this process (e.g. row/table locking w.r.t. transactions, requirements of upfront processing, etc.), so i'll have to refresh myself
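For concreteness, a minimal sketch of the LIMIT/OFFSET batching suggested here (again assuming the databases-style fetch_all API; names are illustrative), which bounds memory to one batch at a time:

async def iter_best_scores(player_id: int, mode: int):
    # hypothetical: stream a player's best scores in fixed-size batches
    batch_size = 1000
    offset = 0
    while True:
        rows = await app.state.services.database.fetch_all(
            "SELECT s.pp, s.acc FROM scores s "
            "WHERE s.userid = :user_id AND s.mode = :mode AND s.status = 2 "
            "ORDER BY s.pp DESC LIMIT :limit OFFSET :offset",
            {"user_id": player_id, "mode": mode,
             "limit": batch_size, "offset": offset},
        )
        if not rows:
            break
        for row in rows:
            yield row
        offset += batch_size

A caller would consume this with async for row in iter_best_scores(player_id, mode) and accumulate the weighted sums incrementally instead of materializing every row up front.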
> i think the most important issue in recalc to fix would be recalculation of score statuses when scores of a user shift around

A PR for that has already existed for quite some time: #573
I believe the simplicity of the current pure-Python way easily outweighs any performance gains from this. As someone who has run bancho.py at its limits, score submission was the least of my issues.
> i think the most important issue in recalc to fix would be recalculation of score statuses when scores of a user shift around

We have a solution for this one as well: https://github.com/ppy-sb/nyamatrix/blob/main/nyamatrix/qb/update_score_status.py

It was built to reduce memory usage on our server during recalc. With the built-in recalc script, memory usage goes above 4GB while running prod bancho.py, and our server (configured with 4 cores / 8GB) doesn't have enough memory; the new one uses 113MB doing the same work. It uses SQL as well, so it might not be your choice of tech stack.

> the issue with memory usage you're having here is because it loads all best scores for a given mode at once; this can be fixed with OFFSET

There are several technical difficulties with this approach, if I understand the OFFSET suggestion correctly:
- you still have to maintain user-mode-beatmap-scores and user-mode-beatmap-best maps in order to decide which score should be set as best.
- a large OFFSET in SQL is VERY SLOW, which is why modern apps and websites often no longer let you jump to a precise page and have chosen cursor-based infinite scrolling instead.

There's, however, a better way imo: chunk user IDs, as users don't interfere with each other (a sketch follows below).
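For illustration, a minimal sketch of that chunking idea using keyset pagination (seeking past the last processed id instead of using OFFSET), so every chunk starts from an index seek; the users table query and loop shape are assumptions here, not the actual nyamatrix code:

async def recalculate_all_users(mode: int) -> None:
    chunk_size = 500
    last_id = 0
    while True:
        # keyset pagination: cheap index seek per chunk, unlike large OFFSETs
        rows = await app.state.services.database.fetch_all(
            "SELECT id FROM users WHERE id > :last_id "
            "ORDER BY id LIMIT :limit",
            {"last_id": last_id, "limit": chunk_size},
        )
        if not rows:
            break
        for row in rows:
            # users are independent, so chunks could also run concurrently
            await sql_recalculate_mode_pp(row["id"], mode)
        last_id = rows[-1]["id"]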