fix(memory): Reduce memory and cpu usage for standardizeColNames #95

Merged: 1 commit into devel from memory-fix, Jul 15, 2024

Conversation

tonywu1999 (Contributor) commented Jul 12, 2024

Motivation and Context

Using the dataset from the FragPipe-MSstats Nature Protocols paper, I found sections of the MSstatsConvert code that are inefficient in both memory and runtime.

In one case, running .standardizeColnames reduced available memory by 70GB, even though the dataset is only 20GB. The call also took 6.5 minutes to run.

Changes

  • Refactor .standardizeColnames to be more efficient w.r.t. space and time.
  • The idea: stringi operations are memory- and time-intensive on large vectors. With millions of rows there are usually many duplicate run names, so we only need to run the stringi operations on the unique run names. We can then look up the result for every row in a table, which is O(1) per lookup.
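
The approach described above can be sketched in a few lines of R. This is a minimal illustration, not the code merged in this PR: `clean_names` is a hypothetical stand-in for the stringi-based standardization, `apply_on_unique` is an assumed helper name, and base R `gsub` is used only to keep the example dependency-free.

```r
# Hypothetical stand-in for an expensive per-string cleanup step
# (the real function uses stringi; gsub keeps this sketch self-contained).
clean_names <- function(x) {
  gsub("[^A-Za-z0-9]", "", x)
}

# Apply `f` once per distinct value, then map the results back to every row.
apply_on_unique <- function(x, f) {
  unique_values <- unique(x)
  cleaned <- f(unique_values)
  # match() returns, for each element of x, its position in unique_values,
  # so the output keeps the order and length of the input.
  cleaned[match(x, unique_values)]
}

# Six rows, but only three distinct run names to process.
runs <- rep(c("run 01.raw", "run 02.raw", "run_03.raw"), times = c(3, 2, 1))
result <- apply_on_unique(runs, clean_names)
```

With millions of rows and only a handful of distinct run names, the expensive function sees a very short vector, and the remaining work is a single hashed `match()` over the full column.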

Testing

  • Added and modified unit tests to verify that the number of rows in the converter datasets is unchanged and that .standardizeColnames preserves the order and length of its input.
  • Using the Nature Protocols dataset
    • Memory used within the function decreased from 70GB to 20GB.
    • With the original implementation, available memory was 70GB lower after the function call than before it. With the new implementation, available memory remained unchanged.
    • Runtime decreased from 6.5 minutes to 12 seconds.
    • Although RStudio memory used isn't necessarily reliable, it reported an interesting metric after the converter completed & after running garbage collection:
      • Original implementation RStudio memory used: 68GB
      • New implementation RStudio memory used: 21GB
      • My explanation here is that our old implementation caused so much overhead that even garbage collection couldn't properly clean up unused memory.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

mstaniak (Contributor) commented

Well done, thank you

devonjkohler (Contributor) left a comment

Looks great!

tonywu1999 merged commit 9a28efc into devel Jul 15, 2024
1 check passed
tonywu1999 deleted the memory-fix branch July 15, 2024 14:36