Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CPU resource monitor for Linux systems #4888

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

vmapetr
Copy link
Contributor

@vmapetr vmapetr commented Jul 9, 2024

This PR improves the resource monitor for Linux OS. It replaces the ProcessInvoker calls with the direct calls to system api to improve the overall performance, and fixes several issues with the Memory and CPU monitoring.
Changes:

  • ProcessInvoker calls replaced with the direct reads of /proc/stat and /proc/meminfo
  • Memory monitoring now calculates used memory using the available memory parameter instead of free memory which is more correct considering how Linux operates with the RAM.
  • Previously CPU usage was calculated by simple /proc/stat read which showed average CPU load during the whole uptime. Now it uses the delta calculations between 10 samples to provide CPU utilization in the current moment.

WI: 2190905
How it was tested:
Validated on Ubuntu 22.04 / Debian 12 / Alpine Linux 3.19.0 / RHEL 8 / Fedora 40 (docker).
Validation details in the related workitem.

@vmapetr vmapetr requested review from a team as code owners July 9, 2024 02:37
@vmapetr vmapetr changed the title Replace free call on linux Improve CPU resource monitor on Linux monitor Jul 11, 2024
@vmapetr vmapetr changed the title Improve CPU resource monitor on Linux monitor Improve CPU resource monitor for Linux systems Jul 11, 2024
{
Trace.Error($"Error on receiving CPU info: {message.Data}");
};
double idle = samples[i][3] - samples[i - 1][3];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to check: there may be risk of index to of range here if the array is not in the expected shape. it's better to check length of the array to ensure the indexes are in range to avoid the risk. Alternative is catch IndexOutOfRange exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index out of range issue should not happen here because of two reasons - the first index is the index of the collected sample, and we always gather samplesCount + 1 before executing the calculation loop. The only case when we can get lesser number of samples if the previous loop will be canceled due to timeout but in this case the whole method will be canceled, and we do not enter in this loop. The second index is responsible for the /proc/stat field, and the output from there may be not available in two cases: if the /proc/stat is unavailable which signalize that the OS is broken, and if the output from CPU counter has changed, which may happen extremely rarely only with major kernel updates.

// The available memory contains the sum of free, cached, and buffer memory
// it shows more accurate information about the memory usage than the free memory counter
int totalMemory = int.Parse(memoryInfo[0].Split(" ", StringSplitOptions.RemoveEmptyEntries)[1]);
int availableMemory = int.Parse(memoryInfo[2].Split(" ", StringSplitOptions.RemoveEmptyEntries)[1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check array length, there is no guarantee [0] or [2] exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we don't check the array length here is because we parse /proc/meminfo, and the indexes here may not exist only in two cases: 1. if the /proc/meminfo is unavailable - which is guarantee that the system itself is broken and most of the services will not work anyway, so the agent fails even before the monitor runs. 2. The output format in /proc/meminfo has changed - which happens rarely once in a decade and previously with the updates only the new fields were added, and already existing fields were never removed.

@@ -710,6 +710,13 @@ public class AgentKnobs
new EnvironmentKnobSource("DISABLE_RESOURCE_UTILIZATION_WARNINGS"),
new BuiltInDefaultKnobSource("false"));

public static readonly Knob DisableResourceMonitorDebugOutput = new Knob(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be EnableResourceMonitorDebugOutput, default off. As it is, the functionality will be enabled on agent upgrade and add risk at time of upgrading agent. Risk should be shifted to when enabling the feature flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants