-
Notifications
You must be signed in to change notification settings - Fork 856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve CPU resource monitor for Linux systems #4888
base: master
Are you sure you want to change the base?
Conversation
{ | ||
Trace.Error($"Error on receiving CPU info: {message.Data}"); | ||
}; | ||
double idle = samples[i][3] - samples[i - 1][3]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to check: there may be risk of index to of range here if the array is not in the expected shape. it's better to check length of the array to ensure the indexes are in range to avoid the risk. Alternative is catch IndexOutOfRange exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The index out of range issue should not happen here because of two reasons - the first index is the index of the collected sample, and we always gather samplesCount
+ 1 before executing the calculation loop. The only case when we can get lesser number of samples if the previous loop will be canceled due to timeout but in this case the whole method will be canceled, and we do not enter in this loop. The second index is responsible for the /proc/stat
field, and the output from there may be not available in two cases: if the /proc/stat
is unavailable which signalize that the OS is broken, and if the output from CPU counter has changed, which may happen extremely rarely only with major kernel updates.
// The available memory contains the sum of free, cached, and buffer memory | ||
// it shows more accurate information about the memory usage than the free memory counter | ||
int totalMemory = int.Parse(memoryInfo[0].Split(" ", StringSplitOptions.RemoveEmptyEntries)[1]); | ||
int availableMemory = int.Parse(memoryInfo[2].Split(" ", StringSplitOptions.RemoveEmptyEntries)[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check array length, there is no guarantee [0] or [2] exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we don't check the array length here is because we parse /proc/meminfo
, and the indexes here may not exist only in two cases: 1. if the /proc/meminfo
is unavailable - which is guarantee that the system itself is broken and most of the services will not work anyway, so the agent fails even before the monitor runs. 2. The output format in /proc/meminfo
has changed - which happens rarely once in a decade and previously with the updates only the new fields were added, and already existing fields were never removed.
@@ -710,6 +710,13 @@ public class AgentKnobs | |||
new EnvironmentKnobSource("DISABLE_RESOURCE_UTILIZATION_WARNINGS"), | |||
new BuiltInDefaultKnobSource("false")); | |||
|
|||
public static readonly Knob DisableResourceMonitorDebugOutput = new Knob( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be EnableResourceMonitorDebugOutput, default off. As it is, the functionality will be enabled on agent upgrade and add risk at time of upgrading agent. Risk should be shifted to when enabling the feature flag.
This PR improves the resource monitor for Linux OS. It replaces the ProcessInvoker calls with the direct calls to system api to improve the overall performance, and fixes several issues with the Memory and CPU monitoring.
Changes:
/proc/stat
and/proc/meminfo
used memory
using theavailable memory
parameter instead offree memory
which is more correct considering how Linux operates with the RAM./proc/stat
read which showed average CPU load during the whole uptime. Now it uses the delta calculations between 10 samples to provide CPU utilization in the current moment.WI: 2190905
How it was tested:
Validated on Ubuntu 22.04 / Debian 12 / Alpine Linux 3.19.0 / RHEL 8 / Fedora 40 (docker).
Validation details in the related workitem.