Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helm chart for pcm-sensor-server #727

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

ppalucki
Copy link
Contributor

@ppalucki ppalucki commented Apr 26, 2024

Initial version of Helm chart for deploying PCM

Features include:

  • Two methods of collecting metrics (indirect the default, by using Linux interfaces perf/resctrl) or direct by accessing msr registers directly through linux msr module,
  • Support for bare-metal (full set of metrics) and VM cloud instances (limited set of metrics),
  • Optional integration with Prometheus opeartor, node-feature-discovery and NRI balloons policy plugin,
  • Other described here: https://github.com/ppalucki/pcm/blob/ppalucki/helm/deployment/pcm/README.md

TODO:

Must have:

  • consider using CAP_PERFMON (kernel 5.8+) instead of SYS_ADMIN with default perf_event_paranoid=-2 (SYS_ADMIN is just for backward compatibility) more info here
  • direct method (with msr access) bug and privilaged (described in alternative methods) on kernels 5.9+ /sys/modules/msr/parameters/allow_writes is not accessible (msr.cpp doesn't use tryOpen (nor readSys/writeSys) with /pcm prefix ends up with non-critical warning
  • Resolve issue with mounting pcm/resctrl (EDITED: reason some missconfiguration with kind or sys or conflict with base_runtime_spec, docker still doesnt work)
  • Check if energy metrics can be accessible through perf subsystem (so far no available for indirect method but perf list | grep -i energy returns power/energy-pkg/ and power/energy-ram/ possible development to PCM required - are available through perf subsystem perf_even_open syscall and enumerating/discovering /sys/devices/power/events PMU type=9 or through power cap framework similary to node_exporter/rapl collector, reading data directly from sysfs like this cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
  • Reorganize README to move optional feature down below (or separate documents)
  • Optional vertical pod autoscaler and proper resource sizing (cpu/mem requests/limits)
  • automation for e2e and linter helm chart,
  • Implement Helm chart test pods
  • Expose metrics with TLS by default (node-exporter can use 3party proxy example see example
  • Remove some debug code (debugPcm/debugSleep/prints, extra testing packages from docker image)
  • clean up unused mounts sysfcMount (we already has sys mounted - just to be sure it is usable)
  • Fix NOTES (WIP)
  • Docs: consider documentation values in a form of table (e.g. https://artifacthub.io/packages/helm/bitnami/prometheus#parameters or https://artifacthub.io/packages/helm/bitnami/node-exporter#parameters) or at least something similar to https://artifacthub.io/packages/helm/prometheus-community/prometheus#configuration
  • "Broken pipe, Socket operation on non-socket" errors when PodMonitor is enabled (ulimit/permissions?!)
  • Grafana panel automatically deployed (currently only there are instructions, limitation grafana panel isn't universal)
  • GitHub actions for linter/security scanners (added Mafefile with linter)
  • Optional refactor extra features (to be discussed during review): node-feature-discovery, NRI interegration only as extra values for generic fields (annotations, nodeSelector/nodeAffinity)
  • Test liveness/readiness probes work as expected in failure case,

Testing:

  • NMI watch is properly reconfigued (disabled/enabled) basd on VM/env flags (now it is set to RO and it works?!)
  • Testing in Cluster Manager Systems like (e.g. Ranger,Gardener) different node types VM(1socket,all sockets), bare-metal
  • Test in different cloud GCP/Azure/AWS

Follow up tasks (in another PRs ideas):

  • pcm-sensor-server: expose missing metrics available in other tools:
    • energy,
    • PCI-e as from pcm-pcie,
    • numa-split
  • Change metrics names (follow Prometheus best practices) - it will be separate PR
  • init container to check permission for all required components (devices/CPU)
  • Configurable collection period (and aggregators history) to limit cpu/memory usage (TODO: do some memory profiling first)

@ppalucki ppalucki changed the title Helm chart for pcm - initial version Helm chart for pcm-sensor-server Apr 26, 2024
@jcpunk
Copy link
Contributor

jcpunk commented Apr 26, 2024

It would be helpful if you could put custom labels on the podMonitor to let prometheus filter based on that.

@rdementi
Copy link
Contributor

the FreeBSD check failure is unrelated

Copy link
Contributor

@rdementi rdementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start! Your TODO list makes sense to me. Thank you!

deployment/pcm/README.md Show resolved Hide resolved
@@ -551,7 +551,7 @@ bool PCM::L3CacheOccupancyMetricAvailable() const

bool PCM::CoreLocalMemoryBWMetricAvailable() const
{
if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata
//if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you remove these checks for testing purposes?

@@ -561,7 +561,7 @@ bool PCM::CoreLocalMemoryBWMetricAvailable() const

bool PCM::CoreRemoteMemoryBWMetricAvailable() const
{
if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata
//if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@fmuyassarov fmuyassarov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only checked the commit that adds the Helm chart and it looks good to me. The only thing I wondered is do you really expect users to modify all the available values in the values.yaml. To me some looked not very necessary but I can't judge since I'm not familiar with PCM.
LGTM

deployment/pcm/values.yaml Outdated Show resolved Hide resolved
@ppalucki
Copy link
Contributor Author

ppalucki commented May 8, 2024

It would be helpful if you could put custom labels on the podMonitor to let prometheus filter based on that.

@jcpunk Thanks for this comment, that is very helpfull indeed - having that there is no longer need to hack prometheus-operator chart to disable podMonitorSelector - I added comments in README/values file about this (check this changes in commit for details (7f2c707#diff-618d3b78482c88190c469bb01731f774bb931bcdc14db7b8980691f5745ba08aR151-R152) or this documentation added in values

# Extra PodMonitor labels to let Prometheus operator filter based on that
# e.g. default "kube-prometheus-stack" helm chart requires additional release:"{name of chart release}" label in podMonitor to be considered
# here is example how to check extra labels required to be added to PodMonitor
# 1) kubectl get prometheus -o jsonpath='{.items[].spec.podMonitorSelector.matchLabels}' # e.g. release: prometheus
# 2) helm install pcm . --set podMonitor=true --set podMonitorLabels.release=prometheus

Anyway help for pointing this.

@ppalucki
Copy link
Contributor Author

ppalucki commented May 8, 2024

I only checked the commit that adds the Helm chart and it looks good to me. The only thing I wondered is do you really expect users to modify all the available values in the values.yaml. To me some looked not very necessary but I can't judge since I'm not familiar with PCM. LGTM

That is the trade of between flexibility and complexity that I'm finding hard to balance.

I see two options here:

  1. Limit number of features (supported collection methods direct/indirect, podmonitor, NFD) - worth considering, but is hard to predict which features are valueable for others (even for me modest needs I see cases where I want all of them).

  2. Move logic from values to templates (less "raw" values) - e.g. instead of requiring to explicit enviornment values directly, expose values that represent set of them like "--set directCollection=false or true" will set proper combination of (PCM_NO_MSR, PMC_NO_PERF) - but I alread tried that and it would add a lot of complexity in templates (e.g. PCM_NO_MSR conflicts with PCM_NO_PERF and both are related to PCM_USE_UNCORE_PERF), imagine I could do something like this inside template:

if direct:
  if vm:
    if rdt:
       PCM_NO_MSR=0, PCM_NO_PERF=1, PCM_USE_UNCORE_PERF=0
  else:
       PCM_NO_MSR=1 PCM_NO_PERF=0, PCM_USE_UNCORE_PERF=1
  ....
else:
  if vm:
    PCM_NO_MSR=0, PCM_NO_PERF=1, PCM_USE_UNCORE_PERF=0
  else:
    PCM_NO_MSR=1 PCM_NO_PERF=0, PCM_USE_UNCORE_PERF=1    

and try to handle all possible (both proper and inproper combination) - but now rewrite this using go/helm template language - not very maintanable nor readable - I don't really want chart to be validator of possible PCM envs and I'm terified of hardcoding all proper possible combinations - in other words - if you don't like defaults (or other example value files) - then you're responsible for validating that pcm-sensor-server binary will run - but finding right combination is your job in your case.

(there is also another option - e.g. allow to pass *any enviornment/options to pcm - but that is discouraged security practice)

I the end, I decided to just allow use of this pcm chart to pass those "all PCM suppored" environment variables directly - and tried to cover possible value of combinations as different values files.

One more comment, I agree that value file is alread quite big, but not yet as scary as the I see in prometheus node exporter official chart values.yaml - it is 500 lines vs my so far about 100 (but I miss some feature though RBACs, vertical scaling) which I'm using as example of good practicies :)

@ppalucki ppalucki force-pushed the ppalucki/helm branch 3 times, most recently from a30b342 to 8edb9dc Compare May 22, 2024 15:49
old comments:

sys/pci/mcfg mounts are unnessesary for indirect method
fix old wrong defaults in README
fix formatting
possible fix for issue with resctrl
remove hacks to handle /pcm/resctrl and unessesary out-of-date files
update License to use the same as pcm itself
update README, remove out-of-date info
links do values
formatting + links do values
update README an values comments
update README
address jcfunk comments: interval and extra labels for PodMonitor + refactor readme
fix typos
readme: reminder about removing msr kernel module
after rebasing: point to correct default pcm image from intel organization

Refactoring:

- explicit values file for privileged direct method,
- hide (into docs directory) "unprivileged" direct method (and fixes),
- remove unnessesary mounts (mcfg, /dev/cpu/dev/mem for privileged access),
- add instructions to collection methods,
- fixes (extra builder) for build local development image,
- silent mode
- move collection methods to the top

fix values files for direct privileged method

New: support for PERFMON capability, silent mode and some extra env
debug variables

VPA: v1 - first version of vertical pod autoscaler

Grafana dashboard: instructions

rename resctrlHostMount to resctrlMount

fix dashboard rate interval

pcm-sensor-server: add new metrics DRAM Local percantage

Fix dockerbuild by using separate Dockerfile + build in dockerignore

improve dockerfile.debug

extra env PCM_NO_MAIN_EXCEPTION_HANDLER
Copy link
Contributor

@rdementi rdementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Comment on lines +430 to +433

if (pcm->localMemoryRequestRatioMetricAvailable())
printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) );

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the addition of a new metric should be split out into a separate merge request or at least a separate commit.

Comment on lines +722 to +725

if (pcm->localMemoryRequestRatioMetricAvailable())
printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) );

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

@Damenus
Copy link
Member

Damenus commented Jun 26, 2024

Hello,
I've attempted to deploy the new Helm charts introduced in this pull request, but I'm encountering an error during the installation process. Here are the details:

image

I think that I have successfully made all steps before 6) Deploy PCM helm chart in paragraph Validation on local kind cluster
Could you add information on how to deal with it when this error happens?


- kubectl/kind/helm/jq binaries available in PATH,
- docker service up and running.
- full set of metrics available only bare-metal instance or Cloud .metal instance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could add information on how to check it or enabled it.

helm install ... --set nfd=true --set podMonitor=true --set verticalPodAutoscaler.enabled=true
```

### Requirements
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could add information on how to check it or enable all requirements.

@Damenus
Copy link
Member

Damenus commented Jul 17, 2024

I was trying to run e2e test make e2e-default, but it doesn't work without extra steps. I got error in pod pcm:

Linux Perf: Error when programming INST_RETIRED, error: Permission denied with config 0x1 config1 0x0 for tid -1 leader -1
try running with environment variable PCM_NO_PERF=1

Please, add information about how to add this env

Also, the file ./_kind_with_registry.sh which is downloading during the test execution from Makefile is broken. The Cluster yaml definition is not valid. Removing a broken line helped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants