improve CPU cores affinity to NUMA nodes v8 #12336

Closed
wants to merge 10 commits

Conversation

lukashino
Contributor

@lukashino lukashino commented Jan 3, 2025

Follow-up of #12316

Redmine tickets:
https://redmine.openinfosecfoundation.org/issues/7036
https://redmine.openinfosecfoundation.org/issues/2321

This work allows more precise thread assignment, and it is done either:

  • automatically - by picking cores of the same NUMA locality as the interface from the worker-cpu-set
  • manually - you can specify interface-specific settings in the threading section

This works with AF-PACKET and DPDK (and should work with other capture modes as well). The primary target is the workers runmode, but it also works with autofp receive threads.

To try this PR:

  1. Compile Suricata from this branch.
  2. Modify the newly generated suricata.yaml according to your setup (a minimal sketch follows these steps).
  3. Run as usual.
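
A minimal sketch of the relevant threading section (illustrative only - the full reference is in suricata.yaml.in; autopin assumes a build configured with --enable-hwloc, and the interface name "ens1f0" is just an example):

threading:
  autopin: yes # pick worker CPUs from the same NUMA node as the interface
  cpu-affinity:
    worker-cpu-set:
      cpu: [ "all" ]
      mode: "exclusive"
      interface-specific-cpu-set:
        - interface: "ens1f0"
          cpu: [ "all" ]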

Describe changes:

v8:

  • interface-specific settings now don't require hwloc dependency
  • CPU affinity refactoring changes segregated to a separate commit
  • added docs - upgrade docs and general threading docs
  • a small change in the title

v7:

  • hwloc is now an optional dependency - it can be enabled with --enable-hwloc in the configure step,
  • autopinning / autooptimizing is configurable from threading.autopin suri.yaml option,
  • making CI happy (and with that fixing a few bugs)

v6:

  • added support for autofp mode - receive-cpu-set can now contain per-interface CPU affinity settings - here comes a question: should worker-cpu-set in autofp be somehow configurable to the receive-cpu-set threads?
  • YAML node "per-iface" changed to "interface-specific-cpu-set"
  • list form of (management/worker/...)-cpu-set was changed to simple tree nodes
  • cpu/prio/mode properties of (management/worker/...)-cpu-set nodes moved one YAML layer up
  • DPDK (net_bonding / PCIe address / ... ) / AFP interfaces are supported

On YAML changes:

The CPU affinity section in suricata.yaml has been changed as follows (see suricata.yaml.in for the full picture):

threading:
  cpu-affinity:
    worker-cpu-set:
      cpu: [ "all" ]
      mode: "exclusive"
      interface-specific-cpu-set:
        - interface: "net_bonding0" # 0000:3b:00.0 # net_bonding0 # ens1f0
          cpu: [ "all" ]
        - interface: "net_bonding1"
          cpu: [ "all" ]

Below is the reasoning for it.

When starting this work I wanted to follow the same structure as the CPU set nodes, that is:

threading:
  cpu-affinity:
    - worker-cpu-set:
        cpu: [ "all" ]
        mode: "exclusive"

converted to JSON it looks like:

{
    "threading": {
        "cpu-affinity": [
            {
                "worker-cpu-set": {
                    "cpu": [
                        "all"
                    ],
                    "mode": "exclusive"
                }
            }
        ]
    }
}

So worker-cpu-set is an object by itself: the list item holds a single key named "worker-cpu-set", and only that key contains the children properties (cpu, prio, etc.).

But when adding the interface-specific node I found out that having the interface names directly as the keys of the object storing the properties (cpu, prio, etc.) lets the YAML loader change the name to something else, in this case net_bonding0 becomes net-bonding0:

threading:
  cpu-affinity:
    - worker-cpu-set:
        cpu: [ "all" ]
        mode: "exclusive"
        interface-specific-cpu-set:
          - net_bonding0:
              cpu: [ "all" ]
          - net_bonding1:
              cpu: [ "all" ]

So adding it as a value:

threading:
  cpu-affinity:
    - worker-cpu-set:
        cpu: [ "all" ]
        mode: "exclusive"
        interface-specific-cpu-set:
          - interface: net_bonding0
              cpu: [ "all" ]
          - interface: net_bonding1
              cpu: [ "all" ]

yields an invalid YAML construct.

Therefore, for interface-specific CPU affinity settings, I ended up putting the properties on the same level as the interface name itself. This follows the structure already established in e.g. the capture-mode interface definitions (dpdk.interfaces, af-packet).
Note: as part of this I also converted the base *-cpu-set nodes from list items to plain mapping nodes (Ticket 2321).
The interface-specific CPU affinity design I ended up with is:

threading:
  cpu-affinity:
    worker-cpu-set:
      cpu: [ "all" ]
      mode: "exclusive"
      interface-specific-cpu-set:
        - interface: "net_bonding0" # 0000:3b:00.0 # net_bonding0 # ens1f0
          cpu: [ "all" ]
        - interface: "net_bonding1"
          cpu: [ "all" ]

JSON visualization of this:

{
    "threading": {
        "cpu-affinity": {
            "worker-cpu-set": {
                "cpu": [
                    "all"
                ],
                "mode": "exclusive",
                "interface-specific-cpu-set": [
                    {
                        "interface": "net_bonding0",
                        "cpu": [
                            "all"
                        ]
                    },
                    {
                        "interface": "net_bonding1",
                        "cpu": [
                            "all"
                        ]
                    }
                ]
            }
        }
    }
}

This leaves the base {worker,receive,management}-cpu-set nodes in their original form, as we know them. The internal code can handle this because the names of these nodes are known beforehand, so it can "step down" to the children level to inspect the CPU affinity properties. The dynamic part - interface-specific-cpu-set - contains all properties on the same level.
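
For illustration, a minimal sketch of that lookup pattern, assuming Suricata's ConfNode API (ConfGetNode, ConfNodeLookupChild, ConfNodeLookupChildValue) - this is a paraphrase, not the PR's actual code:

#include <string.h>
#include "conf.h"

/* Known, fixed node names let the code "step down" one level; the dynamic
 * interface-specific list is matched by the value of its "interface" key. */
static ConfNode *FindInterfaceSpecificSet(const char *iface)
{
    ConfNode *wcpu = ConfGetNode("threading.cpu-affinity.worker-cpu-set");
    if (wcpu == NULL)
        return NULL;

    ConfNode *list = ConfNodeLookupChild(wcpu, "interface-specific-cpu-set");
    if (list == NULL)
        return NULL;

    ConfNode *item;
    TAILQ_FOREACH(item, &list->head, next) {
        const char *name = ConfNodeLookupChildValue(item, "interface");
        if (name != NULL && strcmp(name, iface) == 0)
            return item; /* cpu/prio/mode live directly under this item */
    }
    return NULL;
}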

Other unsuccessful YAML designs

Specifying the NUMA affinity in the individual capture modes

Example

dpdk:
  - interface: "3b:00.0"
    threads: [ 2, 4, 6, 8]

But management / other threads would still have to be configured elsewhere. I liked the proposed approach more because all affinity-related settings are centralized.

Not having list nodes in the YAML

threading:
  worker-cpu-set:
    net_bonding0:
      cpu: [ 1, 3, 5 ]

The YAML library changes net_bonding to net-bonding when the YAML is loaded - so this is a no-go.

Have interface-related properties as children

threading:
  worker-cpu-set:
    - interface: net_bonding0
        cpu: [ 1, 3, 5 ]

This is a disallowed construct in YAML, so another no-go. A similar pattern is currently used with the *-cpu-set nodes, but the difference here is that the parent list item has a value (in this case net_bonding0). That is what causes problems. Interface-related properties therefore need to sit right under "interface".

PRO TIP: A good way to visualize different YAML structures is by using YAML-to-JSON converters. Currently it creates an object like:

[
  {
    "interface": "3b:00.0",
    "mode": "exclusive"
  }
]

Lukas Sismis added 10 commits January 3, 2025 16:24
Split the code into multiple functions for easier readability.
Part of Ticket 2321 work to remove unnecessary lists from
the config file.

Ticket: 2321
Provide backward compatibility with the previous configuration
format to allow smooth transition to the new format.
The commit adds docs about the new format and the introduced changes.
Using the new configuration format, it is now possible to set CPU affinity
settings per interface.

The threading.autopin option has been added to automatically use CPUs from the
same NUMA node as the interface. The autopin option requires
hwloc-devel / hwloc-dev to be installed and the --enable-hwloc flag in the
configure step (see the build sketch below).

Ticket: 7036
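
For reference, a build enabling the optional hwloc support might look like this (assuming the hwloc development package is already installed):

./configure --enable-hwloc
make -j$(nproc)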

codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 46.68588% with 185 lines in your changes missing coverage. Please review.

Project coverage is 83.20%. Comparing base (6f937c7) to head (a717c60).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12336      +/-   ##
==========================================
- Coverage   83.26%   83.20%   -0.07%     
==========================================
  Files         912      913       +1     
  Lines      257643   257877     +234     
==========================================
+ Hits       214521   214557      +36     
- Misses      43122    43320     +198     
Flag Coverage Δ
fuzzcorpus 61.13% <0.00%> (-0.01%) ⬇️
livemode 19.43% <46.68%> (+0.03%) ⬆️
pcap 44.36% <0.34%> (-0.07%) ⬇️
suricata-verify 62.79% <0.34%> (-0.07%) ⬇️
unittests 59.13% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

@suricata-qa

Information: QA ran without warnings.

Pipeline 24070

if (cpu_node != NULL) {
    BuildCpuset(setname, cpu_node, &taf->cpu_set);
} else {
    SCLogInfo("Unable to find 'cpu' node for set %s", setname);
Contributor

What are the consequences of not finding the CPU node for the set?

An info message is emitted -- should it be a warning? If there are no consequences, should it be a debug?

Contributor Author

@lukashino lukashino Jan 7, 2025

Changed to a warning, as the cpu setting in the cpu-affinity node is quite crucial. Other properties - mode, priority - were left as debug printouts, as they are less important in my opinion. When they are missing, the default values are picked up, e.g.:

{
    .name = "worker-cpu-set",
    .mode_flag = EXCLUSIVE_AFFINITY,
    .prio = PRIO_MEDIUM,
    .lcpu = { 0 },
},

if (node != NULL) {
    BuildCpuset(setname, node, cpuset);
} else {
    SCLogDebug("Unable to find '%s' priority for set %s", priority, setname);
Contributor

Same comment regarding consequences and message level.

Contributor Author

Staying at debug, as it is not critical for prio and mode, and sane defaults are picked up.


static bool AllCPUsUsed(ThreadsAffinityType *taf)
{
    if (taf->lcpu < UtilCpuGetNumProcessorsOnline()) {
Contributor

Suggest return taf->lcpu >= UtilCpuGetNumProcessorsOnline();

Contributor Author

Ah, I see why the comment has the "Outdated" label - this was changed in a subsequent commit.
When implementing the new per-interface CPU affinity settings I also refactored the code a bit. To avoid having two changes (feature + refactoring) in one commit, I divided it into two - first the refactoring of the old codebase, then the implementation of the newly added feature.

This gets changed in the follow-up commit to this:

static bool AllCPUsUsed(ThreadsAffinityType *taf)
{
    for (int i = 0; i < MAX_NUMA_NODES; i++) {
        if (taf->lcpu[i] < UtilCpuGetNumProcessorsOnline()) {
            return false;
        }
    }
    return true;
}

so if it is ok, I would leave it as is.

static uint16_t GetNextAvailableCPU(ThreadsAffinityType *taf)
{
    uint16_t cpu = taf->lcpu;
    int attempts = 0;
Contributor

Wondering why there are 2 attempts; this seems non-deterministic? What will change between the 1st and 2nd attempts?

Contributor Author

This is relevant for the newer version too.

TLDR: To make it more readable, I added an extra check for an empty CPU set.

Full explanation:

There are two attempts to try to get some CPU; if you end up needing a third pass, you know your CPU affinity is incorrectly configured (empty).

So say you have 4 CPU cores.
Affinity is configured as:
CPU: [ 1,2 ]

Your interface is configured to run on 4 workers.

The affinity is converted to a bitmask - 0 1 1 0 (a 1 marks the indexes of the CPUs we want to run on).
The affinity variable (taf) has a property lcpu - the last CPU used - so we know on which core to continue between worker assignments.

Worker 1 - gets core 1, sets lcpu to 1 in the first pass.
Worker 2 - gets core 2, sets lcpu to 2 in the first pass.
Worker 3 - gets core 1, sets lcpu to 1 in the second pass - it had to evaluate core 3, then lcpu was reset to 0 and the number of attempts was increased by one.
Worker 4 - gets core 2, sets lcpu to 2 in the first pass.

If you run into the third pass then you know something is wrong. I believe that can only happen with an empty CPU set, i.e. cpu: [ ].

But then I reevaluated the point, and to make it easier to comprehend (so one does not have to dig all the way down to here), I added an extra check prior to running this function, checking the CPU count:
if (CPU_COUNT(&taf->cpu_set) == 0) {

and added an extra comment to this function.
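
To make the walkthrough concrete, here is a condensed sketch of the logic described above (a paraphrase using the pre-refactor scalar lcpu and minimal stand-in types, not the PR's exact code):

#define _GNU_SOURCE
#include <sched.h>  /* cpu_set_t, CPU_COUNT, CPU_ISSET */
#include <stdint.h>

/* Stand-in for the relevant fields of Suricata's ThreadsAffinityType. */
typedef struct {
    cpu_set_t cpu_set; /* configured affinity mask */
    uint16_t lcpu;     /* last CPU used, to resume round-robin assignment */
} AffinitySketch;

static uint16_t GetNextAvailableCPUSketch(AffinitySketch *taf, uint16_t ncpus)
{
    /* The extra up-front guard: an empty set would otherwise exhaust
     * both attempts below. */
    if (CPU_COUNT(&taf->cpu_set) == 0)
        return 0; /* misconfigured - caller should warn */

    uint16_t cpu = taf->lcpu;
    int attempts = 0;
    while (attempts < 2) {
        if (cpu >= ncpus) { /* wrap around; needing a third pass means failure */
            cpu = 0;
            attempts++;
            continue;
        }
        if (CPU_ISSET(cpu, &taf->cpu_set)) {
            taf->lcpu = cpu + 1; /* continue after this core next time */
            return cpu;
        }
        cpu++;
    }
    return 0; /* unreachable with a non-empty set */
}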

FatalError("unknown cpu_affinity node");
}
}
SCLogConfig("Found affinity definition for \"%s\"", setname);
Contributor

nit: suggest consistent terminology usage -- both "affinity" and "CPU affinity" are used in messages.

Contributor Author

Tried to convert all "affinity" usages to "CPU affinity".

    if_obj = HwLocDeviceGetByPcie(topology, tv->iface_name);
}

if (if_obj != NULL) {
Contributor

Suggest using this to conditionalize the call to HwlocObjectDump:

  if (sc_log_global_log_level >= SC_LOG_DEBUG)

Contributor Author

FYI, I've found out there is an SCLogGetLogLevel() function to get the log level.

}
#endif /* HAVE_HWLOC */

static bool CPUIsFromNuma(uint16_t ncpu, uint16_t numa)
Contributor

This is called from a while loop in one instance. Could the CPU set for the NUMA node be retrieved once and then compared multiple times instead?
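
For illustration, one way to do that with the hwloc API (a hedged sketch with hypothetical helper names, not the PR's code):

#include <hwloc.h>
#include <stdbool.h>
#include <stdint.h>

/* Resolve the NUMA node's cpuset once, outside the loop... */
static hwloc_const_bitmap_t NumaNodeCpuset(hwloc_topology_t topo, uint16_t numa)
{
    hwloc_obj_t node = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, numa);
    return (node != NULL) ? node->cpuset : NULL;
}

/* ...then test candidate CPUs against it cheaply inside the loop. */
static bool CpusetHasCpu(hwloc_const_bitmap_t cpuset, uint16_t ncpu)
{
    return cpuset != NULL && hwloc_bitmap_isset(cpuset, ncpu);
}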

}
}

if (cpu >= 0)
Contributor

nit: prefer use of { and }

if (!printed) {
    printed = true;
    SCLogWarning("threading.autopin option is enabled but hwloc support is not available. "
                 "Please recompile Suricata with hwloc support to enable this feature.");
Contributor

nit: should mention reconfiguring, similar to existing log messages, e.g.,

"NFQUEUE not enabled. Make sure to pass --enable-nfqueue to configure when"

@@ -175,6 +175,7 @@ int RunModeSetLiveCaptureAutoFp(ConfigIfaceParserFunc ConfigParser,
        FatalError("TmThreadsCreate failed");
    }
    tv_receive->printable_name = printable_threadname;
    tv_receive->iface_name = dev ? strdup(dev) : NULL;
Contributor

handle strdup failure.
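
A sketch of the suggested check, reusing names from the hunk above:

tv_receive->iface_name = dev ? strdup(dev) : NULL;
if (dev != NULL && tv_receive->iface_name == NULL) {
    FatalError("failed to duplicate interface name"); /* strdup failed (OOM) */
}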

@victorjulien
Member

Can you update the PR description to start with what the scheme is in the PR? It has a helpful discussion of what it does not do, which leaves me guessing a bit :)

@lukashino
Contributor Author

lukashino commented Jan 6, 2025

@victorjulien updated the PR description. The full example can be seen in suricata.yaml.in.
Reading through it, I thought that maybe the discussion there is a bit lengthy and winds into unnecessary details, but the idea was to lay down the details and arguments for why I did what I did so I won't forget about them. You can view suricata.yaml.in and we can discuss on top of that.

@jlucovsky thanks for the feedback, I am going to address that shortly.

@ct0br0

ct0br0 commented Jan 8, 2025

skimming through and had a few questions.

it looks like if you use all then anything on that numa will be used?

does it automatically avoid, or could you specify not to use, other threads - say management threads? !management or just !0, any core you want?

could this work for pcap mode too? specify say numa0

kind of a nit, "all" doesn't really sound direct, may seem like you're just going to use every core you can. "auto" or "numa" or some combination may be more specific.

@lukashino
Contributor Author

Moving part of the discussion points to #12359

@lukashino lukashino closed this Jan 8, 2025