Add support for agent self-restarting for Linux: Design #63

lchico · 2024-08-05T13:25:26Z

Parent issue:

Add support for agent self-restarting #54

Description

Outline the design and workflow for the restart command.

Functional requirements

The agent should accept a command to initiate a restart.
The agent should perform a clean restart, ensuring all processes are correctly shut down and restarted.
Log the restart event, including the timestamp and initiating command details.

Non-functional requirements

The restart process should not exceed a predefined time limit (e.g., 30 seconds) to minimize downtime.

Implementation restrictions

Start from the Spike issue: Agent command manager #4
Ensure that the implementation is thoroughly tested with unit tests and integration tests.

lchico · 2024-08-05T20:54:11Z

Update 2024-08-05

I have been gathering context and reading the parent issues.
I reviewed the current code and everything that might be related to the topic.

lchico · 2024-08-07T01:47:25Z

Update

Confirm Module Restarts: This step involves checking the status of all restarted modules to ensure they are running as expected.
Shutdown Module: Wait for a graceful shutdown. If the module doesn't respond within the timeout, handle the timeout error (open question: should there be an option to force-terminate it?).
Module Ready: Check if the module is ready to be shut down (e.g., no ongoing operations).
Handle Timeout Error: Log an error indicating the module failed to shut down gracefully. Handle the error (e.g., retry, escalate).
All Modules Shutdown: Check if all modules have been successfully shut down.
Restart Modules: Start the modules in the desired order (e.g., stateless first). Monitor module startup for errors.
Log Restart Event: If all modules are confirmed restarted, log the successful event.
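
The shutdown part of these steps can be sketched as a polling loop with a per-module deadline. This is only a minimal illustration, not the agent's code: `shutdown_all`, the `modules` mapping, and the "ready" probes are hypothetical names, and the 30-second default is taken from the non-functional requirement above.

```python
import logging
import time
from typing import Callable, Dict, List

log = logging.getLogger("restart")


def shutdown_all(
    modules: Dict[str, Callable[[], bool]],
    timeout_s: float = 30.0,
    poll_s: float = 0.1,
) -> List[str]:
    """Wait for every module to confirm shutdown, with a per-module timeout.

    `modules` maps a module name to a hypothetical probe that returns True
    once the module has shut down. Returns the names that timed out.
    """
    timed_out: List[str] = []
    for name, is_ready in modules.items():
        deadline = time.monotonic() + timeout_s
        while not is_ready():
            if time.monotonic() >= deadline:
                # Handle Timeout Error: log it and let the caller decide
                # whether to retry, escalate, or force-terminate.
                log.error("module %s did not shut down within %.1fs", name, timeout_s)
                timed_out.append(name)
                break
            time.sleep(poll_s)
        else:
            # The loop ended without a break, so the probe returned True.
            log.info("module %s shut down gracefully", name)
    return timed_out
```

An empty return value corresponds to the "All Modules Shutdown" check passing, after which the modules can be restarted.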

Flowchart on mermaidchart

flowchart TD
    A[Receive Restart Command] --> B[Send Terminate Signals]
    B --> C[Identify Modules]
    C --> D[Loop through Modules]
    D --> E{Module Ready?}
    E -- Yes --> F[Shutdown Module]
    E -- No --> G[Check Timeout]
    G -- Yes --> H[Handle Timeout Error]
    G -- No --> D
    F --> D
    H --> D
    D --> I[All Modules Shutdown?]
    I -- Yes --> J[Restart Modules]
    I -- No --> D
    J --> K[Confirm Module Restarts]
    K --> L[Log Restart Event]

Note:
Stateful Modules: Should implement mechanisms to save their state before responding positively to the "Ready for Shutdown?" check. They should also be prepared to restore state upon restart.
Stateless Modules: Can typically respond positively to the "Ready for Shutdown?" check immediately.
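
To make the difference concrete, here is a rough sketch of how the two kinds of module might answer the "Ready for Shutdown?" check. The class names, the `ready_for_shutdown` and `restore` methods, and the JSON state file are all assumptions for illustration; the real modules are expected to go through the Agent comms API.

```python
import json
import tempfile
from pathlib import Path


class StatelessModule:
    def ready_for_shutdown(self) -> bool:
        # Nothing to persist, so the module can answer immediately.
        return True


class StatefulModule:
    def __init__(self, state_path: Path) -> None:
        self.state_path = state_path
        self.state: dict = {}

    def ready_for_shutdown(self) -> bool:
        # Persist first; answer "ready" only once the state is on disk.
        tmp = self.state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state))
        tmp.replace(self.state_path)  # rename over the old file
        return True

    def restore(self) -> None:
        # Called on startup; a missing file means a clean start.
        if self.state_path.exists():
            self.state = json.loads(self.state_path.read_text())


def demo() -> dict:
    """Save state in one instance and restore it in a fresh one."""
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "fim_state.json"
        before = StatefulModule(path)
        before.state = {"last_scan": 42}
        before.ready_for_shutdown()
        after = StatefulModule(path)
        after.restore()
        return after.state
```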

lchico · 2024-08-07T22:18:45Z

Update

After syncing with the team, it looks like I can continue with the next step for the parent issue. I was able to set up a Docker environment to work in. However, after reviewing some other issues, I realized that I still need to gather more information.

Component diagram

Get it from #2

Stateful module: use Agent comms API

Get it from #1 (comment)
Note: The currently identified stateful modules are: FIM, Inventory, and SCA.

Command manager

Get it from: #4 (comment)

lchico · 2024-08-09T00:54:43Z

Update

Based on the previous issues and images, I propose creating a CLI (Command Line Interface) that connects with the Client on the Agent Comms API to send commands to each module. This approach allows us to reuse the existing API communications without generating new code.

Flowchart on mermaidchart

flowchart TD
    subgraph COMMS_API["Agent comms API"]
        C["Client"]
    end

    subgraph COMMAND_MANAGER["Command Manager"]
        D["Command Receiver"]
    end

    subgraph Agent["Agent"]
        B["Cli -<br> _command line interface_"]
        COMMS_API
        COMMAND_MANAGER
    end

    B --> COMMS_API
    C --> COMMAND_MANAGER

For the next step, the MVP, we can reuse the code implemented here (issue: Agent Command Manager #4).

The Client should generate a message of the following form for the Command Manager:

{
  "command": {
    "name": "local_agent",
    "type": "restart-agent"
  },
  "origin": {
    "moduleName": "client",
    "serverName": ""
  },
  "parameters": {
    "data": "all",
    "error": 0,
    "extra_args": [],
    "status": "pending"
  }
}
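
As an illustration, the message above could be built and recognized with a couple of small helpers. The field names come from the JSON in this comment; the helper names and the idea that `parameters.data` carries the restart target are assumptions.

```python
import json
from typing import Any, Dict


def build_restart_command(target: str = "all") -> Dict[str, Any]:
    """Build the restart message; `target` fills parameters.data (assumed meaning)."""
    return {
        "command": {"name": "local_agent", "type": "restart-agent"},
        "origin": {"moduleName": "client", "serverName": ""},
        "parameters": {
            "data": target,
            "error": 0,
            "extra_args": [],
            "status": "pending",
        },
    }


def is_restart_command(raw: str) -> bool:
    """Return True if `raw` is a JSON object whose command type is restart-agent."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(msg, dict):
        return False
    command = msg.get("command")
    return isinstance(command, dict) and command.get("type") == "restart-agent"
```

A receiver could call `is_restart_command` on each incoming payload before handing it to the restart logic.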

Some essential features for this MVP include:

Mock modules with timeout simulation capabilities
Feedback mechanism from the Executor
Logging functionality

We can exclude the following for now:

Command-line interface
Agent communications API

The primary focus should be on the restart process itself, rather than the initial trigger.

Some open questions that came up during this research:

Command Execution Prioritization
- Should command execution be prioritized? For example, should a command like "restart agent" preempt other commands in the queue?
Restarting Specific Modules
- Should it be possible to restart a specific module? With a selective restart, the command interface would let users target specific modules through parameters or flags, for example: restart --module FIM.
Sequence of Restarting Modules
- Should there be a specific sequence, or can modules be restarted independently?
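
For the selective restart question, a command-line front end along these lines is one option. This is a hypothetical sketch: the `agent-cli` program name, the `restart` subcommand, and the repeatable `--module` flag are not an existing interface, and "all" mirrors the `data` value in the message above.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="agent-cli")
    sub = parser.add_subparsers(dest="command", required=True)
    restart = sub.add_parser("restart", help="restart the agent or selected modules")
    restart.add_argument(
        "--module",
        action="append",  # the flag may be repeated to target several modules
        default=None,
        metavar="NAME",
        help="module to restart (for example FIM); omit to restart everything",
    )
    return parser.parse_args(argv)


def restart_targets(ns: argparse.Namespace) -> List[str]:
    """Return the selected modules, or ["all"] when no module was given."""
    return ns.module if ns.module else ["all"]
```

With this shape, `restart --module FIM` yields `["FIM"]`, while a bare `restart` falls back to restarting everything.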

vikman90 · 2024-08-13T09:30:08Z

@lchico Thanks for the design proposal. I just want to review the Shutdown operation. Sometimes, modules take too long to stop. That's why they need a timeout on Shutdown too.

GJ!

lchico · 2024-08-14T00:49:15Z

@vikman90 Yes, the diagram wasn't very clear, so I've made some updates.

Following up on our previous discussion with @MarcelKemp :

Command Execution Prioritization: This will not be implemented at this time.
Restarting Specific Modules: Yes, we should be able to restart specific modules.
Restart Sequence: A specific sequence is not required for now.

MarcelKemp · 2024-08-14T12:57:56Z

LGTM!

lchico · 2025-01-14T15:03:08Z

Update

I am reviewing possible implementations.
I am working on forking the process once the restart command is applied.

2025-01-15

I tested this, and the following implementation looks feasible:

flowchart TD
    A["Receive Restart Command"] --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent:<br>Report Restart"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]
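
The first two decisions in this diagram can be sketched as follows. It is only an approximation of the idea, not the code in the PR: the detection relies on the assumption that systemd exports `INVOCATION_ID` to the services it starts, the `wazuh-agent` unit name is assumed, and `stop_and_start` is a hypothetical callback standing in for the real stop/start logic.

```python
import os
from typing import Callable, List, Mapping


def identify_run_method(env: Mapping[str, str] = os.environ) -> str:
    """Return "systemd" when the process looks systemd-managed, else "manual"."""
    return "systemd" if env.get("INVOCATION_ID") else "manual"


def systemd_restart_argv(unit: str = "wazuh-agent") -> List[str]:
    # Assumed unit name; the caller would pass this list to subprocess.run.
    return ["systemctl", "restart", unit]


def fork_restart(stop_and_start: Callable[[], None]) -> int:
    """Manual path: fork, let the child drive the restart, return the child's PID."""
    pid = os.fork()
    if pid == 0:
        # Child: stop the agent, then start it again.
        try:
            stop_and_start()
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Parent: free to report that the restart was requested.
    return pid
```

Keeping the detection separate from the two restart paths makes each branch easy to exercise on its own in unit tests.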

2025-01-16

I improved the code and tried to review the possible race conditions that could happen during the fork and the restart with systemd.
I also pushed the PR where I am testing.

lchico · 2025-01-17T22:13:13Z

Update

Reviewed the race condition with the team and explored alternative solutions. We could potentially skip the progress report that other commands send, and simply send the final result, as I am currently doing:

flowchart TD
    A["Receive Restart Command"] --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent: Continue<br>Agent execution"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]

2025-01-20

I was able to implement the report for the restart, avoiding the race condition. The diagram will now look like this:

flowchart TD
    A["Receive Restart Command"] --> Z["Report Restart<br>in Progress"]
    Z --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent: Continue<br>Agent execution"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]
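
The "Timeout Stop 30s" branch of the diagram amounts to a graceful stop with a hard fallback. A minimal sketch, assuming the process to stop is available as a `subprocess.Popen` handle (in the systemd path this behavior would come from the unit's stop timeout instead); `stop_with_timeout` is a hypothetical helper name.

```python
import logging
import subprocess

log = logging.getLogger("restart")


def stop_with_timeout(proc: subprocess.Popen, timeout_s: float = 30.0) -> str:
    """Ask `proc` to stop; kill it if it is still alive after `timeout_s`."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        # Kill Service & Log
        log.error("stop timed out after %.0fs, killing pid %d", timeout_s, proc.pid)
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

Either return value leads to the same next step, starting the agent again; the distinction only matters for logging and for the final report.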

Based on the previous diagram, I updated the PR: Add support for agent self-restarting on Linux #386 and fixed all the checks.

lchico self-assigned this Aug 5, 2024

lchico added the level/task, type/enhancement, and module/agent labels Aug 5, 2024

wazuhci added this to XDR+SIEM/Release 5.0.0 Aug 5, 2024

wazuhci moved this to In progress in XDR+SIEM/Release 5.0.0 Aug 5, 2024

jotacarma90 mentioned this issue Aug 6, 2024

Add support for agent self-restarting #54

Closed

5 tasks

wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Aug 7, 2024

wazuhci moved this from Pending review to In progress in XDR+SIEM/Release 5.0.0 Aug 7, 2024

vikman90 added the phase/mvp label Aug 8, 2024

wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Aug 9, 2024

lchico mentioned this issue Aug 12, 2024

Add support for agent self-restarting for Linux #77

Closed

vikman90 added the spike label and removed the phase/mvp label Aug 13, 2024

MarcelKemp closed this as completed Aug 14, 2024

wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Aug 14, 2024

lchico reopened this Jan 14, 2025

wazuhci moved this from Done to In progress in XDR+SIEM/Release 5.0.0 Jan 14, 2025

lchico linked a pull request Jan 17, 2025 that will close this issue

Fixes status report #508

Closed

28 tasks

wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Jan 21, 2025

MarcelKemp closed this as completed Jan 27, 2025

wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Jan 27, 2025

MarcelKemp changed the title ~~Add support for agent self-restarting: Design Phase~~ Add support for agent self-restarting for Linux: Design Jan 28, 2025
