Add support for agent self-restarting for Linux: Design #63

Closed
Tracked by #54
lchico opened this issue Aug 5, 2024 · 9 comments

lchico commented Aug 5, 2024

Parent issue:

Description

Outline the design and workflow for the restart command.

Functional requirements

  • The agent should accept a command to initiate a restart.
  • The agent should perform a clean restart, ensuring all processes are correctly shut down and restarted.
  • Log the restart event, including the timestamp and initiating command details.

Non-functional requirements

  • The restart process should not exceed a predefined time limit (e.g., 30 seconds) to minimize downtime.

Implementation restrictions

  • Start from the Spike issue: Agent command manager #4
  • Ensure that the implementation is thoroughly tested with unit tests and integration tests.
@lchico lchico self-assigned this Aug 5, 2024
@lchico lchico added level/task Task issue type/enhancement Enhancement issue module/agent labels Aug 5, 2024
@wazuhci wazuhci moved this to In progress in XDR+SIEM/Release 5.0.0 Aug 5, 2024

lchico commented Aug 5, 2024

Update 2024-08-05

  • I have been gathering context and reading the parent issues.
  • I reviewed the current code and everything else that might be related to the topic.


lchico commented Aug 7, 2024

Update

Confirm Module Restarts: This step involves checking the status of all restarted modules to ensure they are running as expected.
Shutdown Module: Wait for a graceful shutdown. If the module doesn't respond within the timeout, optionally force-terminate it (still an open question).
Module Ready: Check if the module is ready to be shut down (e.g., no ongoing operations).
Handle Timeout Error: Log an error indicating the module failed to shut down gracefully. Handle the error (e.g., retry, escalate).
All Modules Shutdown: Check if all modules have been successfully shut down.
Restart Modules: Start the modules in the desired order (e.g., stateless first). Monitor module startup for errors.
Log Restart Event: If all modules are confirmed restarted, log the successful event.

Flowchart on mermaidchart
flowchart TD
    A[Receive Restart Command] --> B[Send Terminate Signals]
    B --> C[Identify Modules]
    C --> D[Loop through Modules]
    D --> E{Module Ready?}
    E -- Yes --> F[Shutdown Module]
    E -- No --> G[Check Timeout]
    G -- Yes --> H[Handle Timeout Error]
    G -- No --> D
    F --> D
    H --> D
    D --> I[All Modules Shutdown?]
    I -- Yes --> J[Restart Modules]
    I -- No --> D
    J --> K[Confirm Module Restarts]
    K --> L[Log Restart Event]

restart

Note:
Stateful Modules: Should implement mechanisms to save their state before responding positively to the "Ready for Shutdown?" check. They should also be prepared to restore state upon restart.
Stateless Modules: Can typically respond positively to the "Ready for Shutdown?" check immediately.
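
A minimal Python sketch of the shutdown-and-restart loop described above. Everything named here (`MockModule`, `ready_after`, `restart_modules`) is hypothetical and not part of the agent code, and force-terminating a module once the deadline passes is only one of the options left open in the steps above.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MockModule:
    """Hypothetical stand-in for an agent module (not a real agent API)."""
    name: str
    stateful: bool = False
    ready_after: float = 0.0  # simulated time a stateful module needs to save state
    running: bool = True
    _asked_at: Optional[float] = field(default=None, repr=False)

    def ready_for_shutdown(self) -> bool:
        # Stateless modules answer yes immediately; stateful ones only after
        # `ready_after` seconds have passed since the first check.
        now = time.monotonic()
        if self._asked_at is None:
            self._asked_at = now
        return (not self.stateful) or (now - self._asked_at >= self.ready_after)

    def shutdown(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True
        self._asked_at = None


def restart_modules(modules, timeout=30.0, poll=0.01):
    """Shut every module down within `timeout` seconds, then start them again.

    Returns the names of the modules that missed the deadline.
    """
    deadline = time.monotonic() + timeout
    pending = list(modules)
    timed_out = []
    while pending:
        for module in list(pending):
            if module.ready_for_shutdown():
                module.shutdown()
                pending.remove(module)
            elif time.monotonic() >= deadline:
                # Handle Timeout Error: record it, then force the shutdown
                # (assumption: forcing is acceptable).
                timed_out.append(module.name)
                module.shutdown()
                pending.remove(module)
        if pending:
            time.sleep(poll)
    # Restart Modules: stateless first, as in the example order above.
    for module in sorted(modules, key=lambda m: m.stateful):
        module.start()
    return timed_out
```

A caller would log the returned names as the timeout errors and then confirm that every module reports `running` before logging the restart event.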

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Aug 7, 2024

lchico commented Aug 7, 2024

Update

After syncing with the team, it looks like I can continue with the next step for the parent issue. I was able to set up a Docker environment to work in. However, after reviewing some other issues, I realized that I still need to gather more information.

Component diagram

image
Get it from #2

Stateful module: use Agent comms API

Screenshot from 2024-08-07 18-18-08
Get it from #1 (comment)
Note: The currently identified stateful modules are: FIM, Inventory, and SCA.

Command manager

image
Get it from: #4 (comment)

@wazuhci wazuhci moved this from Pending review to In progress in XDR+SIEM/Release 5.0.0 Aug 7, 2024

lchico commented Aug 9, 2024

Update

Based on the previous issues and images, I propose creating a CLI (Command Line Interface) that connects with the Client on the Agent Comms API to send commands to each module. This approach allows us to reuse the existing API communications without generating new code.

agent

Flowchart on mermaidchart
flowchart TD
 subgraph COMMS_API["Agent comms API"]
        C["Client"]

  end

  subgraph COMMAND_MANAGER["Command Manager"]
        D["Command Receiver"]
  end


 subgraph Agent["Agent"]
        B["Cli -<br> _command line interface_"]
        COMMS_API
        COMMAND_MANAGER

  end
    B --> COMMS_API
    C --> COMMAND_MANAGER

For the next step, the MVP, we can reuse the code implemented here (issue: Agent Command Manager #4).

Taking into account the type of message that should be generated by the Client to the Command Manager:

{
  "command": {
    "name": "local_agent",
    "type": "restart-agent"
  },
  "origin": {
    "moduleName": "client",
    "serverName": ""
  },
  "parameters": {
    "data": "all",
    "error": 0,
    "extra_args": [],
    "status": "pending"
  }
}
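
As a rough illustration, the message above could be built and checked like this in Python. The helper names are hypothetical, and treating `parameters.data` as the restart target is an assumption based only on the `"all"` value in the example.

```python
import json


def build_restart_command(target: str = "all") -> dict:
    """Build a restart message shaped like the JSON example above."""
    return {
        "command": {"name": "local_agent", "type": "restart-agent"},
        "origin": {"moduleName": "client", "serverName": ""},
        "parameters": {
            "data": target,  # assumption: "data" carries the restart target
            "error": 0,
            "extra_args": [],
            "status": "pending",
        },
    }


# Sections and fields taken from the example message.
REQUIRED_FIELDS = {
    "command": {"name", "type"},
    "origin": {"moduleName", "serverName"},
    "parameters": {"data", "error", "extra_args", "status"},
}


def is_well_formed(message: dict) -> bool:
    """Check that every section and field from the example is present."""
    return all(
        isinstance(message.get(section), dict)
        and fields.issubset(message[section])
        for section, fields in REQUIRED_FIELDS.items()
    )


# Serialized form that the Client would hand to the Command Manager.
payload = json.dumps(build_restart_command())
```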

Some essential features for this MVP include:

  • Mock modules with timeout simulation capabilities
  • Feedback mechanism from the Executor
  • Logging functionality

We can exclude the following for now:

  • Command-line interface
  • Agent communications API

The primary focus should be on the restart process itself, rather than the initial trigger.

Some suggestions that came to my mind after this research:

  • Command Execution Prioritization

    • Should command execution be prioritized? For example, should a command like "restart agent" preempt other commands in the queue?
  • Restarting Specific Modules

    • Should we consider allowing a restart of a specific module? Selective restart: the command interface could let users specify which module to restart, using parameters or flags to target it. For example, restart --module FIM.
  • Sequence of Restarting Modules

    • Should there be a specific sequence, or can modules be restarted independently?
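
To make the selective-restart suggestion concrete, here is a small Python sketch of how `restart --module FIM` could be parsed. The program name `agent-cli` and the function name are made up for illustration, and the module list only contains the stateful modules named earlier in this thread.

```python
import argparse

# Stateful modules identified earlier in this thread; a real list would be longer.
KNOWN_MODULES = ("FIM", "Inventory", "SCA")


def parse_restart_args(argv):
    """Return the modules to restart, or ["all"] when none is given."""
    parser = argparse.ArgumentParser(prog="agent-cli")  # hypothetical name
    subcommands = parser.add_subparsers(dest="command", required=True)
    restart = subcommands.add_parser("restart", help="restart the agent or some modules")
    restart.add_argument(
        "--module",
        action="append",        # allow --module to be repeated
        choices=KNOWN_MODULES,  # reject unknown module names
        help="module to restart; omit to restart everything",
    )
    args = parser.parse_args(argv)
    return args.module or ["all"]
```

With this shape, `restart` alone maps to the `"all"` target used in the message above, and each `--module` flag narrows the restart to the named module.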

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Aug 9, 2024
@vikman90 vikman90 added spike Spike and removed phase/mvp labels Aug 13, 2024
@vikman90

@lchico Thanks for the design proposal. I just want to review the Shutdown operation. Sometimes, modules take too long to stop. That's why they need a timeout on Shutdown too.

GJ!


lchico commented Aug 14, 2024

@vikman90 Yes, the diagram wasn't very clear, so I've made some updates.

agent

Following up on our previous discussion with @MarcelKemp :

Command Execution Prioritization: This will not be implemented at this time.
Restarting Specific Modules: Yes, we should be able to restart specific modules.
Restart Sequence: A specific sequence is not required for now.

@MarcelKemp

LGTM!

@wazuhci wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Aug 14, 2024

lchico commented Jan 14, 2025

Update

  • I am reviewing possible implementations.
  • I am working on forking the process once the restart command is applied.

2025-01-15

  • Tested and it looks like this implementation is possible:
flowchart TD
    A["Receive Restart Command"] --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent:<br>Report Restart"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]
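
A hedged Python sketch of two pieces of this flow: identifying the run method and stopping a process with a timeout. It assumes the systemd case can be detected through the `INVOCATION_ID` environment variable, which systemd sets for the services it starts; the actual check in the PR may differ. Under systemd the 30-second stop limit would normally come from the unit's `TimeoutStopSec=` setting, so `stop_with_timeout` only illustrates the manual branch. Both function names are hypothetical.

```python
import os
import subprocess


def identify_run_method(environ=None) -> str:
    """Return "systemd" or "manual".

    Assumption: the presence of INVOCATION_ID means systemd started us.
    """
    environ = os.environ if environ is None else environ
    return "systemd" if environ.get("INVOCATION_ID") else "manual"


def stop_with_timeout(proc: subprocess.Popen, timeout: float = 30.0) -> str:
    """Ask `proc` to stop; kill it if it is still alive after `timeout` seconds."""
    proc.terminate()  # SIGTERM: request a graceful shutdown
    try:
        proc.wait(timeout=timeout)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the "Kill Service & Log" branch
        proc.wait()
        return "killed"
```

The returned string is what a caller would log before starting the agent again.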

2025-01-16

  • I improved the code and tried to review the possible race conditions that could happen during the fork and the restart with systemd.
  • I also pushed the PR where I am testing.

@lchico lchico reopened this Jan 14, 2025
@wazuhci wazuhci moved this from Done to In progress in XDR+SIEM/Release 5.0.0 Jan 14, 2025
@lchico lchico linked a pull request Jan 17, 2025 that will close this issue
28 tasks

lchico commented Jan 17, 2025

Update

  • Reviewed the race condition with the team and explored alternative solutions. We could potentially skip the progress report that we send for other commands and send only the final result, as I am currently doing.
flowchart TD
    A["Receive Restart Command"] --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent: Continue<br>Agent execution"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]

2025-01-20

  • I was able to implement the report for the restart, avoiding the race condition. The diagram will now look like this:
flowchart TD
    A["Receive Restart Command"] --> Z["Report Restart<br>in Progress"]
    Z --> B["Identify Run Method <br> - systemd <br> - Manual"]
    B --> K["Systemd"] & L["Manual"]
    K --> C["Restart Service <br>with systemd"]
    L --> M{"Fork"}
    M --> N["Parent: Continue<br>Agent execution"] & O["Child:<br>Stop Wazuh-Agent"]
    C --> D["Configuration: <br> - Timeout Stop 30s"]
    O --> D
    D -- Timeout --> F["Kill Service & Log"]
    F --> H["Start Wazuh-Agent"]
    D -- Shutdown Gracefully --> H
    H --> I["Confirm agent restarts"]
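
The ordering in this last diagram (report first, fork afterwards) can be sketched in Python as follows. This is one reading of the diagram rather than the code in the PR: `report` and `restart_agent` are hypothetical callables supplied by the caller, and the sketch only runs on POSIX systems where `os.fork` is available.

```python
import os


def restart_manual(report, restart_agent) -> int:
    """Report progress, then fork: the child restarts, the parent carries on.

    Returns the child's PID in the parent process.
    """
    # Reporting before the fork follows the diagram's ordering. The assumption
    # is that this keeps the report from racing with the child stopping the agent.
    report("restart in progress")
    pid = os.fork()
    if pid == 0:
        # Child: do the stop/start work, then exit immediately so it never
        # falls through into the parent's code path.
        status = 1
        try:
            restart_agent()
            status = 0
        finally:
            os._exit(status)
    return pid  # Parent: continue agent execution.
```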

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Jan 21, 2025
@wazuhci wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Jan 27, 2025
@MarcelKemp MarcelKemp changed the title Add support for agent self-restarting: Design Phase Add support for agent self-restarting for Linux: Design Jan 28, 2025