Add support for agent self-restarting for Linux #77

Closed
Tracked by #54
lchico opened this issue Aug 12, 2024 · 4 comments · Fixed by #386
@lchico
Member

lchico commented Aug 12, 2024

Parent issue:

Description

We can now start the implementation, based on the design phase in issue #63.

Functional requirements

  • The agent should accept a command to initiate a restart.
  • The agent should perform a clean restart, ensuring all processes are correctly shut down and restarted.
  • Log the restart event, including the timestamp and initiating command details.
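
As an illustration of the logging requirement, here is a minimal sketch of a helper that builds one log line with a UTC timestamp and the initiating command details. The function name `FormatRestartEvent` and the `command` and `origin` fields are assumptions made for this example, not the agent's actual log schema.

```cpp
#include <chrono>
#include <ctime>
#include <string>

// Hypothetical helper: builds a single log line for a restart event.
// `command` and `origin` are illustrative fields, not the agent's real schema.
std::string FormatRestartEvent(const std::string& command,
                               const std::string& origin,
                               std::chrono::system_clock::time_point when)
{
    const std::time_t t = std::chrono::system_clock::to_time_t(when);
    std::tm utc {};
    gmtime_r(&t, &utc); // POSIX, thread-safe variant of gmtime

    char stamp[32];
    std::strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%SZ", &utc);

    return std::string(stamp) + " restart requested: command=" + command + " origin=" + origin;
}
```

The resulting string could then be passed to whatever logging call the agent uses.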

Non-functional requirements

  • The restart process should not exceed a predefined time limit (e.g., 30 seconds) to minimize downtime.

Implementation restrictions

  • Start from the Spike issue: Agent command manager #4
  • Ensure that the implementation is thoroughly tested with unit tests and integration tests.
@lchico
Member Author

lchico commented Aug 12, 2024

Update

[2024-08-12] I started doing some research on how to implement it.
[2024-08-13] Review command manager implementation.
[2024-08-14] Putting this issue on hold to work on the #83 issue.

@lchico
Member Author

lchico commented Dec 4, 2024

2024-12-03

  • I have taken up the issue, gathered more information on recent changes, and started implementing a possible solution.

2024-12-04

  • I fixed some errors in my previous code. It still needs testing and further work.
  • Investigate why I encountered the following error:
[DEBUG] 1000: execute_process(curl --fail -L https://gitlab.gnome.org//GNOME/libxml2/-/archive/v2.11.7/libxml2-v2.11.7.tar.gz --create-dirs --output /build_wazuh/agent/wazuh-agent-5.0.0/src/vcpkg/downloads/GNOME-libxml2-v2.11.7.tar.gz.34390.part)
[DEBUG] 1000: cmd_execute_and_stream_data() returned 22 after  4244993 us
error: Missing GNOME-libxml2-v2.11.7.tar.gz and downloads are blocked by x-block-origin.
error: https://gitlab.gnome.org//GNOME/libxml2/-/archive/v2.11.7/libxml2-v2.11.7.tar.gz: curl failed to download with exit code 22
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

curl: (22) The requested URL returned error: 503 

[DEBUG] /source/src/vcpkg/base/downloads.cpp(1030): 
[DEBUG] Time in subprocesses: 4244993us
[DEBUG] Time in parsing JSON: 2us
[DEBUG] Time in JSON reader: 0us
[DEBUG] Time in filesystem: 749us
[DEBUG] Time in loading ports: 0us
[DEBUG] Exiting after 4.2 s (4245307us)

It looks like a temporary server-side issue (the download returned HTTP 503). I found a link that might provide some clues about what is happening, but no solution yet; for now I can only wait for the server to become available again.

  • Understand how the mock server and the agent should be configured to complete the enrollment.

2024-12-05

  • Implemented the RestartAgent function to handle agent self-restarting by integrating StopAgent and StartAgent.
  • Researched systemd integration using sd_notify for service readiness. Found that the NOTIFY_SOCKET environment variable must be defined before running the binary manually, which leaves the implementation path unclear.
  • Identified issues with PID file handling and began debugging.
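
To make the NOTIFY_SOCKET dependency concrete, here is a minimal sketch of the readiness notification without linking libsystemd. It assumes the documented notification protocol (a datagram sent to the Unix socket named by NOTIFY_SOCKET); the helper name `NotifySystemd` is hypothetical. When the variable is unset, as in a manual start, the helper simply reports that nothing was sent.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper: sends a state string such as "READY=1" to the socket
// named by NOTIFY_SOCKET. Returns false if the variable is unset or on error.
bool NotifySystemd(const std::string& state)
{
    const char* path = std::getenv("NOTIFY_SOCKET");
    if (path == nullptr || path[0] == '\0')
    {
        return false; // Started manually: there is nobody to notify.
    }

    sockaddr_un addr {};
    addr.sun_family = AF_UNIX;
    const std::size_t len = std::strlen(path);
    if (len >= sizeof(addr.sun_path))
    {
        return false;
    }
    std::memcpy(addr.sun_path, path, len);
    if (addr.sun_path[0] == '@')
    {
        addr.sun_path[0] = '\0'; // Leading '@' denotes an abstract socket.
    }

    const int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
    {
        return false;
    }
    const auto addrLen = static_cast<socklen_t>(offsetof(sockaddr_un, sun_path) + len);
    const ssize_t sent = sendto(fd, state.data(), state.size(), 0,
                                reinterpret_cast<const sockaddr*>(&addr), addrLen);
    close(fd);
    return sent == static_cast<ssize_t>(state.size());
}
```

With a helper like this, the same binary can run under a `Type=notify` unit or from a shell, because the manual case degrades to a no-op.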

2024-12-06

  • Resolved PID file write issues and verified format compatibility for systemd.
  • Enhanced error handling in RestartAgent for clean shutdowns.
  • Tested restart functionality in various scenarios, including manual and systemd-controlled restarts.

2024-12-09

  • Improved the code by adding extra functionality to the lockfilehandler.
  • Implemented a timeout for the restart process.
  • Updated the postrm script to stop the service.
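
As a rough illustration of the PID file and lock file work described above, here is a minimal single-instance guard built on `flock`. The class name `PidLock` is hypothetical and this is not the project's lock file handler; it only shows the general technique of holding an exclusive lock while writing the PID as decimal text followed by a newline.

```cpp
#include <fcntl.h>
#include <string>
#include <sys/file.h>
#include <unistd.h>

// Hypothetical single-instance guard; not the project's actual lock file handler.
class PidLock
{
public:
    explicit PidLock(const std::string& path)
    {
        // O_CLOEXEC keeps the descriptor (and the lock) from leaking across exec.
        m_fd = open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0644);
        if (m_fd < 0)
        {
            return;
        }
        // Non-blocking exclusive lock: fails if another instance already holds it.
        if (flock(m_fd, LOCK_EX | LOCK_NB) != 0)
        {
            Release();
            return;
        }
        // Write the PID as decimal text plus a newline.
        const std::string pid = std::to_string(getpid()) + "\n";
        if (ftruncate(m_fd, 0) != 0 ||
            write(m_fd, pid.data(), pid.size()) != static_cast<ssize_t>(pid.size()))
        {
            Release();
        }
    }

    ~PidLock() { Release(); } // Closing the descriptor releases the lock.

    PidLock(const PidLock&) = delete;
    PidLock& operator=(const PidLock&) = delete;

    bool Acquired() const { return m_fd >= 0; }

private:
    void Release()
    {
        if (m_fd >= 0)
        {
            close(m_fd);
            m_fd = -1;
        }
    }

    int m_fd = -1;
};
```

The choice of `O_CLOEXEC` matters for a self-restart based on `execl`: without it, the new program image would inherit the locked descriptor and its own attempt to take the lock on a fresh descriptor would fail.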

2024-12-10

  • Review the implementation and research the command handler and communicator.

2024-12-11

  • Researched self-restart options, testing fork and execl approaches.
  • Evaluated the effectiveness of these methods for improving application reliability.

@lchico
Member Author

lchico commented Dec 13, 2024

2024-12-12

  • Analyzed the following possible implementations.
Option 1: fork(), execl() and setpgid()
void RestartAgent(const std::string& configFile, const char* programPath) {
    LogInfo("Restart: Stopping wazuh-agent.");
    StopAgent(configFile);
    int timeoutSeconds = 20;
    auto startTime = std::chrono::steady_clock::now();

    while ( "stopped" != unix_daemon::GetDaemonStatus(configFile) ) {
        auto elapsed = std::chrono::steady_clock::now() - startTime;
        if (std::chrono::duration_cast<std::chrono::seconds>(elapsed).count() > timeoutSeconds) {
            LogError("Timeout reached while stopping wazuh-agent.");
            return ;
        }
        LogInfo("Waiting for wazuh-agent to stop...");
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    LogInfo("Restart: starting wazuh agent.");
    pid_t pid = fork();
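    if (pid == 0) {
        // Child process: join its own process group, then replace the image.
        setpgid(0, 0);
        execl(programPath, programPath, nullptr);
        // execl only returns on failure; _exit skips the parent's atexit handlers.
        _exit(1);
    } else if (pid < 0) {
        LogError("Fork failed");
        exit(1);
    } else {
        // Parent process: set the group here as well, in case the child has not run yet.
        setpgid(pid, pid);
        exit(0);
    }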
}
Option 2: Use execl()
void RestartAgent(const std::string& configFile, const char* programPath) {

    LogInfo("Restart: Stopping wazuh-agent.");
    StopAgent(configFile); // The parent process handles the stop signal
    int timeoutSeconds = 20;
    auto startTime = std::chrono::steady_clock::now();

    LogInfo("Waiting for wazuh-agent to stop...");
    while ( "stopped" != unix_daemon::GetDaemonStatus(configFile) ) {
        auto elapsed = std::chrono::steady_clock::now() - startTime;
        if (std::chrono::duration_cast<std::chrono::seconds>(elapsed).count() > timeoutSeconds) {
            LogError("Restart: Timeout reached while stopping wazuh-agent.");
            return ;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    LogInfo("Restart: starting wazuh agent.");
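    execl(programPath, programPath, nullptr);
    // execl only returns on failure.
    LogError("Restart: execl failed; wazuh-agent was not restarted.");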
}
Option 3: Keep the same process and call StartAgent again
void RestartAgent(const std::string& configFile)
{
    LogInfo("Restart: Stopping wazuh-agent.");
    StopAgent(configFile);
    int timeoutSeconds = 20;
    auto startTime = std::chrono::steady_clock::now();

    while ( "stopped" != unix_daemon::GetDaemonStatus(configFile) ) {
        auto elapsed = std::chrono::steady_clock::now() - startTime;
        // Check elapsed time
        if (std::chrono::duration_cast<std::chrono::seconds>(elapsed).count() > timeoutSeconds) {
            LogError("Timeout reached while stopping wazuh-agent.");
            return ;
        }
        LogInfo("Waiting for wazuh-agent to stop...");
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    LogInfo("Restart: starting wazuh agent.");
    StartAgent(configFile);
}
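
All three options repeat the same polling loop with a 20-second limit. A minimal sketch of how that loop could be factored into one helper follows; the name `WaitUntil` and its signature are assumptions for illustration, not existing agent code.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `done` every `interval` until it returns true
// or `timeout` elapses. Returns true if the condition was met in time.
bool WaitUntil(const std::function<bool()>& done,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done())
    {
        if (std::chrono::steady_clock::now() >= deadline)
        {
            return false;
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}

// Possible use inside RestartAgent, with the names from the options above:
//   if (!WaitUntil([&] { return unix_daemon::GetDaemonStatus(configFile) == "stopped"; },
//                  std::chrono::seconds(20)))
//   {
//       LogError("Timeout reached while stopping wazuh-agent.");
//       return;
//   }
```

Using `steady_clock` for the deadline, as the original loops do, keeps the timeout unaffected by wall-clock adjustments.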

2024-12-13

  • Reviewed the new command structure definition implemented in the following PR
  • Added "restart" as a valid command and conducted some tests.
  • Implemented a potential solution, but further work is still needed.

2024-12-16

  • Self-restart is working, but multiple instances appear whether the process runs under systemd or is started manually. Investigating the cause is still pending.
  • Fix the uninstall failure when the service is running. I encountered the following error:
apt remove wazuh-agent
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following package was automatically installed and is no longer required:
  lsb-release
Use 'apt autoremove' to remove it.
The following packages will be REMOVED:
  wazuh-agent
0 upgraded, 0 newly installed, 1 to remove and 0 not upgraded.
After this operation, 15.5 MB disk space will be freed.
Do you want to continue? [Y/n] 
(Reading database ... 9917 files and directories currently installed.)
Removing wazuh-agent (5.0.0-0) ...
Call pid 4119 with sigterm to stop the service
/var/lib/dpkg/info/wazuh-agent.prerm: 43: kill: Illegal option -S
dpkg: error processing package wazuh-agent (--remove):
 installed wazuh-agent package pre-removal script subprocess returned error exit status 2
dpkg: too many errors, stopping
Errors were encountered while processing:
 wazuh-agent
Processing was halted because there were too many errors.
E: Sub-process /usr/bin/dpkg returned an error code (1)	

2024-12-19

  • Thanks @jr0me, your input helped me improve the implementation. As a result, I had to change how it was structured. Previously, I handled the self-restart as part of the agent, but now it is more independent.

2024-12-20

  • After reviewing the issue with @jr0me and analyzing the possible solutions, we found that self-restart cannot work reliably without an external monitor that can kill the agent process if it becomes blocked during the restart. This covers the case where the agent cannot restart because some tasks did not finish in time, resulting in a timeout.

The issue could look something like this:

%%{init: {'themeVariables': {'maxWidth': '300px'}}}%%
graph TD
    A[Agent] --> B[Task Manager]
    B --> D[Module Manager]
    B --> E[Communicator]
    B --> F[Command Handler]
    F --> C[Self-Restart]
    C -->|Fails| G[Blocked Process <br/> Unable to Respond]
    G --> H[Unable to Kill the Process, <br/> Losing Control of the Process]

A possible solution would be to implement something like this:

%%{init: {'themeVariables': {'maxWidth': '300px'}}}%%
graph TD
    A[Agent] --> B[Task Manager]
    B --> D[Module Manager]
    B --> E[Communicator]
    B --> F[Command Handler]
    F --> C[Self-Restart]
    
    subgraph Monitoring
        M[Monitoring Process]
        T[Timeout - 30 seconds]
        K[Kill Blocked Process]
        N[Start New Agent]
    end
    
    C -->|Fails| G[Blocked Process <br/> Unable to Respond]
    G --> H[Unable to Kill the Process, <br/> Losing Control of the Process]
    
    C -->|Sent to Monitor| M
    M --> T
    T -->|If >30s| K
    K -->|Kills| A
    K --> N
    N --> B[Task Manager]
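
The monitoring step in the diagram can be sketched roughly as follows. This is a minimal illustration, assuming the monitor is the parent of the agent process (so it can reap it with `waitpid`); `StopWithDeadline` and `StopResult` are hypothetical names and this is not the code from the PR.

```cpp
#include <chrono>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

enum class StopResult { Stopped, Killed, Error };

// Hypothetical monitor step: ask the child `pid` to stop with SIGTERM, wait up to
// `timeout` for it to exit, and fall back to SIGKILL if it is still running.
// Assumes `pid` is a child of the calling process.
StopResult StopWithDeadline(pid_t pid, std::chrono::milliseconds timeout)
{
    if (kill(pid, SIGTERM) != 0)
    {
        return StopResult::Error;
    }

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;)
    {
        const pid_t reaped = waitpid(pid, &status, WNOHANG);
        if (reaped == pid)
        {
            return StopResult::Stopped; // Exited before the deadline.
        }
        if (reaped < 0)
        {
            return StopResult::Error; // Not our child, or already reaped.
        }
        if (std::chrono::steady_clock::now() >= deadline)
        {
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }

    // The process is blocked: SIGKILL cannot be caught or ignored.
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return StopResult::Killed;
}
```

With the 30-second limit from the diagram, the monitor would call `StopWithDeadline(agentPid, std::chrono::seconds(30))` and start the new agent once either result confirms the old process is gone.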

2024-12-23

  • I successfully implemented the solution based on the previous diagram. The monitoring and self-restart features are working as expected.
  • Updated the PR description and outlined the pending steps.

2025-01-13

  • Reviewed the issue, rebased, fixed conflicts, and updated the PR with the changes introduced during the rebase.
  • Investigated and fixed the issue with the Python mock server. Thanks to @Nicogp and @TomasTurina for helping me with this.
  • I fixed some failing checks, but some updates to the signal handler code are still pending.

@wazuhci wazuhci moved this from In progress to Blocked in XDR+SIEM/Release 5.0.0 Dec 18, 2024
@wazuhci wazuhci moved this from Blocked to In progress in XDR+SIEM/Release 5.0.0 Dec 20, 2024
@wazuhci wazuhci moved this from In progress to On hold in XDR+SIEM/Release 5.0.0 Dec 24, 2024
@wazuhci wazuhci moved this from On hold to In progress in XDR+SIEM/Release 5.0.0 Jan 13, 2025
@wazuhci wazuhci moved this from In progress to Blocked in XDR+SIEM/Release 5.0.0 Jan 14, 2025
@wazuhci wazuhci moved this from Blocked to In progress in XDR+SIEM/Release 5.0.0 Jan 21, 2025
@lchico
Member Author

lchico commented Jan 22, 2025

2025-01-21

  • Updated the code with the new design.
  • Updated the PR description, improved the code, and started testing it.

2025-01-22

  • Found an issue when I tried to restart using systemd, but I was able to fix it.
  • Updated the PR and left it ready for review.

2025-01-23

  • I rebased the code, made some changes, and addressed the comments.

2025-01-24

  • I rebased the code again, made some changes, and addressed the comments.

2025-01-25

  • Addressed the comments. The PR is ready for another review.

@wazuhci wazuhci moved this from In progress to Pending review in XDR+SIEM/Release 5.0.0 Jan 22, 2025
@MarcelKemp MarcelKemp changed the title Add support for agent self-restarting: Development Phase Add support for agent self-restarting for Linux Jan 28, 2025
@vikman90 vikman90 linked a pull request Jan 29, 2025 that will close this issue
@wazuhci wazuhci moved this from Pending review to Done in XDR+SIEM/Release 5.0.0 Jan 29, 2025