
[Bug]: Cronicle shuts down randomly even though I am running simple Python scripts via the "Test Plugin" #862

iamumairayub opened this issue Feb 7, 2025 · 15 comments

@iamumairayub

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I have a simple Python script that makes a couple of requests to an API every 30 seconds.
It uses your "Test Plugin".
It is not a CPU/RAM-intensive script at all.

I have created 15 users in Cronicle.
They all run their own job, and have Cronicle open in their browser to see live output.
These scripts are supposed to run forever.
But roughly every 12 hours, Cronicle crashes.
As I said, there are no CPU/RAM spikes on the server, since this is the simplest of scripts.

Is it because we have too many browsers open to view logs?

Or what else could be the reason?

OS: Ubuntu 20.04.6 LTS
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
RAM: 32 GB

My Cronicle service file:

[Unit]
Description=Node Cronicle
After=network.target

[Service]
Type=forking
ExecStart=/opt/cronicle/bin/control.sh start
ExecStop=/opt/cronicle/bin/control.sh stop

Environment="CRONICLE_echo=0"
Environment="CRONICLE_foreground=0"

# Restart policy (Retry up to 3 times with a 5s delay)
Restart=on-failure
RestartSec=5s
StartLimitIntervalSec=30
StartLimitBurst=3

[Install]
WantedBy=multi-user.target

PS:

I have another server that runs about 150 concurrent Python scripts using the "Shell Plugin".
It never crashes.
It is a much smaller server than the one mentioned above.
One difference is that nobody keeps a browser open to watch those logs.
I just open it once a day to see if any job failed.

Node.js Version

v22.13.1

Cronicle Version

0.9.72

Server Setup

Single Server

Storage Setup

Local Filesystem

Relevant log output

Feb 06 11:45:13 selenoid control.sh[242790]: [1738860313.739][2025-02-06 11:45:13][selenoid][242790][WebServer][debug][9][Sending compressed HTTP response with gzip: 200 OK][{"Content-Type":"application/json","Access-Control-Allow-Origin":"*","Server":"Cronicle 1.0","Content-Length":53,"Content-Encoding":"gzip"}]
Feb 06 11:45:13 selenoid control.sh[242790]: [1738860313.739][2025-02-06 11:45:13][selenoid][242790][WebServer][debug][9][Request complete][]
Feb 06 11:45:13 selenoid control.sh[242790]: [1738860313.739][2025-02-06 11:45:13][selenoid][242790][WebServer][debug][9][Response finished writing to socket][{"id":"r41"}]
Feb 06 11:45:13 selenoid control.sh[242790]: [1738860313.739][2025-02-06 11:45:13][selenoid][242790][WebServer][debug][9][Request performance metrics:][{"scale":1000,"perf":{"total":23.674,"queue":0.06,"read":0.492,"filter":0.663,"process":20.537,"encode":0.603,"write":0.695},"counters":{"bytes_in":1397,"bytes_out":200,"num_requests":1}}]
Feb 06 11:45:13 selenoid control.sh[242790]: [1738860313.739][2025-02-06 11:45:13][selenoid][242790][WebServer][debug][9][Keeping socket open for keep-alives: c40][]
Feb 06 11:45:14 selenoid systemd[1]: cronicle.service: start operation timed out. Terminating.
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.384][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Caught SIGTERM][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.384][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Shutting down][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.385][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][9][Deleting PID File: logs/cronicled.pid: 242790][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.385][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][3][Stopping component: Cronicle][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.385][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Shutting down Cronicle][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.385][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][4][Aborting local job: jm6tkj91a01: Shutting down server][{"params":{},"timeout":31536000,"catch_up":0,"queue_max":1000,"timezone":"Asia/Karachi","plugin":"pm5ks6l6o4v","category":"cm5ktjk9t5d","target":"allgrp","algo":"random","multiplex":0,"stagger":0,"retries":3,"retry_delay":3,"detached":0,"queue":0,"chain":"","chain_error":"","notify_success":"","notify_fail":"","web_hook":"","cpu_limit":0,"cpu_sustain":0,"memory_limit":0,"memory_sustain":0,"log_max_size":0,"notes":"","category_title":"9203325663236","group_title":"All Servers","plugin_title":"bot","now":1738860313,"source":"Manual (admin)","id":"jm6tkj91a01","time_start":1738860313.726,"hostname":"selenoid","event":"em5kuhu1j5f","event_title":"UMAIR","nice_target":"All Servers","command":"python -u /home/after_n.py","log_file":"/opt/cronicle/logs/jobs/jm6tkj91a01.log","pid":243349,"log_file_size":247,"special_shutdown":true}]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.386][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][9][Closing client socket: wRN9DcAIqI0a-1ZeAAAC][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.387][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][5][Socket.io client disconnected: wRN9DcAIqI0a-1ZeAAAC (IP: ::ffff:223.123.95.88)][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.387][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][9][Closing client socket: kxw7b07S0ZzDw9lwAAAG][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.387][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][5][Socket.io client disconnected: kxw7b07S0ZzDw9lwAAAG (IP: ::ffff:51.159.220.143)][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.387][2025-02-06 11:45:14][selenoid][242790][Filesystem][debug][9][Storing JSON Object: global/state][data/global/d3/36/e7/d336e75083d08c96b2e4bba57227c4e8.json]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.387][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Caught SIGTERM][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.388][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Shutting down][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.388][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][1][EMERGENCY: Shutting down immediately][]
Feb 06 11:45:14 selenoid systemd[1]: cronicle.service: Failed with result 'timeout'.
Feb 06 11:45:14 selenoid systemd[1]: Failed to start Node Cronicle.
Feb 06 11:45:19 selenoid systemd[1]: cronicle.service: Scheduled restart job, restart counter is at 1.
Feb 06 11:45:19 selenoid systemd[1]: Stopped Node Cronicle.
Feb 06 11:45:19 selenoid systemd[1]: Starting Node Cronicle...
Feb 06 11:45:19 selenoid control.sh[243387]: /opt/cronicle/bin/control.sh start: Starting up Cronicle Server...

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jhuckaby
Owner

jhuckaby commented Feb 7, 2025

Well Cronicle isn't "crashing". It's receiving a SIGTERM signal to shut down:

Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.384][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Caught SIGTERM][]
Feb 06 11:45:14 selenoid control.sh[242790]: [1738860314.384][2025-02-06 11:45:14][selenoid][242790][Cronicle][debug][2][Shutting down][]

So, it shouldn't have anything to do with the number of users or browsers you have open. Cronicle can handle thousands easily. There is some problem with your systemd configuration:

Feb 06 11:45:14 selenoid systemd[1]: cronicle.service: Failed with result 'timeout'.
Feb 06 11:45:14 selenoid systemd[1]: Failed to start Node Cronicle.
Feb 06 11:45:19 selenoid systemd[1]: cronicle.service: Scheduled restart job, restart counter is at 1.

This is the smoking gun right here:

Feb 06 11:45:14 selenoid systemd[1]: Failed to start Node Cronicle.

So, systemd is failing to "START" Cronicle, but you said it was up for 12 hours before this happened? So something is VERY wrong with systemd. It doesn't realize that Cronicle has started successfully and is running happily, so it shuts it down. It sounds like it's still WAITING for it to start, and eventually gives up after 12 hours.

I have no idea why systemd would do this. I think the problem MAY be in your environment variables:

Environment="CRONICLE_echo=0"
Environment="CRONICLE_foreground=0"

Environment variables are always interpreted as strings, so these values CANNOT be set to zero in this way. The reason is, the variable comes into Node.js as "0" (with quotes), which is a TRUE value. So Cronicle is ENABLING echo and foreground modes!!!
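To illustrate the general pitfall (a minimal Node.js sketch, not Cronicle's actual override code):

// Environment variables always arrive as strings, and any non-empty
// string, including "0", is truthy in JavaScript.
const foreground = process.env.CRONICLE_foreground; // the string "0"

if (foreground) {
  // This branch runs even though the intent was "disabled",
  // because Boolean("0") === true.
  console.log('foreground mode enabled');
}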

Since you have your systemd service type set to "forking", this is a big problem, because you have Cronicle launching itself in FOREGROUND mode, which will NOT fork.

Please try removing these environment variables entirely. They default to disabled anyway, so there is no need, and no way, to set them to "0".

@iamumairayub
Author

Thanks for your insight.

Btw, I had the same problem when I had the default Cronicle systemd config:

[Unit]
Description=Node Cronicle
After=network.target

[Service]
Type=forking
ExecStart=/opt/cronicle/bin/control.sh start
ExecStop=/opt/cronicle/bin/control.sh stop

[Install]
WantedBy=multi-user.target

Any idea why it would happen with this systemd config?
It would crash, and then I had to start it manually.

@jhuckaby
Owner

jhuckaby commented Feb 7, 2025

I think you have those environment variables set somewhere else (like in the container host or container config, or even /etc/environment), and they're being passed down to the process. It's behaving exactly as if CRONICLE_foreground and CRONICLE_echo are both set to true somewhere.

Try grepping the Cronicle log for "Applying env config override:" and see what's happening on startup:

grep 'Applying env config override:' /opt/cronicle/logs/Cronicle.log

@iamumairayub
Author

The above command returns nothing, maybe because the log only contains output from the currently running Cronicle instance, not all the previous ones.

also

root@selenoid:~# echo $CRONICLE_foreground

root@selenoid:~# echo $CRONICLE_echo

Both give empty output.

Any idea how I should catch this issue when it crashes next time?

Any logs that I should enable?

@jhuckaby
Owner

jhuckaby commented Feb 7, 2025

Restart the service (or reboot your server, so systemd starts Cronicle), THEN grep the log. The key is to find out what happens at startup. Check if Cronicle is forking, or not.

[1738906506.968][2025-02-06 21:35:06][m2ni.local][34982][Cronicle][debug][2][Spawning background daemon process (PID 34982 will exit)][["/usr/local/bin/node","/opt/cronicle/lib/main.js"]]

If you see this, it means that Cronicle is starting up as a forking daemon (which is what you want). If you DON'T see this, then somehow foreground is being set to true somewhere. Environment variable, config setting, CLI argument, something.

Also look for this after a reboot:

grep 'Applying env config override:' /opt/cronicle/logs/Cronicle.log

@iamumairayub
Author

iamumairayub commented Feb 7, 2025

Ok, I restarted the entire server.

grep 'Applying env config override:' /opt/cronicle/logs/Cronicle.log

empty results

grep 'env' /opt/cronicle/logs/Cronicle.log

gives https://pastebin.com/0yNh81NX

even though no job is running at this moment.
In this log, I can see both of those ENV variables are set to 0.

I can see this in /opt/cronicle/logs/Cronicle.log:

[1738911316.322][2025-02-07 01:55:16][selenoid][819][Cronicle][debug][2][Spawning background daemon process (PID 819 will exit)][["/usr/bin/node","/opt/cronicle/lib/main.js"]]

@mikeTWC1984

@iamumairayub
do you see anything suspicious in status/journal?

systemctl status cronicle

sudo journalctl -u cronicle

If you stop Cronicle from the UI, will systemd restart it?

@iamumairayub
Author

Now that we have CRONICLE_echo=0, there is nothing helpful there.
The last message I see is "Started Node Cronicle."

@iamumairayub
Author

@jhuckaby Ok, so it crashed again.

root@selenoid:~# systemctl status cronicle
● cronicle.service - Node Cronicle
     Loaded: loaded (/etc/systemd/system/cronicle.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Sat 2025-02-08 12:36:03 EST; 9h ago
    Process: 809 ExecStart=/opt/cronicle/bin/control.sh start (code=exited, status=0/SUCCESS)
   Main PID: 1089 (code=dumped, signal=ABRT)

Feb 07 01:55:15 selenoid systemd[1]: Starting Node Cronicle...
Feb 07 01:55:15 selenoid control.sh[809]: /opt/cronicle/bin/control.sh start: Starting up Cronicle Server...
Feb 07 01:55:16 selenoid control.sh[809]: /opt/cronicle/bin/control.sh start: Cronicle Server started
Feb 07 01:55:16 selenoid systemd[1]: Started Node Cronicle.
Feb 08 12:36:03 selenoid systemd[1]: cronicle.service: Main process exited, code=dumped, status=6/ABRT
Feb 08 12:36:03 selenoid systemd[1]: cronicle.service: Failed with result 'core-dump'.

Last few lines from cat /opt/cronicle/logs/Cronicle.log:

[1739036106.208][2025-02-08 12:35:06][selenoid][1089][Cronicle][debug][5][Job completed with error][{"params":{},"timeout":31536000,"catch_up":0,"queue_max":1000,"timezone":"Asia/Karachi","plugin":"pm5ks6l6o4v","category":"cm6rwwq6t10","target":"allgrp","algo":"random","multiplex":0,"stagger":0,"retries":0,"retry_delay":3,"detached":0,"queue":0,"chain":"","chain_error":"","notify_success":"","notify_fail":"","web_hook":"","cpu_limit":0,"cpu_sustain":0,"memory_limit":0,"memory_sustain":0,"log_max_size":0,"notes":"","category_title":"80305578946","group_title":"All Servers","plugin_title":"bot","now":1739035972,"source":"Manual (test.naeem)","id":"jm6wh48qw26","time_start":1739035973.192,"hostname":"selenoid","event":"em6rwyo1011","event_title":"test.naeem UID: 506373","nice_target":"All Servers","command":"python -u /home/test_bot/big_small_after_n.py","log_file":"/opt/cronicle/logs/jobs/jm6wh48qw26.log","pid":514332,"cpu":{"min":0.5,"max":7.1,"total":22.7,"count":13,"current":0.5},"mem":{"min":79503360,"max":79503360,"total":1033543680,"count":13,"current":79503360},"log_file_size":1906,"abort_reason":"Manually aborted by user: naeem","code":1,"description":"Job Aborted: Manually aborted by user: test.naeem","unknown":1,"complete":1}]
[1739036106.208][2025-02-08 12:35:06][selenoid][1089][Cronicle][debug][6][Storing job log: /opt/cronicle/logs/jobs/jm6wh48qw26.log: jobs/jm6wh48qw26/log.txt.gz][]
[1739036106.284][2025-02-08 12:35:06][selenoid][1089][Cronicle][debug][9][Job log stored successfully: jobs/jm6wh48qw26/log.txt.gz][]
[1739036106.284][2025-02-08 12:35:06][selenoid][1089][Cronicle][debug][9][Deleting local file: /opt/cronicle/logs/jobs/jm6wh48qw26.log][]
[1739036106.321][2025-02-08 12:35:06][selenoid][1089][Cronicle][debug][9][Successfully deleted local job log file: /opt/cronicle/logs/jobs/jm6wh48qw26.log][]
[1739036160.044][2025-02-08 12:36:00][selenoid][1089][Cronicle][debug][4][Scheduler Minute Tick: Advancing time up to: 2025/02/08 12:36:00][]


@jhuckaby
Owner

jhuckaby commented Feb 9, 2025

Huh, I've never seen a core dump before. How strange. Is there a crash log?

/opt/cronicle/logs/crash.log

@iamumairayub
Author

iamumairayub commented Feb 9, 2025

@jhuckaby /opt/cronicle/logs/crash.log is empty

I am using the default /opt/cronicle/conf/config.json that comes with the Cronicle installation.

This is the only Cronicle server that is crashing.

I have Cronicle on other servers as well; they never crash, despite having 100+ jobs.

The only difference between those servers and this one is that this one is accessed by 20 different users, who all keep their browsers open to monitor their event logs.
And on this server, all users have a job that uses your "Test Plugin" to run a Python script.
None of my other servers use this plugin at all; they all use the "Shell Plugin".

@jhuckaby
Owner

jhuckaby commented Feb 9, 2025

I've never heard of Cronicle core dumping before. Some internet research suggested that this could be an OOM (out-of-memory) event on that server. Having 20 users watching live logs will cause an increase in memory usage. But you did say you have 32 GB of RAM (is anything else running on that server that could have eaten up all the memory?), so I am really struggling to find potential causes.

Here are some things to try:

https://chatgpt.com/share/67a8e226-0128-8006-b71a-d7ba57747e6c

I'm particularly interested in any core dump files you can find on that server, and what they contain.

Quoting from that ChatGPT conversation above:

Since systemd reports that the process dumped core (code=dumped, signal=ABRT), you can look for the actual core dump:

coredumpctl list

Inspect the core dump (if systemd-coredump is enabled):

coredumpctl info 1089

or to extract a backtrace:

coredumpctl gdb 1089

If you don’t have systemd-coredump enabled, core dumps may be saved to /var/lib/systemd/coredump/, /var/crash/, or /core (depending on settings).

I assume you should replace 1089 in the examples above with the core dump that your server found from the list command.
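Once coredumpctl gdb drops you into a gdb session on the dump, typing bt (or thread apply all bt for every thread) at the gdb prompt should print the backtrace, which is the useful part.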

Good luck!

@iamumairayub
Author

is anything else running on that server that could have eaten up all the memory?

I have aerokube/selenoid on this server, which sometimes runs 15 concurrent sessions. I just ran them now, and here is the CPU/RAM usage; RAM always stays low like this, but CPU sometimes spikes up to 90%.

[Screenshot of CPU/RAM usage]

I have now also installed coredumpctl.

Here are some things to try:

I checked your ChatGPT link and tried every command as suggested; all commands return no results except:

cat /var/crash/_usr_bin_node.0.crash

https://pastebin.com/Xz4EvsG5

Question: is there a way to disable the live Cronicle logs?

@jhuckaby
Owner

Thanks, so it died due to signal 6 (SIGABRT). I can't find anything else useful in that core dump, alas.

What Does Signal 6 (SIGABRT) Mean?

Signal 6 (SIGABRT) is an abort signal, typically triggered by:

  1. An explicit call to abort() in the code (often due to assertion failures).
  2. Uncaught exceptions in Node.js (if not handled properly).
  3. Memory corruption detected by glibc, causing it to abort the process.
  4. Out-of-memory (OOM) situations, though in that case, you’d usually see SIGKILL (signal 9) instead.

So my theory of an OOM doesn't sound very likely anymore, because you didn't see any evidence of an OOM event, and an OOM would probably result in a SIGKILL, not a SIGABRT.
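If you want to rule the OOM killer out completely, the kernel log is worth a check; a generic Linux command like this (not Cronicle-specific) should surface any OOM kills around the crash time:

dmesg -T | grep -i -E 'out of memory|oom'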

The fact that Cronicle's own crash log (/opt/cronicle/logs/crash.log) is missing indicates that this wasn't an internal Node.js related error. Those should ALWAYS generate a crash log, with a full stack trace. You said there wasn't one, so this indicates a system level issue.

I honestly don't know what else to try here. This is very mysterious, and not something I have ever seen from Cronicle before. And I've been running it in a live production environment with hundreds of servers since 2014. Up 24x7, thousands of jobs per day, and not a single crash.

I will say that I have only JUST begun testing my apps on Node.js v22. There may be something weird going on with v22 and Cronicle, but I can't imagine what that would be.

You could try downgrading Node to v20 or v18, but that's really grasping at straws. I can't think of any other cause.

I'm very sorry this is happening 😞

Question: is there a way to disable the live Cronicle logs?

Not officially, no. I suppose you could try setting the custom_live_log_socket_url configuration property to a "bad" URL like http://127.0.0.1:9999 which should prevent the live log stream socket from connecting to anything. It should just show an error instead.
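If you try it, that property would presumably go at the top level of /opt/cronicle/conf/config.json, alongside the existing settings, something like:

"custom_live_log_socket_url": "http://127.0.0.1:9999"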

I'm sorry I wasn't of more help. This is a very unusual issue.

@iamumairayub
Author

Let's see what information coredumpctl has when it crashes next time.

I'm very sorry this is happening 😞

Don't be sorry, sir; you have made our lives easier by developing Cronicle. I am grateful to you :)

Thanks for your time.

I might post more comments if it crashes again
