Need help for restore failed issue #2552

Open
melsamathew opened this issue Dec 18, 2024 · 15 comments

Comments

@melsamathew

We are currently using CRIU version 3.15 for checkpoint and restore operations in our Linux application.
The environment consists of Glibc 2.40 and GCC 14.2.0.
While the checkpoint (dumping) process is successful, we encounter an issue during restore, where the process fails with the following error:

(00.064198) pie: 5495: `- skip pagemap
(00.064199) pie: 5495: `- skip pagemap
(00.064201) pie: 5495: `- skip pagemap
(00.064297) Error (criu/cr-restore.c:1573): 5495 killed by signal 127: Unknown signal 127
(00.064335) Error (criu/cr-restore.c:2498): Restoring FAILED.
Our query is whether there could be any compatibility issues between CRIU 3.15 and the newer versions of Glibc (2.40) and GCC (14.2.0). Specifically, we would like to know:

Under what circumstances can CRIU throw an "unknown signal 127" during a restore?
How can we debug this issue further to pinpoint the cause?
We appreciate any insights or suggestions to help resolve this issue.

CRIU logs and information:
Restore.log

CRIU full dump/restore logs: [Dump.log](https://github.com/user-attachments/files/18176588/Dump.log)
Output of `criu --version`:

criu --version
Version: 3.15

Output of `criu check --all`:

criu check --all

Warn (criu/cr-check.c:859): Dirty tracking is OFF. Memory snapshot will not work.
Warn (criu/cr-check.c:1194): Loginuid restore is OFF.
Error (criu/cr-check.c:1216): UFFD is not supported
Error (criu/cr-check.c:1216): UFFD is not supported
Warn (criu/cr-check.c:1239): clone3() with set_tid not supported
Error (criu/cr-check.c:1281): Time namespaces are not supported
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

Additional environment details:
The environment consists of Glibc 2.40 and GCC 14.2.0. We recently upgraded from Glibc 2.23, with which CRIU was working.

@adrianreber
Member

Your CRIU version is too old (as mentioned in the other ticket) to work with restartable sequences. Either update CRIU or you can set an environment variable to disable restartable sequences in glibc (as mentioned in the other ticket).

@melsamathew
Author

melsamathew commented Dec 19, 2024

Thank you for your response and suggestions. I have upgraded CRIU to versions 3.17 and 4.0 and attempted to run the simple loop program (from https://criu.org/Simple_loop) to verify that CRIU is functioning correctly. However, the dump process failed with a segmentation fault.

In CRIU-4.0
ps -C test.sh
PID TTY TIME CMD
8401 pts/1 00:00:00 test.sh
criu dump -vvvv -o dump.log -t 8401 --shell-job && echo OK
Segmentation fault (core dumped)

The dump log ends with:

(00.045690) Add cgroup ns 8 pid 7153
(00.045693) cg: Dumping cgroups for thread 7153
(00.045727) cg: `- New css ID 1
(00.045729) cg: `- [cpuset,cpu,cpuacct,blkio,memory,devices,freezer,perf_event,hugetlb,pids,rdma] -> [/] [0]
(00.045732) cg: Set 1 is criu one

Or, when run as root:

(00.055347) Error (criu/parasite-syscall.c:88): si_code=4 si_pid=13325 si_status=11
(00.055351) Error (criu/parasite-syscall.c:95): 13325 was stopped by 11 unexpectedly

Could you please suggest how to make this succeed?

@adrianreber
Member

As described in #1696 you need at least Linux Kernel version 5.13 for restartable sequences to work with CRIU. According to your log files you have 5.4.282-staros-v3-scale-64.

With the glibc version you are using, restartable sequences are always used, and you need at least kernel 5.13 and CRIU 3.17.

You can use a newer kernel, an older glibc or export the environment variable I mentioned earlier.
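
A minimal sketch of the environment-variable route, assuming the variable in question is glibc's `GLIBC_TUNABLES` with `glibc.pthread.rseq=0` (the setting quoted later in this thread). glibc parses tunables at program startup, so the variable has to be in the environment of the process to be checkpointed when that process is launched; exporting it only in the shell that later runs `criu` does not change an already running process. The helper name `run_without_rseq` is hypothetical.

```shell
# Hypothetical helper: start a command with rseq registration disabled in glibc.
# GLIBC_TUNABLES is parsed once at process startup, so it must be set here,
# not later in the shell that invokes criu.
run_without_rseq() {
    GLIBC_TUNABLES=glibc.pthread.rseq=0 "$@"
}

# Example (test.sh is the loop script used in this thread):
#   run_without_rseq ./test.sh &
```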

@melsamathew
Author

Thanks for your input. Can you tell us which combinations of CRIU, glibc, and kernel versions we can try to make this work?
Also, which version of CRIU is compatible with Glibc 2.40?

@adrianreber
Member

If glibc >= 2.35 you need at least kernel 5.13 and CRIU 3.17.
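
The rule above can be sketched as a small shell check. This is an illustration only: the function names are hypothetical, the comparison relies on GNU `sort -V`, and a plain version comparison cannot detect distribution kernels that backport the required patch to an older version number.

```shell
# version_ge A B: succeeds if version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# rseq_combo_ok GLIBC KERNEL CRIU: applies the rule quoted above.
# For glibc < 2.35 the rule does not apply, so the check passes.
rseq_combo_ok() {
    if version_ge "$1" 2.35; then
        version_ge "$2" 5.13 && version_ge "$3" 3.17
    else
        return 0
    fi
}

# Example with the versions reported in this thread:
#   rseq_combo_ok 2.40 5.4.282 3.15 || echo "combination not covered by the rule"
```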

@melsamathew
Author

Thanks, adrianreber. I understand that for everything to work properly we also need an upgraded kernel.
Other than setting environment variables, are there any patches that can be applied to CRIU 3.15 to make it work with restartable sequences without upgrading the kernel?

@adrianreber
Member

No

@melsamathew
Author

We need some more help here.
Currently, we are unable to upgrade the kernel to version 5.13.
The issue persists after setting the environment variable with Glibc 2.40 and CRIU 3.15.
We are now trying to debug this combination. Can you suggest how to further debug the unknown signal 127?
Are there any logs we can enable to identify what is triggering this signal?
Additionally, could the introduction of rseq in the new glibc be the primary cause of the restore failure?

@adrianreber
Member

I guess you are setting the environment variable wrong. Can you post the exact steps you are doing so that I can reproduce it?

@melsamathew
Author

We exported the environment variable in our application as `export GLIBC_TUNABLES=glibc.pthread.rseq=0`.
After setting it, the list of tunables is as follows:
/lib64/ld-linux-x86-64.so.2 --list-tunables
glibc.cpu.hwcaps:
glibc.cpu.plt_rewrite: 0 (min: 0, max: 2)
glibc.cpu.prefer_map_32bit_exec: 0 (min: 0, max: 1)
glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_ibt:
glibc.cpu.x86_memset_non_temporal_threshold: 0x180000 (min: 0x4040, max: 0xffffffffffffffff)
glibc.cpu.x86_non_temporal_threshold: 0x180000 (min: 0x4040, max: 0xfffffffffffffff)
glibc.cpu.x86_rep_movsb_threshold: 0x2000 (min: 0x100, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
glibc.cpu.x86_shared_cache_size: 0x200000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_shstk:
glibc.elision.enable: 0 (min: 0, max: 1)
glibc.elision.skip_lock_after_retries: 3 (min: 0, max: 2147483647)
glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
glibc.elision.skip_lock_internal_abort: 3 (min: 0, max: 2147483647)
glibc.elision.skip_trylock_internal_abort: 3 (min: 0, max: 2147483647)
glibc.elision.tries: 3 (min: 0, max: 2147483647)
glibc.gmon.maxarcs: 1048576 (min: 50, max: 2147483647)
glibc.gmon.minarcs: 50 (min: 50, max: 2147483647)
glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffffffffffff)
glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffffffffffff)
glibc.malloc.check: 0 (min: 0, max: 3)
glibc.malloc.hugetlb: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: 0, max: 2147483647)
glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.perturb: 0 (min: 0, max: 255)
glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.mem.decorate_maps: 0 (min: 0, max: 1)
glibc.mem.tagging: 0 (min: 0, max: 255)
glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
glibc.pthread.rseq: 0 (min: 0, max: 1)
glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
glibc.pthread.stack_hugetlb: 1 (min: 0, max: 1)
glibc.rtld.dynamic_sort: 2 (min: 1, max: 2)
glibc.rtld.enable_secure: 0 (min: 0, max: 1)
glibc.rtld.nns: 0x4 (min: 0x1, max: 0x10)
glibc.rtld.optional_static_tls: 0x200 (min: 0x0, max: 0xffffffffffffffff)
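
The listing above shows the tunables seen by that one loader invocation. To check whether the process that is actually being dumped was started with the setting, one option is to inspect its initial environment under `/proc`. The sketch below assumes a Linux `/proc` filesystem; the helper name `environ_has_rseq_off` is hypothetical.

```shell
# Hypothetical helper: succeeds if the NUL-separated environ file given as $1
# contains GLIBC_TUNABLES with glibc.pthread.rseq=0 (tunables are ':'-separated).
environ_has_rseq_off() {
    tr '\0' '\n' < "$1" |
        grep -Eq '^GLIBC_TUNABLES=(.*:)?glibc\.pthread\.rseq=0(:.*)?$'
}

# Example, with <pid> replaced by the PID passed to "criu dump -t":
#   environ_has_rseq_off /proc/<pid>/environ && echo "rseq disabled at startup"
```

`/proc/<pid>/environ` reflects the environment the process was executed with, so a variable exported after the process started will not appear there.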

@melsamathew
Author

Basically, "unknown signal 127" relates to "a standard error message that indicates a command could not be found or executed".
Is there a chance that CRIU could run into such an issue after a glibc upgrade?
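
For context, a small sketch that separates two things that are easy to conflate. In a POSIX shell, an exit status of 127 conventionally means the command was not found, while a process terminated by signal N is reported by the shell as status 128 + N. The CRIU log line reports a signal number, not a shell exit status, so whether the two share a cause is not established in this thread. The helper name `status_of` is hypothetical.

```shell
# Hypothetical helper: run a command quietly and print the status the shell saw.
status_of() {
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

status_of sh -c 'no_such_command_xyz'   # command not found: exit status 127
status_of sh -c 'kill -KILL $$'         # killed by SIGKILL (9): status 128 + 9
```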

@melsamathew
Author

Hi adrianreber,
Can you help with this? We are waiting for your reply. Thanks.

@melsamathew
Author

Hi, we are trying to upgrade the Linux kernel, but it has a lot of dependencies on other packages and will take time. Can you please point out which area of linux kernel is depending on latest criu?

@adrianreber
Member

> Can you please point out which area of linux kernel is depending on latest criu?

As described in #1696 you need torvalds/linux@90f093f

RHEL8 for example has that patch backported on a 4.18 kernel.

But the problem you have is not really understood. I am at this point just assuming it is related to restartable sequences because it makes sense with what you described. It still could be something else. I would recommend that you try it, for testing, on an up to date Linux distribution which is known to work. If it works there you know it is your environment.

@melsamathew
Author

Hi,
Thanks for the patch. I applied it in the following environment:
Glibc 2.40, CRIU 3.17, Kernel 5.4.282 with the above patch.

Now the restore fails with a segmentation fault:
(00.062154) 17144: Error (criu/cr-restore.c:1508): 17428 killed by signal 11: Segmentation fault
(00.067228) Error (criu/cr-restore.c:1504): 17144 exited, status=1
(00.067259) Error (criu/cr-restore.c:2550): Restoring FAILED.
Earlier, without this patch, the dump itself failed; now the restore fails with a segmentation fault.
Is this a known issue, and how can we debug it further?
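
As a first debugging step, it can help to pull every error line out of the verbose CRIU log together with the lines just before it, since those usually show which step was in progress. The sketch below assumes the log was written with `-vvvv -o <file>` as in the dump command earlier in this thread; the helper name `criu_errors` is hypothetical.

```shell
# Hypothetical helper: criu_errors LOGFILE [CONTEXT]
# Prints each line containing "Error (" with CONTEXT preceding lines
# (default 3), prefixed with line numbers. Fails if no error is found.
criu_errors() {
    grep -n -B "${2:-3}" 'Error (' "$1"
}

# Example:
#   criu_errors restore.log 10
```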
