From 5d8db2f3edc52de1494c3b517275c77dea9b1d7f Mon Sep 17 00:00:00 2001 From: Fengguang Wu Date: Fri, 27 Oct 2017 08:31:12 +0200 Subject: [PATCH 1/3] doc: add initial announcement Signed-off-by: Fengguang Wu --- doc/announce.md | 177 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 177 insertions(+) create mode 100644 doc/announce.md diff --git a/doc/announce.md b/doc/announce.md new file mode 100644 index 000000000..75b269027 --- /dev/null +++ b/doc/announce.md @@ -0,0 +1,177 @@ +0-day kernel build/boot testing farm +==================================== + +(June 2012 MSR by Fengguang Wu ) + +the problem +----------- + +The Linux kernel has a vibrant community and fast development cycles, which is +excellent. On the other hand, the large changesets carry bugs and regressions. +Judging by the pains that I, as a typical kernel developer, encounter in +daily hacking, there is a lot of room for improvement. + +Build errors are often regarded as trivial ones. However, we obviously lack an +effective way to prevent many of them from leaking into Linus' tree, not to +mention the linux-next tree, where it hurts many -mm developers. + +According to Geert's "Build regressions/improvements in v3.4" report, there are +~100 known build bugs shipped with the official Linux 3.4 release. The number +is somewhat exaggerated because it includes build failures for many less cared-for +archs, but that fact still struck me. + +The attached xfs.png and drm.png represent my initial build status for the +typical dev trees. Each red 'c' character indicates one commit that won't build +for one kconfig. A line full of 'c' indicates one build bug inherited from the +base tree (i.e. Linus' tree); a range of 'c' characters means a build error is +introduced and fixed _some time_ later, which will be a problem for bisects. + +Runtime oopses are more challenging. 
As you may discover on LKML, lots of the +bug reports are simply ignored, because it's often really hard to track down +user-reported problems. Hard-to-reproduce bugs are virtually unfixable; bugs +in old kernels are not cared about by upstream developers; regressions not bisected +down to one particular commit could kill quite a few brain cells, and there is +the question "who is to blame for^W^Wown this bug?". To be frank, the only way +to guarantee the prompt fix of a bug is to explicitly tell the developer: hi, +your XXX commit triggered this YYY bug. + +It boils down to one question: How can we make sure every regression is +caught, root-caused and fixed in a timely and easy fashion? There is a lot +of work to do in each development stage, and the part of the problem I'm trying +to attack is: quality assurance in the very early development stage, as soon as +new commits are pushed to public git trees. + +0-day kernel build test farm +---------------------------- + +In order to effectively improve Linux kernel quality and fuel its R&D cycles, +I'm setting up this 0-day kernel build test farm with these highlights: + +0. 0 effort to use +1. 1-hour response time (aka. 0-day) +2. "brute-force" commit-by-commit tests +3. auto test all branches in all developers' git trees +4. automated error notification to the right developer + +### 0 effort to use + +We need to encourage, but NOT rely on the developers' self-discipline to do +tests on their own. I noticed that even the most seasoned maintainers who +manage their own professional build tests may act carelessly at times and push +untested commits publicly. IMHO this is human nature that we need to face +rather than blame. Then there are the more typical developers who only build +and run their kernels for one config and hardware. We have to accept that not +everyone will bother or have the time/resources to carry out thorough tests. 
+ +So the most effective way of quickly improving Linux quality would be to run +a test farm that works 24x7 on all the new commits. I'm not trying to sell shiny +test tools to the kernel developers (at least, it's not the No. 1 goal), but +rather to take on the effort of setting up and maintaining one test farm and making it +perform well. + +The kernel developers are delighted to find that, all of a sudden, they are +backed by a professional build testing system. The responses have mostly been +positive, and the few negative ones did help improve the system. + +### 1-hour response time (aka. 0-day) + +This is indeed a very important and achievable target. It creates an excellent user +experience and makes the developers feel at home, because they can hardly do +better even when kicking off tests on their own machines. It makes Intel look +good, professional and powerful, and brings Intel very close to the community. + +Quite a few developers (including myself) overuse linux-next as their catch-all +testbed, even for silly build errors. linux-next is re-assembled and tested +on a daily basis and I'm trying to outrace it and get errors reported/fixed +before the linux-next merge. + +### auto test all branches in all developers' git trees + +There are nice tools to help developers do in-house tests; there are well +established build farms that work daily on the linux-next tree. However, there +is still one big gap lying in between: the various dev branches inside the +various git trees ask for more 3rd party testing. + +Our test farm will auto grab all newly created or updated branches and make +sure every new piece of work is properly tested, hopefully before being +merged by linux-next as well as the non-rebaseable Linus/tip/net etc. upstream +trees. + +### "brute-force" commit-by-commit tests + +It's a common expectation for the developers to do bisectability tests; however, +there has been no way to *ensure* this. 
Perhaps it was deemed impossible for +some central server(s) to carry out bisectability tests for all the 10000+ +commits merged in one Linux release. However, my experiments show that, by +taking advantage of some optimizations, it only requires one single 2-socket +SandyBridge server to do basic build tests for each and every commit. And +adding more servers will further improve the test coverage and response time. + +The key observation is that if it takes half an hour to build the 1st commit from +scratch, the following 10 commits (as incremental changes) typically take only +another half hour to compile. In that sense, it's not really 'brute-force' +compilation. Considering the guarantees of bisectability and the ability to +find out the right developer to notify, the cost is well justified. + +### automated error notification to the right developer + +Compile errors are trivial ones after all. They are well suited to automation. +That helps guarantee the response time: once human checks are involved, the added +delays will be unpredictable. And it will help reduce long term maintenance cost. + +current status +-------------- + +We are running two 2-socket SandyBridge compile servers. They build 300-400 +commits and ~10000 kernels per day. 30 kconfigs are tested for each commit. + +We are "routinely" catching 1-2 new build error(s) on each working day. New +build warnings and sparse check warnings are also discovered on a daily basis. + +Most of the built kernels will be boot tested. The supporting hardware consists of +several less powerful boxes, each running 4-12 kvm instances, each of which can boot test a +kernel in about 1 minute. Once booted up, some heavier tests on memory management, +I/O and the trinity fuzzer will be selectively executed. This system has proved to +be good at catching runtime errors. 
For example, here is the list of bug +reports I sent: + + 11372 N F Jun 22 Cc LKML ( 200:0) &-&->Re: boot hang on commit "PM / ACPI: Fix suspend/resume regression caused by cpuidle cleanup." + 11995 N F Jun 23 Cc LKML ( 101:0) BUG: tracer_alloc_buffers returned with preemption imbalance + 12141 N F Jun 24 Cc LKML ( 39:0) boot hang on CONFIG_FB_VGA16 + 12142 F Jun 24 Cc LKML ( 77:0) vfs/for-next: NULL pointer dereference in sysfs_dentry_delete() + 606 F Jun 25 To Joern Engel ( 71:0) NULL dereference in logfs_get_wblocks() + 13017 N F Jun 26 Cc LKML ( 106:0) BUG: No init found on NFSROOT + 13019 F Jun 27 Cc LKML ( 90:0) `-> BUG: held lock freed! + + 534 F Jul 03 Cc LKML ( 44:0) genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000000 (lirc_sir) + 539 F Jul 03 Cc LKML (7640:2) [mac80211-next:for-john] WARNING: at /c/kernel-tests/net/net/wireless/core.c:471 wiphy_register+0 + 606 r F Jul 06 Cc LKML ( 351:1) general protection fault on ttm_init() + 626 F Jul 08 Cc LKML (3047:2) WARNING: __GFP_FS allocations with IRQs disabled (kmemcheck_alloc_shadow) + 645 r F Jul 09 Cc LKML (3324:2) rcu_dyntick and suspicious RCU usage + 659 F Jul 10 Cc LKML (5418:2) [kgdb:kgdb-next] KGDB: BP remove failed: ffffffff81026ed0 + 662 F Jul 10 Cc LKML (5019:2) [Staging/speakup] BUG: spinlock trylock failure on UP on CPU#0, trinity-child0/484 + 663 F Jul 10 Cc LKML (2999:2) linux-next: Early crashed kernel on CONFIG_SLOB + 664 F Jul 10 Cc LKML (3068:2) Kernel boot hangs on commit "switch fput to task_work_add" + 665 F Jul 10 To LKML (3643:2) isdnloop: stack-protector: Kernel stack is corrupted in: ffffffff81e5b55b + 666 F Jul 10 Cc LKML (4748:2) ftrace_ops_list_func() triggered WARNING: at kernel/lockdep.c:3506 + 667 F Jul 11 Cc LKML (2769:2) WARNING: at drivers/misc/kgdbts.c:813 run_simple_test() + +The pile of bug reports around July 10 are old bugs found by the newly +set up randconfig boot tests. 
Besides, I didn't send out two machine-specific +bugs, which we may need to resolve ourselves. + +It's been a hard time for me bringing these tests up. However, it seems to have paid +off. The initial number of bugs they exposed indicates they will be effective +in catching new regressions in the future. + +summary +------- + +Hopefully this will be a valuable long-term project for the Linux community as +well as Intel. We are probably the best candidate to run these tests, not only +because hardware is cheap for Intel, but also because we are in the unique position +of having all the bleeding-edge hardware to test run the new kernels, and are +actually the most willing to make sure they fit well with each other. + +Thanks, +Fengguang From fcb47ccfc50c1f4da388aed9596c0acaac04a917 Mon Sep 17 00:00:00 2001 From: Fengguang Wu Date: Fri, 10 Nov 2017 20:34:11 +0800 Subject: [PATCH 2/3] lkp-bootstrap: run arbitrary job script Signed-off-by: Fengguang Wu --- rootfs/addon/etc/init.d/lkp-bootstrap | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/rootfs/addon/etc/init.d/lkp-bootstrap b/rootfs/addon/etc/init.d/lkp-bootstrap index be314c228..1e2825ed4 100755 --- a/rootfs/addon/etc/init.d/lkp-bootstrap +++ b/rootfs/addon/etc/init.d/lkp-bootstrap @@ -47,16 +47,19 @@ read_kernel_cmdline_vars # The job file is contained in the initrd -- no need to download it here. [ -n "$job" ] || job=$(echo /lkp/scheduled/*/*.yaml) # in case CONFIG_PROC_FS is not set -[ -e "$job" -o -e ${job%.yaml}.sh ] || { - echo $job does not exist, quit from LKP - exit 0 # to work with non-LKP boots -} -if [ "$job" != "${job%.sh}" ]; then - . $job +if [ "${job%.yaml}" != "$job" ]; then + job_script=${job%.yaml}.sh else - . ${job%.yaml}.sh + job_script=$job fi + +[ -e "$job_script" ] || { + echo "$job_script does not exist, quit from LKP" + exit 0 # to work with non-LKP boots +} + +. 
$job_script export_top_env : ${user:=lkp} From 4dc852c2ab67cec3f728254b87d1b81025c41a5c Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 12 Mar 2018 11:21:27 +0200 Subject: [PATCH 3/3] Add CRIU as functional test Signed-off-by: Mike Rapoport --- distro/depends/criu | 17 +++++++++++++++++ etc/functional-tests | 1 + jobs/criu.yaml | 5 +++++ lib/nresult_root.rb | 2 +- lib/result.rb | 3 ++- pack/criu | 13 +++++++++++++ stats/criu | 23 +++++++++++++++++++++++ tests/criu | 8 ++++++++ 8 files changed, 70 insertions(+), 2 deletions(-) create mode 100644 distro/depends/criu create mode 100644 jobs/criu.yaml create mode 100755 pack/criu create mode 100755 stats/criu create mode 100755 tests/criu diff --git a/distro/depends/criu b/distro/depends/criu new file mode 100644 index 000000000..e752ee5d5 --- /dev/null +++ b/distro/depends/criu @@ -0,0 +1,17 @@ +build-essential +libprotobuf-dev +libprotobuf-c0-dev +protobuf-c-compiler +protobuf-compiler +python-protobuf +libnet-dev +pkg-config +libnl-3-dev +python-ipaddr +libbsd0 +libbsd-dev +iproute2 +libcap-dev +libaio-dev +python-yaml +libnl-route-3-dev diff --git a/etc/functional-tests b/etc/functional-tests index 52d7cf325..6d582090d 100644 --- a/etc/functional-tests +++ b/etc/functional-tests @@ -10,3 +10,4 @@ kvm-unit-tests packetdrill suspend lkp-bug +criu diff --git a/jobs/criu.yaml b/jobs/criu.yaml new file mode 100644 index 000000000..8d391296b --- /dev/null +++ b/jobs/criu.yaml @@ -0,0 +1,5 @@ +suite: criu +testcase: criu +category: functional + +criu: diff --git a/lib/nresult_root.rb b/lib/nresult_root.rb index 6f937984e..ed1096fe1 100755 --- a/lib/nresult_root.rb +++ b/lib/nresult_root.rb @@ -286,7 +286,7 @@ class MResultRootTableSet 'qemu', 'rcutorture', 'suspend', 'trinity', 'ndctl', 'nfs-test', 'hwsim', 'idle-inject', 'mdadm-selftests', 'xsave-test', 'nvml', 'test_bpf', 'mce-log', 'perf-sanity-tests', 'update-ucode', 'reboot', 'cat', 'libhugetlbfs-test', 'ocfs2test', 'syzkaller', - 'perf_test', 'stress-ng', 
'sof_test', 'fxmark'].freeze + 'perf_test', 'stress-ng', 'sof_test', 'fxmark', 'criu'].freeze OTHER_TESTCASES = ['0day-boot-tests', '0day-kbuild-tests', 'build-dpdk', 'build-sof', 'sof_test', 'build-nvml', 'build-qemu', 'convert-lkpdoc-to-html', 'convert-lkpdoc-to-html-css', diff --git a/lib/result.rb b/lib/result.rb index 1bafc0da9..631a983f8 100755 --- a/lib/result.rb +++ b/lib/result.rb @@ -43,7 +43,8 @@ class ResultPath < Hash 'kvm-unit-tests-qemu' => %w[path_params tbox_group rootfs kconfig compiler commit qemu_config qemu_commit run], 'nvml-unit-tests' => %w[path_params tbox_group rootfs kconfig compiler commit nvml_commit run], 'mbtest' => %w[path_params tbox_group rootfs kconfig compiler commit mbt_commit run], - 'sof_test' => %w[path_params tbox_group rootfs kconfig compiler commit sof_commit run] + 'sof_test' => %w[path_params tbox_group rootfs kconfig compiler commit sof_commit run], + 'criu' => %w[path_params tbox_group rootfs kconfig compiler commit criu_commit run], }.freeze def path_scheme diff --git a/pack/criu b/pack/criu new file mode 100755 index 000000000..5cf46d2da --- /dev/null +++ b/pack/criu @@ -0,0 +1,13 @@ +#!/bin/bash + +CONFIGURE_FLAGS="--arch=$arch" + +download() +{ + git_clone_update https://github.com/checkpoint-restore/criu +} + +install() +{ + cp -af $source_dir/* $BM_ROOT/ +} diff --git a/stats/criu b/stats/criu new file mode 100755 index 000000000..cb1a7735e --- /dev/null +++ b/stats/criu @@ -0,0 +1,23 @@ +#!/usr/bin/awk -f + +BEGIN { + nr_test = 0 +} + +/====== Run [0-9a-zA-Z_/]* in / { + flav = $5 + tname = $3 + nr_test += 1 +} + +/PASS/ { + printf("%s/%s.pass: 1\n", flav, tname) +} + +/FAIL at/ { + printf("%s/%s.fail: 1\n", flav, tname) +} + +END { + printf("total_test: %d\n", nr_test) +} diff --git a/tests/criu b/tests/criu new file mode 100755 index 000000000..fb4456f12 --- /dev/null +++ b/tests/criu @@ -0,0 +1,8 @@ +#!/bin/bash -x + +. $LKP_SRC/lib/debug.sh +. 
$LKP_SRC/lib/reproduce-log.sh + +cd $BENCHMARK_ROOT/criu/test || die "Cannot find CRIU dir" + +python zdtm.py run -a