Commits · d30b73f05fa53e392086fffa6174e4138659f0a7 · nexedi / nxdtest

17 Jun, 2022 1 commit

test: fix flaky test_cancel_from_master · d30b73f0

Jérome Perrin authored Jun 17, 2022

We did not wait long enough for the tested process to set their signal
handler, as we can see on test history [1], this test fails often when
running on test node.

It always pass when I run locally, but if I adjust the _tmasterpoll
patch set to 0.01 second, it fails the same way.

1: https://erp5.nexedi.net/test_result_module/20220616-13A460D9B/2/TestResultLine_viewResultHistory

/reviewed-on !17

d30b73f0

27 Jan, 2022 4 commits

trun: Fix returncode when test run is canceled · 56e52da6

Kirill Smelkov authored Jan 26, 2022

To detect leaked processes in the end of the run, we are first waiting
for remaining test processes via procps. If the main waiting loop was
canceled without main test process first completed (p.poll calls never
returned !None), then the waitpid(pid=p.pid) system call will be done
via procps. Which leaves further waitpid(pid=p.pid) system call invoked
by subprocess to get -ECHILD error and artificially report 0 exit
status:

https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/subprocess.py#L1094-L1107

-> Fix it by propagating .returncode from procps to Popen instance so
that it does not get lost.

On sample .nxdtest with

    TestCase('sleep', ['sleep', '10'])

Before the patch the output with CTRL+C was:

    $ nxdtest
    ...
    >>> sleep
    $ sleep 10
    ^C# Interrupt
    ok      sleep   0.604s  # 1t 0e 0f 0s       <-- NOTE
    # test run canceled
    # ran 1 test case:  1·ok                    <-- NOTE

After the patch the output becomes:

    $ nxdtest
    ...
    >>> sleep
    $ sleep 100
    ^C# Interrupt
    error   sleep   1.006s  # 1t 1e 0f 0s       <-- NOTE
    # test run canceled
    # ran 1 test case:  1·error                 <-- NOTE

/reviewed-by @jerome
/reviewed-on !16

56e52da6

Move detection of leaked processes to trun · cf300184

Kirill Smelkov authored Jan 26, 2022

When test run is canceled via signal, we want to send SIGTERM to trun
and let it kill/wait the tested processes. However until now we were
sending SIGTERM to trun and immediately checking whether some spawned
tested processes remain without giving any time for trun to complete first.

Fix this by moving the code that detects/cleanups leaked processes to
trun itself, and in the main nxdtest driver to only send SIGTERM to trun
and let trun do the cleanup/leaked processes detection.

This fixes erroneous "leaked process detected" as noticed by @jerome in
the following example:

    (!16 (comment 150835))
    $ nxdtest
    ...
    >>> sleep
    $ sleep 10
    ^C# Interrupt
    # stopping due to cancel
    # leaked pid=188877 'sleep' ['sleep', '10']		<-- "leaked" is a bit misleading here
    error   sleep   1.030s  # 1t 1e 0f 0s
    # test run canceled
    # ran 1 test case:  1·error

After the patch the output in this case becomes:

    $ nxdtest
    ...
    >>> sleep
    $ sleep 10
    ^C# Interrupt
    ok      sleep   0.604s  # 1t 0e 0f 0s
    # test run canceled
    # ran 1 test case:  1·ok

There is no more "leaked pid=...." emitted, but the change from "error" to "ok" status is wrong.
We will fix it in the next patch.

/reviewed-by @jerome
/reviewed-on !16

cf300184

Fix thinko when finally killing processes · 548a5325

Kirill Smelkov authored Jan 26, 2022

First we send SIGTERM to leaked processes, and then, after timeout,
intend to send SIGKILL to the leaked processes and to the main test
process if it is still alive.

However there is a thiko in the code: we were sending SIGKILL only to
the main test process, not to all leaked ones.

-> Fix it.

The bug was there since 0ad45a9c (Detect if a test leaks processes and
terminate them).

/reviewed-by @jerome
/reviewed-on !16

548a5325

Cancel test run on SIGINT/SIGTERM · b0cf277d

Kirill Smelkov authored Jan 24, 2022

In addition to canceling test run is master tells us to do so, also cancel the
run if interrupted or terminated.

With the following sample .nxdtest

    TestCase('sleep', ['sleep', '10'])

before the patch it does not react to CTRL+C:

    $ nxdtest
    ...
    >>> sleep
    $ sleep 10
    ^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C# ran 0 test cases.   <-- no reaction to CTRL+C, finishes after 10 seconds
    Traceback (most recent call last):
      File "/home/kirr/src/tools/go/py3.venv2/bin/nxdtest", line 33, in <module>
        sys.exit(load_entry_point('nxdtest', 'console_scripts', 'nxdtest')())
      File "/home/kirr/src/tools/go/py3.venv2/lib/python3.9/site-packages/decorator.py", line 232, in fun
        return caller(func, *(extras + args), **kw)
      File "/home/kirr/src/tools/go/pygolang-master/golang/__init__.py", line 103, in _
        return f(*argv, **kw)
      File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 339, in main
        wg.wait()
    KeyboardInterrupt

after the patch:

    $ nxdtest
    ...
    >>> sleep
    $ sleep 10
    ^C# Interrupt					<-- prompt reaction to CTRL+C
    # stopping due to cancel
    # leaked pid=188877 'sleep' ['sleep', '10']
    error   sleep   1.030s  # 1t 1e 0f 0s
    # test run canceled
    # ran 1 test case:  1·error

Needs pygolang!17 to work.

/reviewed-by @jerome
/reviewed-on !16

b0cf277d

23 Dec, 2021 1 commit

fixup! trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database · 6f75fa90

Kirill Smelkov authored Dec 23, 2021

Even though slapos.core tests seem to mock getgrnam calls[1], disk group
is being looked up in /etc/groups for real which fails, if there is no
such group, e.g. as

    ERROR: test_not_existing (slapos.tests.test_slapgrid.TestSlapgridWithDevPermManagerDevPermEmptyLsblk)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3246, in setUp
        self.setUpExpected()
      File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3230, in setUpExpected
        gid = grp.getgrnam("disk").gr_gid
    KeyError: 'getgrnam(): name not found: disk'

-> Fix it up by also creating "disk" group in our namespace environment.

I'm not sure, but maybe the better long-term fix would be for
slapos.core tests not to access /etc/groups for real and to instead mock
access to this database completely.

Amends commits e6b7993c and b42ccfa5.

/reported-by @tomo
/reported-at slapos!1107 (comment 148758)

[1] https://lab.nexedi.com/nexedi/slapos.core/blob/1.7.1-28-g0b6bf2af4/slapos/tests/test_slapformat.py#L160-166

6f75fa90

20 Dec, 2021 13 commits

Propagate cancellation to spawned test jobs · 81b1907a

Kirill Smelkov authored Dec 20, 2021

So that if a test run is canceled in ERP5 UI, nxdtest stops its run
soon, instead of after several hours in case of
SlapOS.SoftwareReleases.IntegrationTest-* tests.

See the main (938b5455) and other patches for details.

/helped-and-reviewed-by @jerome
/reviewed-on !14

* y/cancel:
  Add test for cancel propagation
  tests: Run nxdtest.main for each test in a separate thread, so that pytest.timeout generally works
  Propagate cancellation to spawned test jobs
  Raise ctx.err() if test run was cancelled
  Stop spawned process softly on ctx cancel

81b1907a

Add test for cancel propagation · 5d656ccf
Jérome Perrin authored Dec 10, 2021
```
/reviewed-by @kirr
/reviewed-on !14
```
5d656ccf

tests: Run nxdtest.main for each test in a separate thread, so that pytest.timeout generally works · 8f59b689

Kirill Smelkov authored Dec 17, 2021

Factor-out the driver to run_nxdtest in a separate thread done in
0ad45a9c (Detect if a test leaks processes and terminate them) from
test_run_procleak into run_nxdtest, so that all tests are run this way
for pytest.timeout, if requested, to work universally for all tests.

The logic of how we run nxdtest under tests and handle timeout will be,
hopefully, soon reworked more, but it is anyway good to keep this logic
in only one place.

/reviewed-by @jerome
/reviewed-on !14

8f59b689

Propagate cancellation to spawned test jobs · 938b5455

Kirill Smelkov authored Dec 07, 2021

A user might cancel test result in ERP5 UI if e.g. some misbehaviour is
detected and a new revision is ready to be tested. This works by
test_result.start() returning None - indicating that there is no more
test_result_lines to exercise. Master also indicates this cancellation
via test_result.isAlive() returning False, but until now we were not
using that information and were always waiting for completion of current
test job that is already spawned.

This works well in practice if individual tests are not long, but e.g.
for SlapOS.SoftwareReleases.IntegrationTest-* it is not good, because
there an individual test might takes _hours_ to execute.

-> Fix it by first setting global context to where we'll propagate
cancellation from test_result.isAlive, and by using that context as the
base for all other activities. This should terminate spawned test
process if test_result is canceled.

The interval to check is picked up as 5 minutes not to overload master.
@jerome says that

    We now have 341 active test nodes, but sometimes we are using
    more, we did in the past to stress test some new machines.

    For the developer, if we reduce the waiting time from a few hours to 1
    minutes or 5 minutes seems more or less equivalent.

For 350 testnodes and each nxdtest checking its test_result status via
isAlive query to master every 5 minutes, it results in ~ 1 isAlive
request/second to master on average.

Had to change time to golang.time to use time.after().
Due to that time() and sleep() are changed to time.now() and
time.sleep() correspondingly.

/helped-and-reviewed-by @jerome
/reviewed-on !14

938b5455

Raise ctx.err() if test run was cancelled · bdd183fb

Kirill Smelkov authored Dec 17, 2021

This is normal rule to return an error from a task if the task has to
abort due to cancellation. We already do this in tee, but not in the
function that is waiting for spawned process to complete(*).

-> Fix that. Wrap corresponding wg.wait() into try/except and check if
it fails due to cancellation, upon which we should not raise, but
instead should continue to finish current test_result_line and only
after stop test run normally (again without raise, but with regular one
line log entry).

(*) added in 0ad45a9c (Detect if a test leaks processes and terminate them)

/reviewed-by @jerome
/reviewed-on !14

bdd183fb

Stop spawned process softly on ctx cancel · 34cb7879

Kirill Smelkov authored Dec 17, 2021

In 0ad45a9c (Detect if a test leaks processes and terminate them) we
organized waiting for spawned processes with handing ctx and sending
SIGKILL to the main spawned process on ctx cancel even though other
leaked processes are always first sent with SIGTERM and - only after
shutdown timeout - later with SIGKILL.

This is too brutal. Rework the code to first send SIGTERM to the main
spawned test process too, and leverage SIGKILL only later after shutdown
timeout.

This will be tested in a later patch which exercises how cancel from
master is propagated.

/reviewed-by @jerome
/reviewed-on !14

34cb7879

Run each testcase with its own /tmp and /dev/shm · 5acd1359

Kirill Smelkov authored Dec 20, 2021

to detect after each test run leaked temporary files, leaked mount
entries, to isolate different test runs from each other, and to provide
tmpfs on /tmp for every test.

The main change and description is in patch1 (a191468f); the other
patches fix that up step-by-step to work for real for all our tests.

/helped-by @tomo
/helped-and-reviewed-by @jerome
/reviewed-on !13

* y/unshare:
  trun: Require FUSE to be working inside user-namespaces to activate them
  Factor checking whether user-namespaces are available into trun.userns_available()
  trun: Add test for how /etc/{passwd,group} is setup for spawned job
  trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database
  trun: Deactivate most capabilities before spawning user test
  Run each testcase with its own /tmp and /dev/shm

5acd1359

trun: Require FUSE to be working inside user-namespaces to activate them · f5f52434

Kirill Smelkov authored Dec 17, 2021

FUSE is needed for wendelin.core, and if we don't check that FUSE works
inside and activate user namespaces on e.g. Linux 4.9 kernel,
wendelin.core will fail to function. Since today wendelin.core is
included into ERP5 as its base component(*) it is practical to require
FUSE-in-userns support unconditionally.

FUSE in user-namespaces started to work in Linux 4.18(+). Detect if it
should work inside via checking whether running kernel is newer. I
choosed this way for simplicity not to unroll test filesystem to try to
mount inside and also not to slow-down startup time.

(*) see nexedi/slapos!874
(+) see https://git.kernel.org/linus/da315f6e0398 and https://git.kernel.org/linus/8cb08329b080.

/reviewed-by @jerome
/reviewed-on nexedi/nxdtest!13

f5f52434

Factor checking whether user-namespaces are available into trun.userns_available() · ef41d960

Kirill Smelkov authored Dec 17, 2021

In the next patch we'll also add detection whether FUSE works inside
user namespaces. Before doing that stop duplicating related code in
between trun.py and nxdtest_test.py .

/reviewed-by @jerome
/reviewed-on !13

ef41d960

trun: Add test for how /etc/{passwd,group} is setup for spawned job · b42ccfa5

Jérome Perrin authored Dec 09, 2021

Even if we don't want to spawn full sshd to verify that ptys work, we can
still minimally test that inside user/group database is setup as expected.

/reviewed-by @kirr
/reviewed-on !13

b42ccfa5

trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database · e6b7993c

Kirill Smelkov authored Dec 07, 2021

Even though libc.opentty stopped to insist on chown(group=tty) for
/dev/pts/*, openssh still wants to do it and fails, preventing sshd from
working. Fix it by spawning test workload with sole current user and
group being present in the password database.

We anyway don't have IDs for users/groups different from current uid/gid
mapped from current namespace, and the kernel, when seeing a file owned
by those, maps them to "nobody/nogroup" for existing files, and rejects
chown to those original IDs obtained from parent's namespace
/etc/{passwd,group} as EINVAL. For the same reason we don't try to mount
our own /dev/pts instance, because we have only current uid/gid mapped
to parent namespace and gid=5 maps to nogroup in parent. With existing
/dev/pts mount entries are only listed as having nogroup, and from
outside they _are_ owned by parent's tty group. If we would mount
/dev/pts anew, parent won't see our /dev/pts/* at all which moves us
a bit more far from desired behaviour.

Still keep root and nobody/nogroup as @jerome reports that without those
users Go tests fail on Debian9:

    slapos!1095 (comment 147177)
    slapos!1095 (comment 147201)

See added comment about all this for more details.

This patch fixes sshd to work under tryn.py(*)

No test because libc.openpty works both with and without this patch, and
it would need too spawn whole sshd under test to verify this.

(*) the diff for sshd 1) running successfully under regular user, and 2)
    previously failing under trun.py without this patch is below:

    diff --git a/regular-nopam.txt b/trun-nopam.txt
    index 378ccb6..5b96c08 100644
    --- a/regular-nopam.txt
    +++ b/trun-nopam.txt
    @@ -1,4 +1,4 @@
    -(neo) (z-dev) (g.env) kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no
    +kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no
     debug1: sshd version OpenSSH_8.4, OpenSSL 1.1.1k  25 Mar 2021
     debug1: private host key #0: ssh-rsa SHA256:y+ujVDqqFBXTclDM2NLy4GME7wReutLcUYOWAeriXdc
     debug1: setgroups() failed: Operation not permitted
    @@ -91,35 +91,13 @@ debug1: session_input_channel_req: session 0 req pty-req
     debug1: Allocating pty.
     debug1: session_new: session 0
     debug1: SELinux support disabled
    -Attempt to write login records by non-root user (aborting)
    -debug1: session_pty_req: session 0 alloc /dev/pts/2
    -debug1: server_input_channel_req: channel 0 request env reply 0
    -debug1: session_by_channel: session 0 channel 0
    -debug1: session_input_channel_req: session 0 req env
    -debug1: server_input_channel_req: channel 0 request shell reply 1
    -debug1: session_by_channel: session 0 channel 0
    -debug1: session_input_channel_req: session 0 req shell
    -Starting session: shell on pts/2 for kirr from 127.0.0.1 port 44106 id 0
    -debug1: Setting controlling tty using TIOCSCTTY.
    -
    -debug1: Received SIGCHLD.
    -debug1: session_by_pid: pid 693948
    -debug1: session_exit_message: session 0 channel 0 pid 693948
    -debug1: session_exit_message: release channel 0
    -debug1: session_by_tty: session 0 tty /dev/pts/2
    -debug1: session_pty_cleanup2: session 0 release /dev/pts/2
    -Attempt to write login records by non-root user (aborting)
    -debug1: session_by_channel: session 0 channel 0
    -debug1: session_close_by_channel: channel 0 child 0
    -Close session: user kirr from 127.0.0.1 port 44106 id 0
    -debug1: channel 0: free: server-session, nchannels 1
    -Received disconnect from 127.0.0.1 port 44106:11: disconnected by user
    -Disconnected from user kirr 127.0.0.1 port 44106
    +chown(/dev/pts/2, 1000, 5) failed: Invalid argument
     debug1: do_cleanup
     debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
     debug1: restore_uid: (unprivileged)
    +debug1: session_pty_cleanup2: session 0 release /dev/pts/2
    +Attempt to write login records by non-root user (aborting)
    +debug1: audit_event: unhandled event 12
     debug1: do_cleanup
     debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
     debug1: restore_uid: (unprivileged)
    -debug1: audit_event: unhandled event 12

    (see slapos!1095 (comment 147018))

/helped-and-reviewed-by @jerome
/reviewed-on !13

e6b7993c

trun: Deactivate most capabilities before spawning user test · a9af3a8a

Kirill Smelkov authored Dec 06, 2021

In the previous patch we asked unshare to keep capabilities so that FUSE
mounting works from under regular user. However full set of capabilities
is too much, and in particular if cap_dac_override is present(*), it means
that writes to files that have read-only permission, is not rejected by
kernel.

-> Adjust trun to retain only those capabilities that we actually need
   = CAP_SYS_ADMIN to mount things.

This should fix the following Go build failure:

    --- FAIL: TestReadOnlyWriteFile (0.00s)
        ioutil_test.go:90: Expected an error when writing to read-only file /tmp/TestReadOnlyWriteFile3940340549/blurp.txt
    FAIL
    FAIL	io/ioutil	0.053s

P.S. And if we would unshare to root instead (unshare -Umr) it should be
still a good idea to drop extra capabilities, as we still want to reject
writes to read-only files.

(*) see https://man7.org/linux/man-pages/man7/capabilities.7.html

/helped-and-reviewed-by @jerome
/reviewed-on !13

a9af3a8a

Run each testcase with its own /tmp and /dev/shm · a191468f

Kirill Smelkov authored Nov 29, 2021

and detect leaked temporary files and mount entries after each test run.

Background

Currently we have several testing-related problems that are
all connected to /tmp and similar directories:

Problem 1: many tests create temporary files for each run. Usually
tests are careful to remove them on teardown, but due to bugs, many kind
of tests, test processes being hard-killed (SIGKILL, or SIGSEGV) and
other reasons, in practice this cleanup does not work 100% reliably and
there is steady growth of files leaked on /tmp on testnodes.

Problem 2: due to using shared /tmp and /dev/shm, the isolation in
between different test runs of potentially different users is not
strong. For example @jerome reports that due to leakage of faketime's
shared segments separate test runs affect each other and fail:
https://erp5.nexedi.net/bug_module/20211125-1C8FE17

Problem 3: many tests depend on /tmp being a tmpfs instance. This are for
example wendelin.core tests which are intensively writing to database,
and, if /tmp is resided on disk, timeout due to disk IO stalls in fsync
on every commit. The stalls are as much as >30s and lead to ~2.5x overall
slowdown for test runs. However the main problem is spike of increased
latency which, with close to 100% probability, always render some test
as missing its deadline. This topic is covered in
https://erp5.com/group_section/forum/Using-tmpfs-for--tmp-on-testnodes-JTocCtJjOd

--------

There are many ways to try to address each problem separately, but they
all come with limitations and drawbacks. We discussed things with @tomo
and @jerome, and it looks like that all those problems can be addressed
in one go if we run tests under user namespaces with private mounts for
/tmp and /dev/shm.

Even though namespaces is generally no-go in Nexedi, they seem to be ok
to use in tests. For example they are already used via private_tmpfs
option in SlapOS:

https://lab.nexedi.com/nexedi/slapos/blob/1876c150/slapos/recipe/librecipe/execute.py#L87-103
https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo-input-schema.json#L121-124
https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L11-16
https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L30-34
https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L170-177
...
https://lab.nexedi.com/nexedi/slapos/blob/1876c150/stack/erp5/instance-zope.cfg.in#L227-230

Thomas says that using private tmpfs for each test would be a better
solution than implementing tmpfs for whole /tmp on testnodes. He also
reports that @jp is OK to use namespaces for test as long as there is a
fallback if namespaces aren't available.

-> So let's do that: teach nxdtest to run each test case in its own
private environment with privately-mounted /tmp and /dev/shm if we can
detect that user namespaces are available. In an environment where user
namespaces are indeed available this addresses all 3 problems because
isolation and being-tmpfs are there by design, and even if some files
will leak, the kernel will free everything when test terminates and the
filesystem is automatically unmounted. We also detect such leakage and
report a warning so that such problems do not go completely unnoticed.

Implementation

We leverage unshare(1) for simplicity. I decided to preserve uid/gid
instead of becoming uid=0 (= `unshare -Umr`) for better traceability, so
that it is clear from test output under which real slapuser a test is
run(*). Not changing uid requires to activate ambient capabilities so
that mounting filesystems, including FUSE-based needed by wendelin.core,
continue to work under regular non-zero uid. Please see
https://git.kernel.org/linus/58319057b784 for details on this topic. And
please refer to added trun.py for details on how per-test namespace is setup.

Using FUSE inside user namespaces requires Linux >= 4.18 (see
https://git.kernel.org/linus/da315f6e0398 and
https://git.kernel.org/linus/8cb08329b080), so if we are really to use
this patch we'll have to upgrade kernel on our testnodes, at least where
wendelin.core is used in tests.

"no namespaces" detection is implemented via first running `unshare ...
true` with the same unshare options that are going to be used to create
and enter new user namespace for real. If that fails, we fallback into
"no namespaces" mode where no private /tmp and /dev/shm are mounted(%).

(*) for example nxdtest logs information about the system on startup:

    date:   Mon, 29 Nov 2021 17:27:04 MSK
    xnode:  slapuserX@test.node
    ...

(%) Here is how nxdtest is run in fallback mode on my Debian 11 with
    user namespaces disabled via `sysctl kernel.unprivileged_userns_clone=0`

    (neo) (z-dev) (g.env) kirr@deca:~/src/wendelin/nxdtest$ nxdtest
    date:   Thu, 02 Dec 2021 14:04:30 MSK
    xnode:  kirr@deca.navytux.spb.ru
    uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
    cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz

    >>> pytest
    $ python -m pytest
    # user namespaces not available. isolation and many checks will be deactivated.    <--- NOTE
    ===================== test session starts ======================
    platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
    rootdir: /home/kirr/src/wendelin/nxdtest
    plugins: timeout-1.4.2
    collected 23 items

    nxdtest/nxdtest_pylint_test.py ....                      [ 17%]
    nxdtest/nxdtest_pytest_test.py ...                       [ 30%]
    nxdtest/nxdtest_test.py ......xx                         [ 65%]
    nxdtest/nxdtest_unittest_test.py ........                [100%]

    ============= 21 passed, 2 xfailed in 2.67 seconds =============
    ok      pytest  3.062s  # 23t 0e 0f 0s
    # ran 1 test case:  1·ok

/helped-by @tomo
/helped-and-reviewed-by @jerome
/reviewed-on !13

a191468f

09 Dec, 2021 2 commits

Log that master is connected and for which test_result this run is · 4fe9ee16

Kirill Smelkov authored Dec 09, 2021

Also log if master told us that we have nothing to do, and if the mode to run is local.

This should make it a bit more clear what is going on just by looking at
nxdtest log. See previous patch for more details and context.

For the reference: here is how updated output looks like in the normal case:

    date:   Thu, 09 Dec 2021 04:20:37 CET
    xnode:  slapuser7@rapidspace-testnode-001
    uname:  Linux rapidspace-testnode-001 4.9.0-16-amd64 #1 SMP Debian 4.9.272-1 (2021-06-21) x86_64
    cpu:    Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
    # running for test_result_module/20211209-170FD3998

    >>> pytest
    $ python -m pytest
    ============================= test session starts ==============================
    platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.9.0, pluggy-0.13.1
    rootdir: /srv/slapgrid/slappart7/t/dfp/soft/47cc86af27d234f0464630f2a0d22a6f/parts/zodbtools-dev
    collected 46 items

    zodbtools/test/test_analyze.py .                                         [  2%]
    zodbtools/test/test_commit.py ..                                         [  6%]
    zodbtools/test/test_dump.py ...                                          [ 13%]
    zodbtools/test/test_restore.py ..                                        [ 17%]
    zodbtools/test/test_tidrange.py .............................            [ 80%]
    zodbtools/test/test_zodb.py .........                                    [100%]

    ========================== 46 passed in 9.15 seconds ===========================
    ok      pytest  12.433s # 46t 0e 0f 0s
    # ran 1 test case:  1·ok

/reviewed-by @jerome
/reviewed-on !15

4fe9ee16

Always log system info and run summary, even if master tells us to do nothing · f8ec5787

Kirill Smelkov authored Dec 09, 2021

Nxdtest logs system information (bd91f6f1 "Include system information
into log output") and run summary at the end (9f413221 "Emit run summary
at the end"). However all that information currently is printed only if
master is successfully connected and actually tells to run the tests.

This behaviour is not very useful, because if nxdtest log output on
testnode is empty, it is not clear - whether it was "we have to do
nothing", or nxdtest stuck somewhere or something else.

For example
https://nexedijs.erp5.net/#/test_result_module/20211208-47D165B7/12 is
currently marked as Running for many long hours already. And the log on
testnode regarding nxdtest run is just:

    2021-12-08 15:42:17,314 INFO     $ PATH=/srv/slapgrid/slappart13/srv/slapos/soft/2956f419073cb2249ed953507fa6b173/bin:/opt/slapos/parts/bison/bin:/opt/slapos/parts/bzip2/bin:/opt/slapos/parts/gettext/bin:/opt/slapos/parts/glib/bin:/opt/slapos/parts/libxml2/bin:/opt/slapos/parts/libxslt/bin:/opt/slapos/parts/m4/bin:/opt/slapos/parts/ncurses/bin:/opt/slapos/parts/openssl/bin:/opt/slapos/parts/pkgconfig/bin:/opt/slapos/parts/python2.7/bin:/opt/slapos/parts/readline/bin:/opt/slapos/parts/sqlite3/bin:/opt/slapos/parts/swig/bin:/opt/slapos/bin:/opt/slapos/parts/patch/bin:/opt/slapos/parts/socat/bin:/usr/bin:/usr/sbin:/sbin:/bin SLAPOS_TEST_LOG_DIRECTORY=/srv/slapgrid/slappart13/var/log/testnode/dgd-xStX9safSG SLAPOS_TEST_SHARED_PART_LIST=/srv/slapgrid/slappart13/srv/shared:/srv/slapgrid/slappart13/t/dgd/shared /bin/sh /srv/slapgrid/slappart13/t/dgd/i/0/bin/runTestSuite --master_url $DISTRIBUTOR_URL --revision slapos=13977-ec686a708633f689382426063c21efbe3b2eab04,slapos.core=8698-91edab77ed36c160da8017cfdc1673fe7a8e10de --test_node_title rapidspace-testnode-008-3Nodes-DEPLOYTASK0 --test_suite SLAPOS-SR-TEST --test_suite_title SlapOS.SoftwareReleases.IntegrationTest-kirr.Python2 --project_title 'Rapid.Space Project'

without anything else.

With this patch nxdtest would print system information and report how many
tests it had run, if its invocation did not stuck.

In this patch we only move code that calls system_info and defer summary log
before code that connects to master. In the following patch we'll add more
logging around connecting to master.

/reviewed-by @jerome
/reviewed-on nexedi/nxdtest!15

f8ec5787

11 Nov, 2021 1 commit

Emit run summary at the end · 9f413221

Kirill Smelkov authored Nov 10, 2021

Sometimes there is zero testcases to be executed on testnodes, and
log output from nxdtest is just

    date:   Wed, 10 Nov 2021 12:31:50 MSK
    xnode:  kirr@deca.navytux.spb.ru
    uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
    cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz

it is not clear from such output did the run ended or the test got
stuck. After this patch it becomes

    date:   Wed, 10 Nov 2021 12:31:50 MSK
    xnode:  kirr@deca.navytux.spb.ru
    uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
    cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
    # ran 0 test cases.

And in general, when there are several testcases to be run, it is helpful to
indicate end of such run and to print brief summary of result status for all
ran test cases. Example output:

    wendelin.core$ nxdtest -k test.wcfs
    date:   Wed, 10 Nov 2021 12:35:34 MSK
    xnode:  kirr@deca.navytux.spb.ru
    uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
    cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
    >>> test.wcfs/fs:1
    ...
    ok      test.wcfs/fs:1  25.035s # 35t 0e 0f 0s

    >>> test.wcfs/fs:2
    ...
    ok      test.wcfs/fs:2  21.033s # 35t 0e 0f 0s

    >>> test.wcfs/fs:
    ...
    ok      test.wcfs/fs:   21.056s # 35t 0e 0f 0s
    # ran 3 test cases:  3·ok

/reviewed-by @jerome
/reviewed-on !12

9f413221

13 Aug, 2021 2 commits

support parsing pylint output · 72e36088

Jérome Perrin authored Jul 07, 2021

This parses pylint output with a simple regexp and counts one failure per
message reported.

This has not been tested yet, but we decided to apply this commit already.

/acked-by @kirr
/reviewed-on !11

72e36088

loadNXDTestFile: use `compile` for better tracebacks on errors · 7b5add47

Jérome Perrin authored Jul 19, 2021

When using compile with the actual file path, we can have better tracebacks
in case of errors.

before:

    Traceback (most recent call last):
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module>
        sys.exit(nxdtest.main())
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main
        tenv = loadNXDTestFile('.nxdtest')
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile
        six.exec_(src, g)
      File "<string>", line 77, in <module>
    NameError: name 'Pylint' is not defined

after:

    Traceback (most recent call last):
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module>
        sys.exit(nxdtest.main())
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main
        tenv = loadNXDTestFile('.nxdtest')
      File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile
        six.exec_(compile(src, os.path.realpath(path), 'exec'), g)
      File "/srv/slapgrid/slappart3/srv/runner/instance/slappart8/var/nxdtest/.nxdtest", line 77, in <module>
        summaryf=Pylint.summary,
    NameError: name 'Pylint' is not defined

/reviewed-by @kirr
/reviewed-in !10

7b5add47

12 Aug, 2021 1 commit

Detect if a test leaks processes and terminate them · 0ad45a9c

Kirill Smelkov authored Aug 11, 2021

For every TestCase nxdtest spawns test process to run with stdout/stderr
redirected to pipes that nxdtest reads. Nxdtest, in turn, tees those
pipes to its stdout/stderr until the pipes become EOF. If the test
process, in turn, spawns other processes, those other processes will
inherit opened pipes, and so the pipes won't become EOF untill _all_
spawned test processes (main test process + other processes that it
spawns) exit. Thus, if there will be any process, that the main test
process spawned, but did not terminated upon its own exit, nxdtest will
get stuck waiting for pipes to become EOF which won't happen at all if a
spawned test subprocess persists not to terminate.

I hit this problem for real on a Wendelin.core 2 test - there the main
test processes was segfaulting and so did not instructed other spawned
processes (ZEO, WCFS, ...) to terminate. As the result the whole test
was becoming stuck instead of being promptly reported as failed:

    runTestSuite: Makefile:175: recipe for target 'test.wcfs' failed
    runTestSuite: make: *** [test.wcfs] Segmentation fault
    runTestSuite: wcfs: 2021/08/09 17:32:09 zlink [::1]:52052 - [::1]:23386: recvPkt: EOF
    runTestSuite: E0809 17:32:09.376800   38082 wcfs.go:2574] zwatch zeo://localhost:23386: zlink [::1]:52052 - [::1]:23386: recvPkt: EOF
    runTestSuite: E0809 17:32:09.377431   38082 wcfs.go:2575] zwatcher failed -> switching filesystem to EIO mode (TODO)
    <LONG WAIT>
    runTestSuite: PROCESS TOO LONG OR DEAD, GOING TO BE TERMINATED

-> Fix it.

/reviewed-by @jerome
/reviewed-on !9

0ad45a9c

01 Dec, 2020 1 commit

use re.search to filter tests in --run · b5a74214

Jérome Perrin authored Nov 30, 2020

re.match only find matches where the pattern appears at the beginning of
the string, whereas re.search matches if the pattern appears anywhere in
the string. This is behavior is consistent with pytest, go test and ERP5's
runUnitTest

For more details, see the discussion from !6 (comment 121409)

/reviewed-on: !8

b5a74214

26 Nov, 2020 1 commit

Switch tee from threading.Thread to sync.WorkGroup · 1e6a1cc6

Kirill Smelkov authored Nov 24, 2020

The reason is that with threading.Thread if exception happens in that
spawned thread, this error is not propagated to main driver, while with
sync.WorkGroup an exception from any spawned worker is propagated back
to main. For example with the following injected error

    --- a/nxdtest/__init__.py
    +++ b/nxdtest/__init__.py
    @@ -267,6 +267,7 @@ def main():

     # tee, similar to tee(1) utility, copies data from fin to fout appending them to buf.
     def tee(ctx, fin, fout, buf):
    +    1/0
         while 1:

before this patch nxdtest behaves like ...

    (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest
    date:   Tue, 24 Nov 2020 14:55:08 MSK
    xnode:  kirr@deco.navytux.spb.ru
    uname:  Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64
    cpu:    Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

    >>> pytest
    $ python -m pytest
    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
        1/0
    ZeroDivisionError: integer division or modulo by zero

    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
        1/0
    ZeroDivisionError: integer division or modulo by zero

    error   pytest  0.583s  # 1t 1e 0f 0s
    (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $?
    0

Here the error in another thread is only printed, but nxdtest is not aborted.
Above it reported "error", but e.g. when testing pygolang/py3 and raising an
error in tee it even reported it was succeeding
( !6 (comment 121393) ):

    slapuser34@vifibcloud-rapidspace-hosting-007:~/srv/runner/instance/slappart0$ ./bin/runTestSuite
    date:   Tue, 24 Nov 2020 12:51:23 MSK
    xnode:  slapuser34@vifibcloud-rapidspace-hosting-007
    uname:  Linux vifibcloud-rapidspace-hosting-007 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
    cpu:    Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz

    >>> thread
    $ python -m pytest
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner
        self.run()
      File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee
        fout.write(data)
    TypeError: write() argument must be str, not bytes

    ok      thread  9.145s  # 1t 0e 0f 0s

    >>> gevent
    $ gpython -m pytest
    Exception in thread Thread-3:
    Traceback (most recent call last):
      File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner
        self.run()
      File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run
        self._target(*self._args, **self._kwargs)
      File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee
        fout.write(data)
    TypeError: write() argument must be str, not bytes

    ok      gevent  21.980s # 1t 0e 0f 0s

After this patch nxdtest correctly handles and propagates an error originating
in spawned thread back to main driver:

    (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest
    date:   Tue, 24 Nov 2020 14:54:19 MSK
    xnode:  kirr@deco.navytux.spb.ru
    uname:  Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64
    cpu:    Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

    >>> pytest
    $ python -m pytest
    Traceback (most recent call last):
      File "/home/kirr/src/wendelin/venv/z4-dev/bin/nxdtest", line 11, in <module>
        load_entry_point('nxdtest', 'console_scripts', 'nxdtest')()
      File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 230, in main
        wg.wait()
      File "golang/_sync.pyx", line 237, in golang._sync.PyWorkGroup.wait
        pyerr_reraise(pyerr)
      File "golang/_sync.pyx", line 217, in golang._sync.PyWorkGroup.go.pyrunf
        f(pywg._pyctx, *argv, **kw)
      File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
        1/0
    ZeroDivisionError: integer division or modulo by zero
    (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $?
    1

NOTE sync.WorkGroup requires every worker to handle context cancellation, so
that whenever there is an error, all other workers are canceled. We add such
cancellation handling to tee but only lightly: before going to block in
read/write syscalls we check for whether ctx is canceled or not. However the
proper handling would be to switch file descriptors into non-block mode and to
select at every IO point on both potential IO events and potential
cancellation. This is left as TODO for the future.

/reviewed-on !7

1e6a1cc6

24 Nov, 2020 5 commits

Unittest and Python3 support · 40e2c4ab

Jérome Perrin authored Nov 24, 2020

These are the necessary changes to run `SlapOS.Eggs.UnitTest-*` and `SlapOS.SoftwareReleases.IntegrationTest-*` using nxdtest

See merge request !6

/reviewed-by @kirr

40e2c4ab

Flush output right after printing running test name · a129b560

Jérome Perrin authored Nov 18, 2020

If test program output on stderr (which is unbuffered), the
output of the test program will appear before output from nxdtest
advertising the program that is about to be executed, because nxdtest
stdout is buffered (testnode does not set PYTHONUNBUFFERED, and eventhough
nxdtest sets PYTHONUNBUFFERED in its own environ, this only applies to sub
processes)

a129b560

Support output from unittest module from python standard library · bca50060
Jérome Perrin authored Nov 17, 2020

bca50060
Also pass stderr output to summary method · 53064e71
Jérome Perrin authored Nov 17, 2020
```
While pytest sends everything in stdout, some other programs send on stderr.
```
53064e71

Treat program output as binary for python3 support · beb9d47e

Jérome Perrin authored Nov 17, 2020

While treating output as text would not really be impossible, treating it
as bytes seems a better choice because:
 - we don't have to make assumptions about what output encoding the test
   program is using for output
 - `tee` can just read stream output bytes by bytes without having to worry
   about multi-bytes characters
 - testnode protocol uses xmlrpc.client.Binary, which uses bytes.

Because using bufsize=1 implies reading subprocess output as text, we use
bufsize=0 instead in the subprocess.Popen call, to prevent buffering.

To make manipulation of strings and bytes easier, we add a dependency on
pygolang, so that we can use its strings utility functions.

Also add a few tests to verify general functionality.

beb9d47e

09 Nov, 2020 1 commit

PyTest: summary: Skip trail text after test summary · d829e9ca

Kirill Smelkov authored Nov 08, 2020

For example wendelin.core 2 uses pytest_unconfigure to unmount wcfs
servers that it spawned:

https://lab.nexedi.com/kirr/wendelin.core/blob/1be4d730/conftest.py#L27-53

This usually leads to some test as shown in added test.

-> Skip that text backwards and actually detect pytest summary line
instead of returning empty dict, because empty dict forces Nexedi ERP5
to treat such test run to have UNKNOWN status, e.g.

https://erp5.nexedi.net/test_result_module/20201108-C170850/11

/cc @jerome
/reviewed-on !5

d829e9ca

02 Nov, 2020 1 commit

Test nxdtest via nxdtest · 549cc2f6

Kirill Smelkov authored Oct 30, 2020

This allows to establish tests for Nxdtest itself under Nexedi testing
infrastructure.

/cc @jerome
/reviewed-on !4

549cc2f6

30 Oct, 2020 2 commits

Move to new home · 7417e7cf

Kirill Smelkov authored Oct 29, 2020

lab.nexedi.com/kirr/nxdtest -> lab.nexedi.com/nexedi/nxdtest

slapos!839 (comment 119120)
slapos!839 (comment 119142)

/cc @jerome
/reviewed-on !3

7417e7cf

Modularize + Start of tests · 8b32fd7a
Kirill Smelkov authored Oct 30, 2020
```
/cc @jerome
/reviewed-on !1
```
8b32fd7a

28 Oct, 2020 1 commit
- Start of tests · f91a050d
  Kirill Smelkov authored Oct 20, 2020
```
Verify pytest summary-parsing functionality.
```
  f91a050d
20 Oct, 2020 1 commit

Modularize · f74005e4

Kirill Smelkov authored Oct 20, 2020

Create real nxdtest module, so that things could be imported from there.
As the result switch `nxdtest` program from scripts to entry_point.

f74005e4

08 Oct, 2020 1 commit

py.test also can generate "errors" · 2c29b2c9

Kirill Smelkov authored Oct 08, 2020

Here is e.g. corresponding summary:

    ===== 12 failed, 14 passed, 4 xfailed, 12 error in 19.58 seconds =====

2c29b2c9

07 Oct, 2020 1 commit

Include system information into log output · bd91f6f1

Kirill Smelkov authored Oct 07, 2020

When checking logs on testnode, it is very useful to know which kernel
version is running inside (wendelin.core uses FUSE and depends on having
patched FUSE module for kernels < 5.2), which CPU is there, etc.

Format of printed output follows go benchmarking format and coincides
with start of `neotest info-local` (see footers on figures in
http://navytux.spb.ru/~kirr/neo.html and here:
https://lab.nexedi.com/kirr/neo/blob/1f26678b/go/neo/t/neotest#L585-874)

bd91f6f1