GNU bug report logs

#61839 'guix offload' occasionally hangs on berlin, hitting max-silent-time

PackageSource(s)Maintainer(s)
guix PTS Buildd Popcon
Reply or subscribe to this bug. View this bug as an mbox, status mbox, or maintainer mbox

Report forwarded to othacehe@gnu.org, bug-guix@gnu.org:
bug#61839; Package guix. (Mon, 27 Feb 2023 10:38:02 GMT) (full text, mbox, link).


Acknowledgement sent to Ludovic Courtès <ludo@gnu.org>:
New bug report received and forwarded. Copy sent to othacehe@gnu.org, bug-guix@gnu.org. (Mon, 27 Feb 2023 10:38:02 GMT) (full text, mbox, link).


Message #5 received at submit@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: bug-guix@gnu.org
Subject: [Cuirass] Build timeouts during evaluations
Date: Mon, 27 Feb 2023 11:37:49 +0100
Hello Guix!

We often get evaluation errors for ‘core-updates’, where the build log
reveals that this is due to a build timeout while building the Guix
instance that will be used for the evaluation.  Example:

--8<---------------cut here---------------start------------->8---
building of `/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/bl7aq8999lmp6flbyw3k656m522jmcl3-git-2.39.1.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/ldajjyvb4np1zq9l9s09j0dpsbpscpan-gnulib-2022-12-06-1.440b528.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/fdks470j7xjw3dc6fiyw15hdsxj6cdbc-guile-gnutls-3.7.11.drv': 1 dependencies couldn't be built
cannot build derivation `/gnu/store/71pl6a38bd0b81qgi45wz2ymvlv0x7xn-guix-cli-core.drv': 1 dependencies couldn't be built

[...]

cannot build derivation `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv': 1 dependencies couldn't be built

[...]

ERROR:
  1. &store-protocol-error:
      message: "build of `/gnu/store/m7r73wypxhhmv8cqza1x8g2zyq9k1xpg-profile.drv' failed"
      status: 101
--8<---------------cut here---------------end--------------->8---

(From <https://ci.guix.gnu.org/eval/233402/log/raw>.)

Likewise for the ‘guile’ jobset:

--8<---------------cut here---------------start------------->8---
exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence
cannot build derivation `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv': 1 dependencies couldn't be built
Backtrace:
In ice-9/boot-9.scm:
  1752:10 12 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
          11 (apply-smob/0 #<thunk 7fd8cbe13340>)
In ice-9/boot-9.scm:
    724:2 10 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
    619:8  9 (_ #(#(#<directory (guile-user) 7fd8cbe16c80>)))
In cuirass/ui.scm:
   104:10  8 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
   105:27  7 (cuirass-evaluate _)
In ice-9/boot-9.scm:
  1752:10  6 (with-exception-handler _ _ #:unwind? _ # _)
In guix/store.scm:
   658:37  5 (thunk)
  2123:25  4 (run-with-store #<store-connection 256.99 7fd8cad09dc0> ?)
In cuirass/scripts/evaluate.scm:
     87:8  3 (_ _)
In guix/store.scm:
  1995:38  2 (_ #<store-connection 256.99 7fd8a33e6d70>)
  1421:15  1 (_ #<store-connection 256.99 7fd8a33e6d70> _ _)
In ice-9/boot-9.scm:
  1685:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
ERROR:
  1. &store-protocol-error:
      message: "build of `/gnu/store/nrcsvww8j24rjzyin90w76qgbkgphhvp-profile.drv' failed"
      status: 101
--8<---------------cut here---------------end--------------->8---

Building this particular derivation is quick though:

--8<---------------cut here---------------start------------->8---
$ time guix build /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv --check
The following derivation will be built:
  /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
building /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv...
(repl-version 0 1 1)
(values (value "/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile/share/guile/site/3.0"))
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
successfully built /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
/gnu/store/rgk1sc2zvllf7cgbyj5g28b9v2x6nvn3-guile

real    0m1.253s
user    0m0.751s
sys     0m0.034s
--8<---------------cut here---------------end--------------->8---

So the problem must be elsewhere.

One thing is that this particular .drv is quite big:

--8<---------------cut here---------------start------------->8---
$ guix size /gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv
store item                                                       total    self
/gnu/store/wjr1sqxci75jfkvwzmmvlgypkyp3fvqi-guix-17bd024           105.0   105.0  42.9%
/gnu/store/6b3fp2kwqxw7cvnh6jz9g49n9k2n5avy-guix-packages-base-source    36.2    36.2  14.8%
/gnu/store/h3hjni2bd8f5kbn8w0g2xx2jnagfynnr-guile-e2ed33e           29.0    29.0  11.8%
/gnu/store/gfb26xn5f2qyrxzv847cd6iql6h8igh0-packages                23.1    23.1   9.4%
/gnu/store/dj8rq4jd8qgk4k1hzg8djcvi6hlbdb7m-doc                     22.3    22.3   9.1%
/gnu/store/a5z192qxni40ly92cknwxkpnxc8zxfgf-guix                     7.8     7.8   3.2%
/gnu/store/f8wg2y3rrh4dgi6znjrv8a1l4ymkc7rb-guix-packages-source     2.8     2.8   1.2%
/gnu/store/8iwm8x6472xnf7g6hfxzw029gxm1qbkq-guix-system-source       2.5     2.5   1.0%
/gnu/store/2dxbf11shgvs1cgvzz6wpnfra97f0sfw-guix-extra-source        2.5     2.5   1.0%
/gnu/store/q57az26zdgafic5b8xsbxlh4iwddvr86-doc                      2.1     2.1   0.9%
[…]
/gnu/store/0mxnx8l4fgigvd7gakwdk6hc6im4wnai-disarchive-mirrors       0.0     0.0   0.0%
total: 244.8 MiB
--8<---------------cut here---------------end--------------->8---

Merely transferring it for offloading takes time; depending on machine
load, perhaps it can hit the 1h max-silent timeout?  That’d mean the
throughput is below 70 KiB/s.  Plausible?

The closure of
/gnu/store/x9dyzywwmsp8vmpvpx54yhcnz7rqin0l-asciidoc-9.1.0.drv is
3.0 MiB and it builds in 4 seconds, so there must be something else
going on, perhaps a bug in libssh/Guile-SSH leading to stale
connections?

Thoughts?

Ludo’.




Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo@gnu.org> to control@debbugs.gnu.org. (Wed, 01 Mar 2023 15:31:02 GMT) (full text, mbox, link).


Information forwarded to bug-guix@gnu.org:
bug#61839; Package guix. (Fri, 03 Mar 2023 14:25:02 GMT) (full text, mbox, link).


Message #10 received at 61839@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 61839@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>
Subject: Re: bug#61839: [Cuirass] Build timeouts during evaluations
Date: Fri, 03 Mar 2023 15:24:16 +0100
Ludovic Courtès <ludo@gnu.org> skribis:

> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

I experienced something similar while building from the command line on
berlin, so it looks like an offloading problem.

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#61839; Package guix. (Fri, 17 Mar 2023 13:22:02 GMT) (full text, mbox, link).


Message #13 received at 61839@debbugs.gnu.org (full text, mbox, reply):

From: Andreas Enge <andreas@enge.fr>
To: Ludovic Courtès <ludo@gnu.org>, 61839@debbugs.gnu.org
Subject: Re: Offloading problems on berlin
Date: Fri, 17 Mar 2023 14:21:03 +0100
Am Thu, Mar 16, 2023 at 02:50:53PM +0100 schrieb Ludovic Courtès:
> I occasionally offload from my laptop to machines at work but didn’t hit
> this bug, so we’ll have to see if it’s specific to berlin or not.

My impression is that in my case, the problem is not with ssh, but somehow
with the guix daemon. I tend to observe this:

guix offload: sending 1 store item (6,490 MiB) to '141.80.167.169'...
exporting path `/gnu/store/y4ipvkapf1gninaabwdl6pcz46c1frak-texlive-texmf-20210325'

Then jnettop shows data transfer of about 100 to 120MB/s during a bit over
a minute, which makes me think that the actual data transfer succeeds, but
that somehow the data is not registered in the store.
Then nothing tangible happens and offloading times out. Since I am
apparently not able to connect to the machines via ssh, I cannot have a
closer look at the receiving end.

Andreas





Changed bug title to ''guix offload' occasionally hangs on berlin, hitting max-silent-time' from '[Cuirass] Build timeouts during evaluations' Request was from Ludovic Courtès <ludo@gnu.org> to control@debbugs.gnu.org. (Fri, 07 Apr 2023 15:50:02 GMT) (full text, mbox, link).


Information forwarded to bug-guix@gnu.org:
bug#61839; Package guix. (Mon, 17 Apr 2023 16:38:01 GMT) (full text, mbox, link).


Message #18 received at 61839@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 61839@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>, Andreas Enge <andreas@enge.fr>
Subject: Re: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
Date: Mon, 17 Apr 2023 18:37:36 +0200
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation.  Example:

[...]

> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out after 3600 seconds of silence

With help from Andreas, I was able to reproduce it on berlin by running:

  guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv

Timeout happens while ‘guix offload’ is sending build requisites to the
target machine.  At that point, the client ‘guix offload’ process is
stuck like so:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2  0x00007f1fcc221f0c in ssh_handle_packets () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3  0x00007f1fcc221fdb in ssh_handle_packets_termination ()
   from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4  0x00007f1fcc2034bb in channel_write_common () from target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5  0x00007f1fcc28b15b in ?? () from target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6  0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7  0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
    at r6rs-ports.c:676
#8  0x00007f1fc5ac5197 in ?? ()
#9  0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, handler_data=handler_data@entry=0x7fffd2c0d220,
    thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>,
    handler_data=<optimized out>, pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
    pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
    body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized out>, main_func=<optimized out>, closure=<optimized out>)
    at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at ../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 'socket:[3238975699]'
--8<---------------cut here---------------end--------------->8---

The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:

--8<---------------cut here---------------start------------->8---
write(2, "linking \342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231 to \342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n", 221
--8<---------------cut here---------------end--------------->8---

(This message comes from optimise-store.cc:192.)

This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call.  Its
parent ‘sshd’ process is stuck in:

--8<---------------cut here---------------start------------->8---
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C
--8<---------------cut here---------------end--------------->8---

One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.

However, that “linking …” message is supposed to be written only at
non-default debugging levels:

    printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);

Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)

The immediate fix is to undo that—I’ll do that later.

The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.

To be continued…

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#61839; Package guix. (Mon, 17 Apr 2023 22:56:01 GMT) (full text, mbox, link).


Message #21 received at 61839@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 61839@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>, Andreas Enge <andreas@enge.fr>
Subject: Re: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
Date: Tue, 18 Apr 2023 00:55:22 +0200
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> Turns out that guix-daemon on build machines is running with ‘--debug’
> starting with maintenance.git commit
> 72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
> Enable guix-daemon debug output.”)
>
> The immediate fix is to undo that—I’ll do that later.

Done:

  https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=cbe75edae348bf0aa235e93ebeb58fe9db14fa7a

I redeployed the build machines (on the same Guix commit as before).
The 4 machines that host a childhurd are still being redeployed though.

> The other one is to swallow ‘current-error-port’ in
> ‘store-import-channel’.

Done:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=166a3e3fdeaa5279b2dd5000088f913c05af9558

Ludo’.




Send a report that this bug log contains spam.


debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Sun Sep 8 05:06:28 2024; Machine Name: wallace-server

GNU bug tracking system

Debbugs is free software and licensed under the terms of the GNU Public License version 2. The current version can be obtained from https://bugs.debian.org/debbugs-source/.

Copyright © 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson, 2005-2017 Don Armstrong, and many other contributors.