GNU bug report logs

#31785 Multiple client 'build-paths' RPCs can lead to daemon deadlock

PackageSource(s)Maintainer(s)
guix PTS Buildd Popcon
Reply or subscribe to this bug. View this bug as an mbox, status mbox, or maintainer mbox

Report forwarded to bug-guix@gnu.org:
bug#31785; Package guix. (Mon, 11 Jun 2018 14:07:02 GMT) (full text, mbox, link).


Acknowledgement sent to ludo@gnu.org (Ludovic Courtès):
New bug report received and forwarded. Copy sent to bug-guix@gnu.org. (Mon, 11 Jun 2018 14:07:03 GMT) (full text, mbox, link).


Message #5 received at submit@debbugs.gnu.org (full text, mbox, reply):

From: ludo@gnu.org (Ludovic Courtès)
To: bug-guix@gnu.org
Subject: Multiple client 'build-paths' RPCs can lead to daemon deadlock
Date: Mon, 11 Jun 2018 16:06:16 +0200
Hello,

I tried running this:

  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort) & \
  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -r)

… also in parallel with this (for good measure):

  guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -R)

Since we have 3 clients, that leads to 3 guix-daemon processes, and
those are stuck in a deadlock:

--8<---------------cut here---------------start------------->8---
$ ps aux | grep guix-daemon
[…]
root     20501  0.6  0.0  39292 12012 ?        Ss   15:51   0:01 guix-daemon 20455
root     20503  4.3  0.0  39420 12096 ?        Ss   15:51   0:12 guix-daemon 20491
root     22154  0.0  0.0  39028 11016 ?        Ss   15:52   0:00 guix-daemon 22148
$ sudo strace -p 22154 -t
strace: Process 22154 attached
15:57:03 select(0, 0x7ffdca30b610, NULL, NULL, {tv_sec=1, tv_usec=314189}) = 0 (Timeout)
15:57:04 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 8
15:57:04 fcntl(8, F_GETFD)              = 0
15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:57:04 close(8)                       = 0
15:57:04 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 8
^C
$ sudo strace -p 20503 -t
strace: Process 20503 attached
15:58:23 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=1, tv_usec=595302}) = 0 (Timeout)
15:58:25 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:25 fcntl(14, F_GETFD)             = 0
15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:25 close(14)                      = 0
15:58:25 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
15:58:30 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 14
15:58:30 fcntl(14, F_GETFD)             = 0
15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 0
15:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:58:30 close(14)                      = 0
15:58:30 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20503 detached
$ sudo strace -p 20501 -t
strace: Process 20501 attached
15:59:09 select(121, [20 30 120], NULL, NULL, {tv_sec=0, tv_usec=535316}) = 0 (Timeout)
15:59:10 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 8
15:59:10 fcntl(8, F_GETFD)              = 0
15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC)  = 0
15:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)
15:59:10 close(8)                       = 0
15:59:10 select(121, [20 30 120], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20501 detached
$ sudo lsof | grep '/gnu/store/.*\.lock'
guix-daem 20501             root   14uW     REG              253,0          0    3150879 /gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock
guix-daem 20501             root   26uW     REG              253,0          0    3212618 /gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock
guix-daem 20501             root  116uW     REG              253,0          0    3212684 /gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock
guix-daem 20503             root    8uW     REG              253,0          0    3212520 /gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock
guix-daem 20503             root   50uW     REG              253,0          0    3212514 /gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock
guix-daem 20503             root  155uW     REG              253,0          0    3212547 /gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock
guix-daem 20503             root  173uW     REG              253,0          0    3212552 /gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock
guix-daem 20503             root  317uW     REG              253,0          0    3212602 /gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock
guix-daem 20503             root  407uW     REG              253,0          0    3212655 /gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock
guix-daem 20503             root  410uW     REG              253,0          0    3212659 /gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock
guix-daem 20503             root  440uW     REG              253,0          0    3212686 /gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock
--8<---------------cut here---------------end--------------->8---

So for example:

  • 20501 tries to acquire
    aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock,
    which is held by 20503;

  • 20503 tries to acquire
    amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by
    20501.

This comes from the fact that ‘LocalStore::buildPaths’ takes the
user-supplied derivation list as is, without sorting it, and then
acquires locks in that order in ‘Worker::run’.

A topological sort (or maybe an alphanumeric sort?) should allow us to
guarantee that guix-daemon processes take locks in the same order, and
then don’t end up in a deadlock.

I discovered this bug while monitoring Cuirass on berlin: several
sessions submit batches of 200 derivations in ‘build-paths’ RPCs, and
sometimes most of the corresponding guix-daemon processes would end up
being stuck in a lock-acquiring loop.

Ludo’.




Severity set to 'important' from 'normal' Request was from ludo@gnu.org (Ludovic Courtès) to control@debbugs.gnu.org. (Mon, 11 Jun 2018 15:43:02 GMT) (full text, mbox, link).


Information forwarded to bug-guix@gnu.org:
bug#31785; Package guix. (Fri, 06 Sep 2019 09:06:01 GMT) (full text, mbox, link).


Message #10 received at 31785@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 31785@debbugs.gnu.org
Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
Date: Fri, 06 Sep 2019 11:04:43 +0200
Here’s another example from berlin today:

--8<---------------cut here---------------start------------->8---
ludo@berlin$ sudo guix processes
SessionPID: 38649
ClientPID: 38611
ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux nmoldyn r-msnid r-yamss netcdf-fortran python-netcdf4
LockHeld: /gnu/store/v6hcm1gvv3gdfbnq6cd4kn1i6ip3y74s-netcdf-fortran-4.4.4.lock
LockHeld: /gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock
ChildProcess: 33181: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000

[...]

SessionPID: 44043
ClientPID: 44023
ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux netcdf-fortran r-msnid nmoldyn r-yamss
LockHeld: /gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock
LockHeld: /gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock
ChildProcess: 48997: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000

ludo@berlin ~$ sudo strace -p 38649 -e openat,fcntl
strace: Process 38649 attached
openat(AT_FDCWD, "/gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
openat(AT_FDCWD, "/gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
^Cstrace: Process 38649 detached

ludo@berlin ~$ sudo strace -p 44043 -e openat,fcntl
strace: Process 44043 attached
openat(AT_FDCWD, "/gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock", O_RDWR|O_CREAT, 0600) = 10
fcntl(10, F_GETFD)                      = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)
^Cstrace: Process 44043 detached
--8<---------------cut here---------------end--------------->8---

A very palpable deadlock.

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#31785; Package guix. (Wed, 04 Nov 2020 15:22:02 GMT) (full text, mbox, link).


Message #13 received at 31785@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 31785@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>
Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
Date: Wed, 04 Nov 2020 16:21:39 +0100
Hi,

ludo@gnu.org (Ludovic Courtès) skribis:

> This comes from the fact that ‘LocalStore::buildPaths’ takes the
> user-supplied derivation list as is, without sorting it, and then
> acquires locks in that order in ‘Worker::run’.

This diagnostic is incorrect: ‘Goals’ is a set sorted according to
‘CompareGoalPtrs’, which is lexical sort that arranges so substitution
goals come before derivation goals.  Thus, ‘_topGoals’ and ‘awake2’ in
Worker::run are sorted in a deterministic fashion.


The problem is that ‘Worker::waitForAWhile’ reshuffles the order of
goals by temporarily moving goals out of the way.  This can happen when
offloading replies “postpone”, which is inherently non-deterministic
(which goals are put to sleep will vary from one session to another
session.)

When those goals are eventually woken up from ‘Worker::waitForInput’,
they’re reprocessed, in sorted order, but potentially with “holes”
compared to other ‘guix-daemon’ processes.

That’s only a partial explanation; we need to go further to come up with
an actual deadlock scenario.

Ludo’.




Send a report that this bug log contains spam.


debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Sun Sep 8 03:30:24 2024; Machine Name: wallace-server

GNU bug tracking system

Debbugs is free software and licensed under the terms of the GNU Public License version 2. The current version can be obtained from https://bugs.debian.org/debbugs-source/.

Copyright © 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson, 2005-2017 Don Armstrong, and many other contributors.