GNU bug report logs

#54447 cuirass: missing derivation error


Report forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Fri, 18 Mar 2022 12:37:02 GMT) (full text, mbox, link).


Acknowledgement sent to Mathieu Othacehe <othacehe@gnu.org>:
New bug report received and forwarded. Copy sent to bug-guix@gnu.org. (Fri, 18 Mar 2022 12:37:02 GMT) (full text, mbox, link).


Message #5 received at submit@debbugs.gnu.org (full text, mbox, reply):

From: Mathieu Othacehe <othacehe@gnu.org>
To: bug-guix@gnu.org
Subject: cuirass: missing derivation error
Date: Fri, 18 Mar 2022 13:36:56 +0100
Hello,

A lot of builds, among them ~20 system tests[1], are failing with:
"cannot build missing derivation
?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
errors.

Those derivations are present on the CI head node. This means that the
errors occur during substitution. This is most likely caused by some
issue with the publish server, because:

- The publish server serves a 404 error. We should get rid once and for
  all of this 404 thing, pushing something like:
  https://issues.guix.gnu.org/50040.

or

- The publish server is not fast enough and hits an Nginx timeout that
  closes the communication.

Any other cause I could be missing?

Thanks,

Mathieu

[1]: https://ci.guix.gnu.org/eval/159975?status=failed




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Wed, 10 Aug 2022 09:44:02 GMT) (full text, mbox, link).


Message #8 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxime Devos <maximedevos@telenet.be>
To: 54447@debbugs.gnu.org
Subject: cuirass: missing derivation error
Date: Wed, 10 Aug 2022 11:43:33 +0200
Here's another instance: https://ci.guix.gnu.org/eval/528710


Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Wed, 10 Aug 2022 15:31:01 GMT) (full text, mbox, link).


Message #11 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxime Devos <maximedevos@telenet.be>
To: 54447@debbugs.gnu.org
Subject: Re: cuirass: missing derivation error
Date: Wed, 10 Aug 2022 17:30:37 +0200
On 10-08-2022 11:43, Maxime Devos wrote:
> Here's another instance: https://ci.guix.gnu.org/eval/528710
>
More information:

 * non-ASCII does not seem to be set up (see: ?) (looks irrelevant)
 * there are connection failures

Log:

> substitute:
> substitute: updating substitutes from 'http://141.80.167.131'...   0.0%
> guix substitute: warning: 141.80.167.131: connection failed: Connection refused
> substitute:
> cannot build missing derivation ?/gnu/store/4gqj2byvj9zz30wzvwkbijpya3vn1bjw-rust-dogged-0.2.0.drv?

Greetings,
Maxime.

Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sat, 10 Dec 2022 10:58:01 GMT) (full text, mbox, link).


Message #14 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sat, 10 Dec 2022 11:57:38 +0100
Mathieu Othacehe <othacehe@gnu.org> skribis:

> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
> errors.
>
> Those derivations are present on the CI head node. This means that the
> errors occur during substitution. This is most likely caused by some
> issue with the publish server, because:
>
> - The publish server serves a 404 error. We should get rid once and for
>   all of this 404 thing, pushing something like:
>   https://issues.guix.gnu.org/50040.
>
> or
>
> - The publish server is not fast enough and hits an Nginx timeout that
>   closes the communication.

Also being discussed at <https://issues.guix.gnu.org/48468#12>.

Ludo’.




Severity set to 'important' from 'normal' Request was from Ludovic Courtès <ludo@gnu.org> to control@debbugs.gnu.org. (Sat, 10 Dec 2022 10:58:02 GMT) (full text, mbox, link).


Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Tue, 22 Aug 2023 03:39:02 GMT) (full text, mbox, link).


Message #19 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Mon, 21 Aug 2023 23:38:41 -0400
Hello,

Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
> errors.
>
> Those derivations are present on the CI head node. This means that the
> errors occur during substitution. This is most likely caused by some
> issue with the publish server, because:
>
> - The publish server serves a 404 error. We should get rid once and for
>   all of this 404 thing, pushing something like:
>   https://issues.guix.gnu.org/50040.
>
> or
>
> - The publish server is not fast enough and hits an Nginx timeout that
>   closes the communication.
>
> Any other cause I could be missing?

Looking at several recent 'cannot build missing derivation' build
failures on Cuirass, I see for example:

--8<---------------cut here---------------start------------->8---
substitute: 
substitute: updating substitutes from 'http://141.80.167.131'...   0.0%
substitute: could not fetch http://141.80.167.131/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo 504
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
cannot build missing derivation ?/gnu/store/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym-python-asdf-standard-1.0.3.drv?
--8<---------------cut here---------------end--------------->8---

So it seems the error originated from guix-publish being under too
heavy a load to produce a timely reply, and the nginx proxy issued a 504
(timeout) error response.
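
For illustration only, here is a rough Python sketch of how a build log
like the excerpt above could be scanned to pair each "cannot build
missing derivation" error with the HTTP status of the failed narinfo
fetch for the same store hash.  The function name and both regular
expressions are assumptions derived from this one excerpt; they are not
part of Guix or Cuirass.

```python
import re

# Both patterns are guesses based on the log excerpt quoted above.
# A store hash is assumed to be 32 lowercase alphanumeric characters.
FETCH_FAILURE = re.compile(
    r"could not fetch \S*/([0-9a-z]{32})\.narinfo (\d{3})")
MISSING_DRV = re.compile(
    r"cannot build missing derivation \S?(/gnu/store/([0-9a-z]{32})-\S+?\.drv)")


def correlate(log_text):
    """Map each missing .drv to the status of its failed narinfo fetch.

    The value is None when no fetch failure was logged for that hash.
    """
    statuses = {m.group(1): int(m.group(2))
                for m in FETCH_FAILURE.finditer(log_text)}
    return {m.group(1): statuses.get(m.group(2))
            for m in MISSING_DRV.finditer(log_text)}
```

Feeding it the excerpt above would associate the python-asdf-standard
derivation with status 504; a derivation with no logged fetch failure
maps to None, which hints at a different cause.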

Looking into /var/log/guix-publish.log for a corresponding entry, I
found:

--8<---------------cut here---------------start------------->8---
2023-08-21 23:59:35 GET /rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo
2023-08-21 23:59:35 In web/server/http.scm:
2023-08-21 23:59:35     159:7  2 (http-write #<<http-server> socket: #<input-output: fi…> …)
2023-08-21 23:59:35 In unknown file:
2023-08-21 23:59:35            1 (put-bytevector #<input-output: socket 42> #vu8(83 # …) …)
2023-08-21 23:59:35 In ice-9/boot-9.scm:
2023-08-21 23:59:35   1685:16  0 (raise-exception _ #:continuable? _)
2023-08-21 23:59:35 In procedure fport_write: Broken pipe
--8<---------------cut here---------------end--------------->8---

So the connection was apparently severed (?), resulting in the "broken
pipe" error.

Here's a different one:

--8<---------------cut here---------------start------------->8---
substitute: 
substitute: updating substitutes from 'http://141.80.167.131'...   0.0%
substitute: could not fetch http://141.80.167.131/p2lfyvbxicjqsm4qp6368bx76gp0g948.narinfo 504
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
cannot build missing derivation ?/gnu/store/p2lfyvbxicjqsm4qp6368bx76gp0g948-python-astropy-healpix-0.7.drv?
--8<---------------cut here---------------end--------------->8---

It occurred around the same time, and the failure mode was the same, per
guix-publish.log:

--8<---------------cut here---------------start------------->8---
2023-08-21 23:59:35 GET /p2lfyvbxicjqsm4qp6368bx76gp0g948.narinfo
2023-08-21 23:59:35 In web/server/http.scm:
2023-08-21 23:59:35     159:7  2 (http-write #<<http-server> socket: #<input-output: fi…> …)
2023-08-21 23:59:35 In unknown file:
2023-08-21 23:59:35            1 (put-bytevector #<input-output: socket 50> #vu8(83 # …) …)
2023-08-21 23:59:35 In ice-9/boot-9.scm:
2023-08-21 23:59:35   1685:16  0 (raise-exception _ #:continuable? _)
2023-08-21 23:59:35 In procedure fport_write: Broken pipe
--8<---------------cut here---------------end--------------->8---

I wonder if these could be related to the DDoS protection discovered on
the Berlin network.  I'll keep looking for other, potentially different
occurrences.

-- 
Thanks,
Maxim




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Tue, 22 Aug 2023 20:39:02 GMT) (full text, mbox, link).


Message #22 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Tue, 22 Aug 2023 22:38:24 +0200
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Looking at several recent 'cannot build missing derivation' build
> failures on Cuirass, I see for example:
>
> substitute: 
> substitute: updating substitutes from 'http://141.80.167.131'...   0.0%
> substitute: could not fetch http://141.80.167.131/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo 504
> substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
> cannot build missing derivation ?/gnu/store/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym-python-asdf-standard-1.0.3.drv?
>
>
> So it seems the error originated from guix-publish being under too
> heavy a load to produce a timely reply, and the nginx proxy issued a 504
> (timeout) error response.
>
> Looking into /var/log/guix-publish.log for a corresponding entry, I
> found:
>
> 2023-08-21 23:59:35 GET /rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo
> 2023-08-21 23:59:35 In web/server/http.scm:
> 2023-08-21 23:59:35     159:7  2 (http-write #<<http-server> socket: #<input-output: fi…> …)
> 2023-08-21 23:59:35 In unknown file:
> 2023-08-21 23:59:35            1 (put-bytevector #<input-output: socket 42> #vu8(83 # …) …)
> 2023-08-21 23:59:35 In ice-9/boot-9.scm:
> 2023-08-21 23:59:35   1685:16  0 (raise-exception _ #:continuable? _)
> 2023-08-21 23:59:35 In procedure fport_write: Broken pipe
>
>
> So the connection was apparently severed (?), resulting in the "broken
> pipe" error.

I think it’s just that, when ‘guix publish’ eventually replied, the
client had left, hence EPIPE.

The initial problem does look like ‘guix publish’ being too slow.  Do
the corresponding nginx logs confirm the “backend too slow => 504”
hypothesis?

Thanks for investigating!

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Wed, 30 Aug 2023 12:18:02 GMT) (full text, mbox, link).


Message #25 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: 宋文武 <iyzsong@envs.net>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Wed, 30 Aug 2023 20:17:20 +0800
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> I wonder if these could be related to the DDoS protection discovered on
> the Berlin network.  I'll keep looking for other, potentially different
> occurrences.


Hello, this one for ddd: https://ci.guix.gnu.org/build/1372655/log/raw

  cannot build missing derivation ?/gnu/store/anzz2p18b7r9x45y350avnk8br2yihi2-ddd-3.4.0.drv?

Restarting it on CI still gives the same error.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Tue, 10 Oct 2023 15:54:01 GMT) (full text, mbox, link).


Message #28 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: 54447@debbugs.gnu.org, guix-sysadmin <guix-sysadmin@gnu.org>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Tue, 10 Oct 2023 17:52:54 +0200
[Message part 1 (text/plain, inline)]
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
> errors.

I have a disappointingly simple hypothesis for this.  Remember that
“missing derivation” errors happen primarily for system tests.

Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
mcron job, explicitly removes GC roots for things like *-os-encrypted
once they’re more than two days old, as well as GC roots for the
corresponding .drv.

I think this was increasing the likelihood that a .drv would be GC’d by
the time we run the test: under high load¹, it’s plausible that a system
test wouldn’t be built within two days after it’s been queued.

I’m proposing the change below to address this; I don’t think we need
‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
things in ‘guix publish’ cache first and foremost.

Thoughts?

In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
days in practice).  That’s okay, except that it would be safer to delete
GC roots for a .drv if and only if it’s been built already.
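
To make the "only if it's been built already" rule concrete, here is a
rough Python sketch.  Cuirass itself is written in Guile Scheme, so this
is not its actual code: the function name, the mapping of root names to
modification times, and the set of built derivations are all
hypothetical.

```python
import time


def roots_to_delete(roots, ttl_seconds, built, now=None):
    """Return the GC root names that are safe to delete.

    roots: hypothetical mapping of GC root name -> mtime (seconds).
    built: hypothetical set of .drv root names already built.
    A root is selected once it is older than the TTL; a .drv root is
    additionally kept until its derivation has been built.
    """
    now = time.time() if now is None else now
    expired = [name for name, mtime in roots.items()
               if now - mtime > ttl_seconds]
    return sorted(name for name in expired
                  if not name.endswith(".drv") or name in built)
```

With a 30-day TTL, an expired .drv root whose build is still queued
would be kept, while expired output roots and expired roots of built
derivations would be selected for deletion.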

Thanks,
Ludo’.

¹ The queue was often processed slowly, with many workers remaining idle
  due to the bug fixed by
  <https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=40f70d28aed55c404cca6a0760860fb4942e6bee>.

[Message part 2 (text/x-patch, inline)]
diff --git a/hydra/modules/sysadmin/services.scm b/hydra/modules/sysadmin/services.scm
index fecfdde..e6f2b44 100644
--- a/hydra/modules/sysadmin/services.scm
+++ b/hydra/modules/sysadmin/services.scm
@@ -110,9 +110,7 @@
                               ((guix config) => ,(make-config.scm)))
        #~(begin
            (use-modules (ice-9 ftw)
-                        (srfi srfi-1)
-                        (guix store)
-                        (guix derivations))
+                        (srfi srfi-1))
 
            (define %roots-directory
              "/var/guix/profiles/per-user/cuirass/cuirass")
@@ -157,28 +155,6 @@
                      deleted))
                  deleted))
 
-           (define (root-target root)
-             ;; Return the store item ROOT refers to.
-             (string-append (%store-prefix) "/" (basename root)))
-
-           (define (derivation-referrers store item)
-             ;; Return the referrers of the derivers of ITEM.
-             (let* ((derivers  (valid-derivers store item))
-                    (referrers (append-map (lambda (drv)
-                                             (referrers store drv))
-                                           derivers)))
-               (delete-duplicates referrers)))
-
-           (define (delete-gc-root-for-derivation drv)
-             ;; Delete the GC root for DRV, if any.
-             (catch 'system-error
-               (lambda ()
-                 (let ((item (derivation-path->output-path drv)))
-                   (delete-file
-                    (string-append %roots-directory
-                                   "/" (basename drv)))))
-               (const #f)))
-
            ;; Note: 'scandir' would introduce too much overhead due
            ;; to the large number of entries that it would sort.
            (define deleted
@@ -197,17 +173,7 @@
                (for-each (lambda (file)
                            (display file port)
                            (newline port))
-                         deleted)))
-
-           ;; Since we run 'guix-daemon --gc-keep-outputs
-           ;; --gc-keep-derivations', also remove GC roots for the outputs of
-           ;; derivations that refer to the derivers of DELETED.
-           (for-each delete-gc-root-for-derivation
-                     (with-store store
-                       (append-map (lambda (root)
-                                     (derivation-referrers
-                                      store (root-target root)))
-                                   deleted))))))))
+                         deleted))))))))
 
 (define (gc-jobs threshold)
   "Return the garbage collection mcron jobs.  The garbage collection
@@ -251,8 +217,7 @@ collection instead."
 
    (build-accounts (* build-accounts-to-max-jobs-ratio max-jobs))
    (extra-options (list "--max-jobs" (number->string max-jobs)
-                        "--cores" (number->string cores)
-                        "--gc-keep-outputs" "--gc-keep-derivations"))))
+                        "--cores" (number->string cores)))))
 
 
 ;;;

Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Wed, 11 Oct 2023 03:09:01 GMT) (full text, mbox, link).


Message #31 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: Mathieu Othacehe <othacehe@gnu.org>, guix-sysadmin <guix-sysadmin@gnu.org>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Tue, 10 Oct 2023 23:08:12 -0400
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> A lot of builds, among them ~20 system tests[1], are failing with:
>> "cannot build missing derivation
>> ?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
>> errors.
>
> I have a disappointingly simple hypothesis for this.  Remember that
> “missing derivation” errors happen primarily for system tests.
>
> Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
> mcron job, explicitly removes GC roots for things like *-os-encrypted
> once they’re more than two days old, as well as GC roots for the
> corresponding .drv.
>
> I think this was increasing the likelihood that a .drv would be GC’d by
> the time we run the test: under high load¹, it’s plausible that a system
> test wouldn’t be built within two days after it’s been queued.
>
> I’m proposing the change below to address this; I don’t think we need
> ‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
> things in ‘guix publish’ cache first and foremost.
>
> Thoughts?

Ah, so that mcron job is a kind of hack to garbage-collect *some* items
sooner than the default policy of 30 days?  And we'd now avoid deleting
the selected .drv files while still deleting their outputs, so if
something that needs those outputs took more than 2 days to build, the
garbage-collected outputs would have to be rebuilt?

I'm not sure if we need such a fancy hack with the 100 TiB of data we
now have, but your fix seems reasonable (LGTM!)

> In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
> procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
> days in practice).  That’s okay, except that it would be safer to delete
> GC roots for a .drv if and only if it’s been built already.

Hm.  I wonder if this could explain the other cases we've seen.  It
could be that the build of a derivation was interrupted or canceled for
some reason, 30 days elapsed, the .drv was garbage collected, and since
it doesn't get recreated afterwards we get the missing .drv error?

-- 
Thanks,
Maxim




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Wed, 11 Oct 2023 03:23:02 GMT) (full text, mbox, link).


Message #34 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: 宋文武 <iyzsong@envs.net>
Cc: Mathieu Othacehe <othacehe@gnu.org>, Ludovic Courtès <ludo@gnu.org>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Tue, 10 Oct 2023 23:21:49 -0400
Hello,

宋文武 <iyzsong@envs.net> writes:

[...]

> Hello, this one for ddd: https://ci.guix.gnu.org/build/1372655/log/raw
>
>   cannot build missing derivation ?/gnu/store/anzz2p18b7r9x45y350avnk8br2yihi2-ddd-3.4.0.drv?
>
> Restarting it on CI still gives the same error.

Another example: https://ci.guix.gnu.org/build/1982454/details

--8<---------------cut here---------------start------------->8---
substitute: 
substitute: updating substitutes from 'http://10.0.0.1'...   0.0%
substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
cannot build missing derivation ?/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv?
--8<---------------cut here---------------end--------------->8---

-- 
Thanks,
Maxim




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 15 Oct 2023 16:47:01 GMT) (full text, mbox, link).


Message #37 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 宋文武 <iyzsong@envs.net>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 15 Oct 2023 18:45:37 +0200
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Another example: https://ci.guix.gnu.org/build/1982454/details
>
> substitute: 
> substitute: updating substitutes from 'http://10.0.0.1'...   0.0%
> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
> cannot build missing derivation ?/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv?

This one is from Sep. 9, which is before I deployed the remote-worker
fixes, so I’ll dismiss it (happy to look at more recent ones though!).

Tip of the day: M-: (build-farm-build 1982454)

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 15 Oct 2023 20:23:01 GMT) (full text, mbox, link).


Message #40 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 15 Oct 2023 22:21:58 +0200
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> A lot of builds, among them ~20 system tests[1], are failing with:
>> "cannot build missing derivation
>> ?/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv?"
>> errors.
>>
>> Those derivations are present on the CI head node. This means that the
>> errors occur during substitution. This is most likely caused by some
>> issue with the publish server, because:
>>
>> - The publish server serves a 404 error. We should get rid once and for
>>   all of this 404 thing, pushing something like:
>>   https://issues.guix.gnu.org/50040.
>>
>> or
>>
>> - The publish server is not fast enough and hits an Nginx timeout that
>>   closes the communication.
>
> Also being discussed at <https://issues.guix.gnu.org/48468#12>.

I got confirmation that the cache-bypass-threshold hypothesis holds, at
least for system tests.

Namely, looking at <https://ci.guix.gnu.org/build/2258097/details>,
which ends like this:

--8<---------------cut here---------------start------------->8---
@ substituter-succeeded /gnu/store/qh2876i5l1wvxgwhg9fbl9zmb3px3n2m-gc-roots.drv
fetching path `/gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder'...
@ substituter-started /gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder substitute
Downloading http://141.80.167.131/nar/lzip/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder...
. xdg-mime-database-builder                    3.6MiB/s 00:00 | 3KiB transferred. xdg-mime-database-builder                    1.9MiB/s 00:00 | 3KiB transferred

@ substituter-succeeded /gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder
cannot build missing derivation ‘/gnu/store/4r1wij3bzj9zv75ds82a93jl7bcman2x-installed-extlinux-os.drv’
--8<---------------cut here---------------end--------------->8---

Looking at the nginx and ‘guix publish’ logs, I found that the missing
substitute is not that of 4r1wij3bzj9zv75ds82a93jl7bcman2x (the .drv
itself) but rather that of a dependency of that .drv:

  [14/Oct/2023:23:22:09 +0200] "GET /wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg.narinfo HTTP/1.1" 404 58 "-" "GNU Guile"

That item’s size is above the cache bypass threshold of 100 MiB as
currently configured on berlin:

--8<---------------cut here---------------start------------->8---
$ du -hs /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
124M    /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
--8<---------------cut here---------------end--------------->8---

The immediate fix/workaround is to raise that threshold.
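
As a minimal sketch of why raising the threshold helps, assuming the
documented behavior of ‘guix publish --cache-bypass-threshold’ (an
uncached item smaller than the threshold is served immediately, while a
larger one gets a 404 until it has been baked into the cache), the
decision can be modeled in Python as follows.  The function name is made
up, and the ‘du’ figure is used here as a stand-in for the size that
‘guix publish’ actually compares.

```python
MIB = 1024 * 1024


def first_narinfo_status(item_size, threshold, cached=False):
    """Model the HTTP status of a narinfo request (illustrative only).

    Cached items and uncached items below the bypass threshold are
    served right away; larger uncached items yield a 404.
    """
    if cached or item_size < threshold:
        return 200
    return 404


# The 124 MiB item from the excerpt above, against a 100 MiB threshold
# and against a larger, 150 MiB one.
print(first_narinfo_status(124 * MIB, 100 * MIB))
print(first_narinfo_status(124 * MIB, 150 * MIB))
```

Under this model the first call yields 404 and the second 200, which
matches the 404 seen in the nginx log for the 124M item.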

A better solution would be for system tests to depend on a fixed-output
derivation for the Guix source instead of the “source” above (I use
“source” as it is used in the context of <derivation>).

Thanks,
Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 15 Oct 2023 20:36:01 GMT) (full text, mbox, link).


Message #43 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 15 Oct 2023 22:34:25 +0200
Ludovic Courtès <ludo@gnu.org> skribis:

> Looking at the nginx and ‘guix publish’ logs, I found that the missing
> substitute is not that of 4r1wij3bzj9zv75ds82a93jl7bcman2x (the .drv
> itself) but rather that of a dependency of that .drv:
>
>   [14/Oct/2023:23:22:09 +0200] "GET /wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg.narinfo HTTP/1.1" 404 58 "-" "GNU Guile"
>
> That item’s size is above the cache bypass threshold of 100 MiB as
> currently configured on berlin:
>
> $ du -hs /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
> 124M    /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
>
> The immediate fix/workaround is to raise that threshold.

I raised the threshold to 150 MiB in maintenance.git commit
213384e43de63ce3a5a55599e8fb89891ffef7eb.

I reconfigured berlin and restarted ‘guix publish’ seconds ago.
Hopefully next time installation tests won’t have that problem.

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 15 Oct 2023 20:43:01 GMT) (full text, mbox, link).


Message #46 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: 54447@debbugs.gnu.org, guix-sysadmin <guix-sysadmin@gnu.org>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 15 Oct 2023 22:42:14 +0200
Ludovic Courtès <ludo@gnu.org> skribis:

> In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
> procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
> days in practice).  That’s okay, except that it would be safer to delete
> GC roots for a .drv if and only if it’s been built already.

Fixed in Cuirass commit 55af0f70c0d4938b8eda777382bbc4d8f5698a37.

Ludo'.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Mon, 16 Oct 2023 13:27:01 GMT) (full text, mbox, link).


Message #49 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 宋文武 <iyzsong@envs.net>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Mon, 16 Oct 2023 09:25:20 -0400
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Another example: https://ci.guix.gnu.org/build/1982454/details
>>
>> substitute: 
>> substitute: updating substitutes from 'http://10.0.0.1'...   0.0%
>> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
>> cannot build missing derivation ?/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv?
>
> This one is from Sep. 9, which is before I deployed the remote-worker
> fixes, so I’ll dismiss it (happy to look at more recent ones though!).
>
> Tip of the day: M-: (build-farm-build 1982454)

I don't have such a function in scope, is this from the guix-emacs
package?

-- 
Thanks,
Maxim




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Mon, 16 Oct 2023 17:40:02 GMT) (full text, mbox, link).


Message #52 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 宋文武 <iyzsong@envs.net>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Mon, 16 Oct 2023 19:39:01 +0200
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>> Tip of the day: M-: (build-farm-build 1982454)
>
> I don't have such a function in scope, is this from the guix-emacs
> package?

It’s from the ‘emacs-build-farm’ package, which I recommend.  :-)

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Mon, 16 Oct 2023 17:46:02 GMT) (full text, mbox, link).


Message #55 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: Mathieu Othacehe <othacehe@gnu.org>
Cc: 54447@debbugs.gnu.org, guix-sysadmin <guix-sysadmin@gnu.org>, Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Mon, 16 Oct 2023 19:44:41 +0200
Ludovic Courtès <ludo@gnu.org> skribis:

> Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
> mcron job, explicitly removes GC roots for things like *-os-encrypted
> once they’re more than two days old, as well as GC roots for the
> corresponding .drv.
>
> I think this was increasing the likelihood that a .drv would be GC’d by
> the time we run the test: under high load¹, it’s plausible that a system
> test wouldn’t be built within two days after it’s been queued.
>
> I’m proposing the change below to address this; I don’t think we need
> ‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
> things in ‘guix publish’ cache first and foremost.

I pushed a variant of this patch:

  053839d hydra: services: Leave “guix-binary.tar.xz” GC roots.
  e40d961 hydra: services: Preserve Cuirass .drv GC roots.
  b8fc66c hydra: cuirass: Fix build product regexps.

I didn’t dare remove “--gc-keep-derivations”.  I reconfigured berlin
just now from this commit and restarted mcron (I didn’t restart
guix-daemon to avoid downtime; we should do that when the queue is close
to empty).

We’ll have to monitor disk usage to make sure it’s not negatively
affected.

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Mon, 20 Nov 2023 19:10:01 GMT) (full text, mbox, link).


Message #58 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: Mathieu Othacehe <othacehe@gnu.org>, 宋文武 <iyzsong@envs.net>, 54447@debbugs.gnu.org
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Mon, 20 Nov 2023 14:09:17 -0500
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Another example: https://ci.guix.gnu.org/build/1982454/details
>>
>> substitute: 
>> substitute: updating substitutes from 'http://10.0.0.1'...   0.0%
>> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
>> cannot build missing derivation ?/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv?
>
> This one is from Sep. 9, which is before I deployed the remote-worker
> fixes, so I’ll dismiss it (happy to look at more recent ones though!).

Here's a more recent occurrence:
https://ci.guix.gnu.org/build/2635272/details

I haven't restarted it to leave proof of its existence :-)

-- 
Thanks,
Maxim




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Thu, 04 Apr 2024 21:34:02 GMT) (full text, mbox, link).


Message #61 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 54447@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>, guix-sysadmin <guix-sysadmin@gnu.org>, Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Thu, 04 Apr 2024 23:33:38 +0200
Hello!

News from the everlasting bug!

  cannot build missing derivation ‘/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv’

(From <https://ci.guix.gnu.org/build/3861708/>.)

Why was it missing this time?  /var/log/nginx/error.log:

--8<---------------cut here---------------start------------->8---
2024/04/04 17:15:03 [error] 98751#0: *152293778 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo", host: "141.80.167.131"
--8<---------------cut here---------------end--------------->8---

Oops!  (There are dozens of upstream timeouts logged on that minute.)

/var/log/guix-publish.log:

--8<---------------cut here---------------start------------->8---
2024-04-04 17:14:51 GET /nar/lzip/pz39bkq7pd1hgy5rwiynqa33gyjvpgs5-python-pygments-2.12.0
2024-04-04 17:14:51 GET /z2xxwwxswdd4b8c8iwmxhqnqbp5nwz09.narinfo
2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo
2024-04-04 17:14:51 GET /nar/zstd/jxkglr445f215m2faqz1i2lgmbans4rf-texlive-amsmath-66594-doc
2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo
2024-04-04 17:15:33 GET /nar/zstd/i2hp3q2pfhsyl0al7z38am7cqpddi4qr-texlive-capt-of-66594-doc
2024-04-04 17:15:33 GET /hh0gdbljj3cjdnjbr88kfm21mhys5sy7.narinfo
2024-04-04 17:15:33 GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo
2024-04-04 17:15:33 GET /yj63wifalfr6sla42h7mkqg011qrl5d0.narinfo
2024-04-04 17:15:33 GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo
2024-04-04 17:15:33 -> GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo: 404
2024-04-04 17:15:33 GET /nar/lzip/6zxlrw15b9dsv73s7v5fqabl7iv5v5il-python-exceptiongroup-1.1.1
2024-04-04 17:15:33 GET /nar/zstd/pychjd114abscbqlzcr3s7myf1497vw2-julia-compilersupportlibraries-jll-0.4.0%2B1
--8<---------------cut here---------------end--------------->8---

‘guix publish’ replied, but 40s too late (nginx has
“proxy_connect_timeout 10s;” for .narinfo URLs¹).

Notice the 40s pause time between 17:14:51 and 17:15:33.  Stop-the-world
GC?  Unlikely, because ‘guix publish’ had been running for ~3h, so even
with a leak², it’s hard to believe GC could take this long.

Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/nginx/berlin.scm#n103
² https://issues.guix.gnu.org/69596
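
The gap between the 17:14:51 and 17:15:33 entries in the ‘guix publish’
log above can also be located mechanically.  What follows is a minimal
Python sketch, not part of Guix or Cuirass: the name `find_pauses` is
made up for illustration, it assumes every log entry starts with a
`YYYY-MM-DD HH:MM:SS` timestamp as in the excerpt, and the 10-second
default threshold merely mirrors the nginx timeout quoted in the message.

```python
from datetime import datetime, timedelta


def find_pauses(lines, threshold=timedelta(seconds=10)):
    """Return (previous, current, gap) for consecutive log entries
    whose timestamps are more than `threshold` apart."""
    pauses = []
    previous = None
    for line in lines:
        # Entries are assumed to start with "YYYY-MM-DD HH:MM:SS";
        # anything that does not parse is skipped.
        try:
            current = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        if previous is not None and current - previous > threshold:
            pauses.append((previous, current, current - previous))
        previous = current
    return pauses


# Four entries copied from the log excerpt in the message.
sample = [
    "2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo",
    "2024-04-04 17:14:51 GET /nar/zstd/jxkglr445f215m2faqz1i2lgmbans4rf-texlive-amsmath-66594-doc",
    "2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo",
    "2024-04-04 17:15:33 GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo",
]

for before, after, gap in find_pauses(sample):
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: "
          f"{gap.total_seconds():.0f}s without log entries")
# prints "17:14:51 -> 17:15:33: 42s without log entries"
```

Run over a whole log file, a scan like this would list every interval
in which ‘guix publish’ logged nothing for longer than the proxy
timeout, which could then be compared against the upstream timeouts in
the nginx error log.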




Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 14 Apr 2024 00:17:03 GMT) (full text, mbox, link).


Message #64 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: John Kehayias <john.kehayias@protonmail.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: 54447@debbugs.gnu.org, guix-sysadmin <guix-sysadmin@gnu.org>, Maxim Cournoyer <maxim.cournoyer@gmail.com>, Mathieu Othacehe <othacehe@gnu.org>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 14 Apr 2024 00:15:45 +0000
Hi all,

On Thu, Apr 04, 2024 at 11:33 PM, Ludovic Courtès wrote:

> Hello!
>
> News from the everlasting bug!
>
>   cannot build missing derivation
> ‘/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv’
>
> (From <https://ci.guix.gnu.org/build/3861708/>.)
>
> Why was it missing this time?  /var/log/nginx/error.log:
>
> 2024/04/04 17:15:03 [error] 98751#0: *152293778 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo", host: "141.80.167.131"
>
>
> Oops!  (There are dozens of upstream timeouts logged on that minute.)
>
> /var/log/guix-publish.log:
>
> 2024-04-04 17:14:51 GET /nar/lzip/pz39bkq7pd1hgy5rwiynqa33gyjvpgs5-python-pygments-2.12.0
> 2024-04-04 17:14:51 GET /z2xxwwxswdd4b8c8iwmxhqnqbp5nwz09.narinfo
> 2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo
> 2024-04-04 17:14:51 GET /nar/zstd/jxkglr445f215m2faqz1i2lgmbans4rf-texlive-amsmath-66594-doc
> 2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo
> 2024-04-04 17:15:33 GET /nar/zstd/i2hp3q2pfhsyl0al7z38am7cqpddi4qr-texlive-capt-of-66594-doc
> 2024-04-04 17:15:33 GET /hh0gdbljj3cjdnjbr88kfm21mhys5sy7.narinfo
> 2024-04-04 17:15:33 GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo
> 2024-04-04 17:15:33 GET /yj63wifalfr6sla42h7mkqg011qrl5d0.narinfo
> 2024-04-04 17:15:33 GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo
> 2024-04-04 17:15:33 -> GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo: 404
> 2024-04-04 17:15:33 GET /nar/lzip/6zxlrw15b9dsv73s7v5fqabl7iv5v5il-python-exceptiongroup-1.1.1
> 2024-04-04 17:15:33 GET /nar/zstd/pychjd114abscbqlzcr3s7myf1497vw2-julia-compilersupportlibraries-jll-0.4.0%2B1
>
> ‘guix publish’ replied, but 40s too late (nginx has
> “proxy_connect_timeout 10s;” for .narinfo URLs¹).
>
> Notice the 40s pause time between 17:14:51 and 17:15:33.  Stop-the-world
> GC?  Unlikely, because ‘guix publish’ had been running for ~3h, so even
> with a leak², it’s hard to believe GC could take this long.
>
> Ludo’.
>
> ¹
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/nginx/berlin.scm#n103
> ² https://issues.guix.gnu.org/69596

I don't have any insight, but if anyone wants to see this in action at a
large scale, take a look at pretty much any red dot on
https://ci.guix.gnu.org/eval/1238471/dashboard?system=i686-linux

From my quick look, the CL and texlive failures were all missing
derivations. I've tried restarting a bunch to get i686 coverage going, so
hopefully some will disappear. But I can't/won't manually restart the
thousands(?) of failed builds. I didn't see such issues on x86_64, while
other architectures take a really long time to build on Berlin so I
haven't looked.

I don't know if this is helpful, but I thought I would chime in in case
anyone wants a bunch of data. And if there are good ideas for how to
recover (just restart all builds?), that would be great, so that
mesa-updates gets built on i686; otherwise it looks good.

Thanks!
John





Information forwarded to bug-guix@gnu.org:
bug#54447; Package guix. (Sun, 14 Jul 2024 21:52:01 GMT) (full text, mbox, link).


Message #67 received at 54447@debbugs.gnu.org (full text, mbox, reply):

From: Ludovic Courtès <ludo@gnu.org>
To: 54447@debbugs.gnu.org
Cc: Mathieu Othacehe <othacehe@gnu.org>, guix-sysadmin <guix-sysadmin@gnu.org>, Maxim Cournoyer <maxim.cournoyer@gmail.com>
Subject: Re: bug#54447: cuirass: missing derivation error
Date: Sun, 14 Jul 2024 23:49:20 +0200
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> News from the everlasting bug!
>
>   cannot build missing derivation ‘/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv’

[...]

> ‘guix publish’ replied, but 40s too late (nginx has
> “proxy_connect_timeout 10s;” for .narinfo URLs¹).

While the exact reason why ‘guix publish’ exhibits this behavior is
unclear, the good news is that this is “fixed” by having ‘cuirass
remote-worker’ retry when it fails to substitute a .drv (thanks Chris
for the obvious-in-hindsight tip!):

  https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=2365ba786c805477fcbae6eaeb358b0dd0501598
  https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=71426663f6ea32152782645e4632168dd2b18602

Furthermore, workers can now reject builds if they fail to substitute
the .drv, in which case ‘cuirass remote-server’ either reschedules or
cancels the build:

  https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=a909fa99340db5e5cd64612ea4e07e929dc643ad

This was deployed a few days ago on berlin and on its x86_64 build
machines.  Working well so far!

Ludo’.
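
The retry-then-reject behavior described in this message can be
sketched as follows.  This is an illustrative Python sketch only, not
the Cuirass code (Cuirass is written in Guile Scheme; the actual changes
are in the commits linked above).  `SubstitutionError`,
`fetch_with_retry`, `handle_build`, the attempt count and the delay are
all hypothetical names and values chosen for the example.

```python
import time


class SubstitutionError(Exception):
    """Raised by the (hypothetical) fetch function when a .drv
    cannot be substituted, e.g. because the proxy timed out."""


def fetch_with_retry(fetch, drv, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch(drv) up to `attempts` times.  Return True on success,
    False if every attempt raised SubstitutionError."""
    for attempt in range(1, attempts + 1):
        try:
            fetch(drv)
            return True
        except SubstitutionError:
            if attempt < attempts:
                sleep(delay * attempt)  # wait a bit longer each time
    return False


def handle_build(fetch, build, drv, **retry_options):
    """Worker side: build only if the .drv could be fetched; otherwise
    reject, leaving it to the server to reschedule or cancel."""
    if fetch_with_retry(fetch, drv, **retry_options):
        return build(drv)
    return "rejected"


# Simulate a publish server that fails twice, then answers.
calls = []


def flaky_fetch(drv):
    calls.append(drv)
    if len(calls) < 3:
        raise SubstitutionError(drv)


result = handle_build(
    flaky_fetch,
    lambda drv: "built",
    "/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv",
    delay=0,
)
print(result, len(calls))  # prints "built 3"
```

The point of the design is that a transient failure to fetch the .drv
no longer marks the build as failed: the worker first retries, and only
then hands the decision back to the server.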




bug closed, send any further explanations to 54447@debbugs.gnu.org and Mathieu Othacehe <othacehe@gnu.org> Request was from Ludovic Courtès <ludo@gnu.org> to control@debbugs.gnu.org. (Sun, 14 Jul 2024 21:52:02 GMT) (full text, mbox, link).


bug archived. Request was from Debbugs Internal Request <help-debbugs@gnu.org> to internal_control@debbugs.gnu.org. (Mon, 12 Aug 2024 11:24:06 GMT) (full text, mbox, link).



debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Sun Sep 8 03:04:22 2024; Machine Name: wallace-server
