GNU bug report logs

#24496 offloading should fall back to local build after n tries


Report forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Wed, 21 Sep 2016 15:41:02 GMT).


Acknowledgement sent to ng0 <ngillmann@runbox.com>:
New bug report received and forwarded. Copy sent to bug-guix@gnu.org. (Wed, 21 Sep 2016 15:41:02 GMT).


Message #5 received at submit@debbugs.gnu.org:

From: ng0 <ngillmann@runbox.com>
To: bug-guix@gnu.org
Subject: offloading should fall back to local build after n tries
Date: Wed, 21 Sep 2016 09:39:48 +0000
When I forget that my build machine is offline and do not pass
--no-build-hook, offloading keeps retrying forever, until I have to
cancel the build, boot the build machine, and start the build again.

A solution could be a config option or default behavior which after
failing to offload for n times gives up and uses the local builder.

Is this desired at all? Setups like Hydra could run into problems, but
for small setups with the same architecture, could there be a solution
beyond --no-build-hook?
-- 
              ng0




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Mon, 26 Sep 2016 15:51:01 GMT).


Message #8 received at 24496@debbugs.gnu.org:

From: ludo@gnu.org (Ludovic Courtès)
To: ng0 <ngillmann@runbox.com>
Cc: 24496@debbugs.gnu.org
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Mon, 26 Sep 2016 18:20:51 +0900
Hello!

ng0 <ngillmann@runbox.com> skribis:

> When I forgot that my build machine is offline and I did not pass
> --no-build-hook, the offloading keeps trying forever until I had to
> cancel the build, boot the build-machine and started the build again.
>
> A solution could be a config option or default behavior which after
> failing to offload for n times gives up and uses the local builder.
>
> Is this desired at all? Setups like hydra could get problems, but for
> small setups with the same architecture there could be a solution beyond
> --no-build-hook?

Like you say, on a Hydra-style setup this could be a problem: the
front-end machine may have --max-jobs=0, meaning that it cannot perform
builds on its own.

So I guess we would need a command-line option to select a different
behavior.  I’m not sure how to do that because ‘guix offload’ is
“hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
option.

In the meantime, you could also hack up your machines.scm: it would
return a list where unreachable machines have been filtered out.

Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Tue, 04 Oct 2016 17:10:01 GMT).


Message #11 received at 24496@debbugs.gnu.org:

From: ng0 <ngillmann@runbox.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: 24496@debbugs.gnu.org
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Tue, 04 Oct 2016 17:08:58 +0000
Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> ng0 <ngillmann@runbox.com> skribis:
>
>> When I forgot that my build machine is offline and I did not pass
>> --no-build-hook, the offloading keeps trying forever until I had to
>> cancel the build, boot the build-machine and started the build again.
>>
>> A solution could be a config option or default behavior which after
>> failing to offload for n times gives up and uses the local builder.
>>
>> Is this desired at all? Setups like hydra could get problems, but for
>> small setups with the same architecture there could be a solution beyond
>> --no-build-hook?
>
> Like you say, on Hydra-style setup this could be a problem: the
> front-end machine may have --max-jobs=0, meaning that it cannot perform
> builds on its own.
>
> So I guess we would need a command-line option to select a different
> behavior.  I’m not sure how to do that because ‘guix offload’ is
> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
> option.

Could the daemon run with --enable-hydra-style or --disable-hydra-style,
where --disable-hydra-style would allow falling back to a local build if
the machine did not reply after a defined time (keeping slow connections
in mind)?

> In the meantime, you could also hack up your machines.scm: it would
> return a list where unreachable machines have been filtered out.

How can I achieve this?

And to append to this bug: it seems to me that offloading requires one
lsh key for each build machine
(https://lists.gnu.org/archive/html/help-guix/2016-10/msg00007.html),
and that you cannot address the machines directly. Say I want to create
some system where I build on machine 1 AND machine 2. Having two x86_64
machines in machines.scm only selects one of them (if two were working,
see linked thread) and builds on the one which is accessible first. If,
however, the first machine is somehow blocked and fails, thereby
terminating the lsh connection, the build does not happen at all.

Leaving out the problems, what I want to do in short: How could I build
on both systems at the same time when I desire to do so?

> Ludo’.
>

-- 




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Wed, 05 Oct 2016 11:37:01 GMT).


Message #14 received at 24496@debbugs.gnu.org:

From: ludo@gnu.org (Ludovic Courtès)
To: ng0 <ngillmann@runbox.com>
Cc: 24496@debbugs.gnu.org
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Wed, 05 Oct 2016 13:36:20 +0200
ng0 <ngillmann@runbox.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:

[...]

>> Like you say, on Hydra-style setup this could be a problem: the
>> front-end machine may have --max-jobs=0, meaning that it cannot perform
>> builds on its own.
>>
>> So I guess we would need a command-line option to select a different
>> behavior.  I’m not sure how to do that because ‘guix offload’ is
>> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
>> option.
>
> Could the daemon run with --enable-hydra-style or --disable-hydra-style
> and --disable-hydra-style would allow falling back to local build if
> after a defined time - keeping slow connections in mind - the machine
> did not reply.

That would be too ad-hoc IMO, and the problem mentioned above remains.

>> In the meantime, you could also hack up your machines.scm: it would
>> return a list where unreachable machines have been filtered out.
>
> How can I achieve this?

Something like:

  (define the-machine (build-machine …))

  (if (managed-to-connect-timely the-machine)
      (list the-machine)
      '())

… where ‘managed-to-connect-timely’ would try to connect to the
machine with a timeout.
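
One way such a check could be written is with Guile-SSH, which
‘guix offload’ already uses.  The following is only a sketch under a few
assumptions: the procedure takes a host name, port, and timeout instead
of a machine record, the host name "build.example.org" and the 2-second
timeout are placeholders, and the ‘build-machine’ fields stay elided, so
the ‘…’ has to be replaced by a real machine definition before the file
can be loaded.

```scheme
;; Sketch of /etc/guix/machines.scm.  Assumes Guile-SSH is available on
;; the load path of the process that evaluates this file.
(use-modules (ssh session))

(define %host "build.example.org")        ;hypothetical host name

(define (managed-to-connect-timely host port timeout)
  "Return #t if an SSH connection to HOST:PORT can be opened within
TIMEOUT seconds, #f otherwise."
  (catch #t
    (lambda ()
      (let* ((session (make-session #:host host #:port port
                                    #:timeout timeout))
             (ok?     (eq? 'ok (connect! session))))
        (when ok? (disconnect! session))
        ok?))
    (lambda _ #f)))                       ;any error counts as unreachable

(define the-machine (build-machine …))    ;fields elided; fill in as usual

;; The value of the last expression is the list of build machines.
(if (managed-to-connect-timely %host 22 2)
    (list the-machine)
    '())
```

Since machines.scm is evaluated whenever offloading is attempted, an
unreachable machine would cost up to the timeout on each attempt, so a
short timeout seems preferable here.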

> And to append to this bug: it seems to me that offloading requires 1
> lsh-key for each
> build-machine.

The main machine needs to be able to connect to each build machine over
SSH, so indeed, that requires proper SSH key registration (host keys and
authorized user keys).

> (https://lists.gnu.org/archive/html/help-guix/2016-10/msg00007.html)
> and that you can not directly address them (say I want to create some
> system where I want to build on machine 1 AND machine 2. Having 2
> x86_64 in machines.scm only selects one of them (if 2 were working,
> see linked thread) and builds on the one which is accessible first. If
> however the first machine is somehow blocked and it fails, therefore
> terminates lsh connection, the build does not happen at all.

The code that selects machines is in (guix scripts offload),
specifically ‘choose-build-machine’.  It tries to choose the “best”
machine, which means, roughly, the fastest and least loaded one.

HTH,
Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Thu, 16 Dec 2021 13:02:01 GMT).


Message #17 received at 24496@debbugs.gnu.org:

From: zimoun <zimon.toutoune@gmail.com>
To: ludo@gnu.org (Ludovic Courtès)
Cc: 24496@debbugs.gnu.org, ng0 <ngillmann@runbox.com>
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Thu, 16 Dec 2021 13:52:14 +0100
Hi,

I am just hitting this old bug#24496 [1].

On Mon, 26 Sep 2016 at 18:20, ludo@gnu.org (Ludovic Courtès) wrote:
> ng0 <ngillmann@runbox.com> skribis:
>
>> When I forgot that my build machine is offline and I did not pass
>> --no-build-hook, the offloading keeps trying forever until I had to
>> cancel the build, boot the build-machine and started the build again.

[...]

> Like you say, on Hydra-style setup this could be a problem: the
> front-end machine may have --max-jobs=0, meaning that it cannot perform
> builds on its own.
>
> So I guess we would need a command-line option to select a different
> behavior.  I’m not sure how to do that because ‘guix offload’ is
> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
> option.

When the build machine used to offload is offline and the master daemon
runs with --max-jobs=0, I would expect X tries (each leading to a
timeout) and then a failure with a hint, where X is defined by the
user.  WDYT?
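
To make the proposal concrete, here is a minimal sketch of "X tries,
then fail with a hint".  Nothing in it exists in ‘guix offload’: the
names ‘try-up-to’ and ‘offload-or-hint’, the thunk ATTEMPT (standing in
for "try to connect to the build machine"), and the wording of the hint
are all made up for illustration.

```scheme
;; Hypothetical helpers, not part of 'guix offload'.
(define (try-up-to max-tries attempt)
  "Call the thunk ATTEMPT up to MAX-TRIES times.  Return its first true
result, or #f if every attempt returned #f."
  (let loop ((n 1))
    (and (<= n max-tries)
         (or (attempt)
             (loop (+ n 1))))))

(define (offload-or-hint max-tries attempt)
  "Like 'try-up-to', but print a hint on the error port when giving up."
  (or (try-up-to max-tries attempt)
      (begin
        (format (current-error-port)
                "hint: no build machine reachable after ~a tries~%"
                max-tries)
        #f)))
```

With --max-jobs=0 the caller would then report a build failure instead
of retrying forever; with local builds allowed it could fall back to
the local builder.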


> In the meantime, you could also hack up your machines.scm: it would
> return a list where unreachable machines have been filtered out.

Maybe, this could be done by “guix offload”.


Cheers,
simon


1: <http://issues.guix.gnu.org/issue/24496>




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Fri, 17 Dec 2021 15:34:02 GMT).


Message #20 received at 24496@debbugs.gnu.org:

From: Ludovic Courtès <ludo@gnu.org>
To: zimoun <zimon.toutoune@gmail.com>
Cc: 24496@debbugs.gnu.org, Maxim Cournoyer <maxim.cournoyer@gmail.com>, ng0 <ngillmann@runbox.com>
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Fri, 17 Dec 2021 16:33:46 +0100
Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> I am just hitting this old bug#24496 [1].
>
> On Mon, 26 Sep 2016 at 18:20, ludo@gnu.org (Ludovic Courtès) wrote:
>> ng0 <ngillmann@runbox.com> skribis:
>>
>>> When I forgot that my build machine is offline and I did not pass
>>> --no-build-hook, the offloading keeps trying forever until I had to
>>> cancel the build, boot the build-machine and started the build again.
>
> [...]
>
>> Like you say, on Hydra-style setup this could be a problem: the
>> front-end machine may have --max-jobs=0, meaning that it cannot perform
>> builds on its own.
>>
>> So I guess we would need a command-line option to select a different
>> behavior.  I’m not sure how to do that because ‘guix offload’ is
>> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
>> option.
>
> When the build machine used to offload is offline and the master daemon
> is --max-jobs=0, I expect X tries (leading to timeout) and then just
> fails with a hint, where X is defined by user.  WDYT?
>
>
>> In the meantime, you could also hack up your machines.scm: it would
>> return a list where unreachable machines have been filtered out.
>
> Maybe, this could be done by “guix offload”.

Prior to commit efbf5fdd01817ea75de369e3dd2761a85f8f7dd5, this was the
case: an unreachable machine would have ‘machine-load’ return +inf.0,
and so it would be discarded from the list of candidates.

However, I think this behavior was unintentionally lost in
efbf5fdd01817ea75de369e3dd2761a85f8f7dd5.  Maxim, WDYT?

Thanks,
Ludo’.




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Fri, 17 Dec 2021 21:58:02 GMT).


Message #23 received at 24496@debbugs.gnu.org:

From: Maxim Cournoyer <maxim.cournoyer@gmail.com>
To: Ludovic Courtès <ludo@gnu.org>
Cc: ng0 <ngillmann@runbox.com>, 24496@debbugs.gnu.org, zimoun <zimon.toutoune@gmail.com>
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Fri, 17 Dec 2021 16:57:33 -0500
Hello Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi!
>
> zimoun <zimon.toutoune@gmail.com> skribis:
>
>> I am just hitting this old bug#24496 [1].
>>
>> On Mon, 26 Sep 2016 at 18:20, ludo@gnu.org (Ludovic Courtès) wrote:
>>> ng0 <ngillmann@runbox.com> skribis:
>>>
>>>> When I forgot that my build machine is offline and I did not pass
>>>> --no-build-hook, the offloading keeps trying forever until I had to
>>>> cancel the build, boot the build-machine and started the build again.
>>
>> [...]
>>
>>> Like you say, on Hydra-style setup this could be a problem: the
>>> front-end machine may have --max-jobs=0, meaning that it cannot perform
>>> builds on its own.
>>>
>>> So I guess we would need a command-line option to select a different
>>> behavior.  I’m not sure how to do that because ‘guix offload’ is
>>> “hidden” behind ‘guix-daemon’, so there’s no obvious place for such an
>>> option.
>>
>> When the build machine used to offload is offline and the master daemon
>> is --max-jobs=0, I expect X tries (leading to timeout) and then just
>> fails with a hint, where X is defined by user.  WDYT?
>>
>>
>>> In the meantime, you could also hack up your machines.scm: it would
>>> return a list where unreachable machines have been filtered out.
>>
>> Maybe, this could be done by “guix offload”.
>
> Prior to commit efbf5fdd01817ea75de369e3dd2761a85f8f7dd5, this was the
> case: an unreachable machine would have ‘machine-load’ return +inf.0,
> and so it would be discarded from the list of candidates.
>
> However, I think this behavior was unintentionally lost in
> efbf5fdd01817ea75de369e3dd2761a85f8f7dd5.  Maxim, WDYT?

I just reviewed this commit, and don't see where the behavior would
have changed.  The discarding happens here:

--8<---------------cut here---------------start------------->8---
-         (if (and node (< load 2.) (>= space %minimum-disk-space))
+         (if (and node
+                  (or (not threshold) (< load threshold))
+                  (>= space %minimum-disk-space))
--8<---------------cut here---------------end--------------->8---

Previously, load could be set to +inf.0.  Now it is a float between 0.0
and 1.0, with threshold defaulting to 0.6.
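
The new condition can be restated as a stand-alone predicate to see how
it treats an unreachable machine.  This is only a sketch: the name
‘machine-suitable?’ and the 100 MiB value given to
‘%minimum-disk-space’ are assumptions made for the example, not taken
from the commit.

```scheme
;; Stand-alone restatement of the test in the diff above.
(define %minimum-disk-space (* 100 (expt 2 20)))   ;assumed: 100 MiB

(define (machine-suitable? node load threshold space)
  "Return #t if a machine passes the checks.  NODE is #f for a machine
that could not be reached, in which case LOAD and SPACE are #f too."
  (and node
       (or (not threshold) (< load threshold))
       (>= space %minimum-disk-space)))

(machine-suitable? #f #f 0.6 #f)                        ;unreachable: #f
(machine-suitable? 'node 0.9 0.6 (* 200 (expt 2 20)))   ;overloaded: #f
(machine-suitable? 'node 0.3 0.6 (* 200 (expt 2 20)))   ;suitable: #t
```

Because ‘and’ stops at the first false value, a false NODE rejects the
machine before LOAD or SPACE are compared, so an unreachable machine is
discarded without its load being involved at all.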

As far as I remember, this has always been a problem for me (busy
offload machines being forever retried with no fallback to the local
machine).

Thanks,

Maxim




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Sat, 18 Dec 2021 00:12:01 GMT).


Message #26 received at 24496@debbugs.gnu.org:

From: zimoun <zimon.toutoune@gmail.com>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>, Ludovic Courtès <ludo@gnu.org>
Cc: 24496@debbugs.gnu.org, ng0 <ngillmann@runbox.com>
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Sat, 18 Dec 2021 01:10:49 +0100
Hi,

I have not checked all the details, since the code of “guix offload” is
run by root, IIUC, and so it is not as easy to debug as usual. :-)

On Fri, 17 Dec 2021 at 16:57, Maxim Cournoyer <maxim.cournoyer@gmail.com> wrote:

>> However, I think this behavior was unintentionally lost in
>> efbf5fdd01817ea75de369e3dd2761a85f8f7dd5.  Maxim, WDYT?
>
> I just reviewed this commit, and don't see anywhere where the behavior
> would have changed.  The discarding happens here:

[...]

> previously load could be set to +inf.0.  Now it is a float between 0.0
> and 1.0, with threshold defaulting to 0.6.

My /etc/guix/machines.scm contains only one machine, and the daemon runs
with --max-jobs=0.

Because the machine is unreachable, IIUC, ‘node’ is (or should be) false
and ‘load’ is thus not involved, I guess.  Indeed, ‘report-load’
displays nothing, and instead I get:

--8<---------------cut here---------------start------------->8---
The following derivation will be built:
   /gnu/store/c1qicg17ygn1a0biq0q4mkprzy4p2x74-hello-2.10.drv
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
waiting for locks or build slots...
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
  C-c C-c
--8<---------------cut here---------------end--------------->8---


Well, if the machine is not reachable, then ‘session’ is false, right?

--8<---------------cut here---------------start------------->8---
@@ -472,11 +480,15 @@ (define (machine-faster? m1 m2)
        (let* ((session (false-if-exception (open-ssh-session best
                                                              %short-timeout)))
               (node    (and session (remote-inferior session)))
-              (load    (and node (normalized-load best (node-load node))))
+              (load    (and node (node-load node)))
+              (threshold (build-machine-overload-threshold best))
               (space   (and node (node-free-disk-space node))))
+         (when load (report-load best load))
          (when node (close-inferior node))
          (when session (disconnect! session))
-         (if (and node (< load 2.) (>= space %minimum-disk-space))
+         (if (and node
+                  (or (not threshold) (< load threshold))
+                  (>= space %minimum-disk-space))
[...]
             (begin
               ;; BEST is unsuitable, so try the next one.
               (when (and space (< space %minimum-disk-space))
                 (format (current-error-port)
                         "skipping machine '~a' because it is low \
on disk space (~,2f MiB free)~%"
                         (build-machine-name best)
                         (/ space (expt 2 20) 1.)))
               (release-build-slot slot)
               (loop others)))))
--8<---------------cut here---------------end--------------->8---

Therefore, the ‘else’ branch is taken and so the code does
‘(loop others)’.

However, I do not see why ‘others’ is not empty (only one machine in
/etc/guix/machines.scm).  Well, the message «waiting for locks or build
slots...» suggests that something is restarted and that it is not this
‘loop’ we are observing but another one.

On the daemon side, I do not know what ‘waitingForAWhile’ and
‘lastWokenUp’ mean.

--8<---------------cut here---------------start------------->8---
    /* If we are polling goals that are waiting for a lock, then wake
       up after a few seconds at most. */
    if (!waitingForAWhile.empty()) {
        useTimeout = true;
        if (lastWokenUp == 0)
            printMsg(lvlError, "waiting for locks or build slots...");
        if (lastWokenUp == 0 || lastWokenUp > before) lastWokenUp = before;
        timeout.tv_sec = std::max((time_t) 1, (time_t) (lastWokenUp + settings.pollInterval - before));
    } else lastWokenUp = 0;
--8<---------------cut here---------------end--------------->8---


Bah, it requires more investigation, and I agree with Maxim that
efbf5fdd01817ea75de369e3dd2761a85f8f7dd5 is probably not the issue
here.

Cheers,
simon




Information forwarded to bug-guix@gnu.org:
bug#24496; Package guix. (Tue, 21 Dec 2021 14:29:02 GMT).


Message #29 received at 24496@debbugs.gnu.org:

From: Ludovic Courtès <ludo@gnu.org>
To: Maxim Cournoyer <maxim.cournoyer@gmail.com>
Cc: ng0 <ngillmann@runbox.com>, 24496@debbugs.gnu.org, zimoun <zimon.toutoune@gmail.com>
Subject: Re: bug#24496: offloading should fall back to local build after n tries
Date: Tue, 21 Dec 2021 15:28:35 +0100
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> I just reviewed this commit, and don't see anywhere where the behavior
> would have changed.  The discarding happens here:
>
> -         (if (and node (< load 2.) (>= space %minimum-disk-space))
> +         (if (and node
> +                  (or (not threshold) (< load threshold))
> +                  (>= space %minimum-disk-space))
>
> previously load could be set to +inf.0.  Now it is a float between 0.0
> and 1.0, with threshold defaulting to 0.6.

Ah alright, so we’re fine.

> As far as I remember, this has always been a problem for me (busy
> offload machines being forever retried with no fallback to the local
> machine).

OK, I guess I’m overlooking something.

Thanks,
Ludo’.






debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Wed Apr 16 03:55:17 2025; Machine Name: wallace-server
