Hey Ludo,
On Tue, Aug 21 2018, Ludovic Courtès wrote:
Toggle quote (5 lines)
> We’ll still want to be able to special-case things like nginx
> that can be “hot-replaced”, though. So perhaps, in addition to
> this patch on the Shepherd side, we’ll need extra stuff in (gnu
> services shepherd).
Yeah, if we expose the replacement field directly, then we'll need
some supporting code in (gnu services shepherd), even if it's just
to detect whether the service is stopped or not before doing a
replacement. Although ideally our interface wouldn't introduce
race conditions like that. (See below for more thoughts on this.)
Toggle quote (11 lines)
> For instance, the ‘actions’ field of <shepherd-service> could,
> by default, include an “upgrade” action that simply sets the
> ‘replacement’ slot. For nginx, we’d provide a custom “upgrade”
> action that does “nginx -s restart” or whatever it is that needs
> to be done.
>
> ‘guix system reconfigure’ would automatically invoke the
> ‘upgrade’ action for each new service.
>
> WDYT?
How many services can we meaningfully upgrade like this? My
understanding is that most of our system services a fed an
immutable configuration file, and thus restarting/reloading won't
actually upgrade them. In order to make an upgrade action work the
service would have to mutate itself into a new correct
configuration, as well as restarting/reloading the underlying
daemon. It's even trickier if the daemon itself has been upgraded,
because then the process will have to be restarted anyway.
At any rate, I think the replacement mechanism (this patch) is
just one way that a service can be reloaded. It would probably be
a good idea to create a higher-level abstraction over it. I think
other mechanisms (like a upgrade/reload action) should be handled
on the Guix side of things.
Toggle quote (28 lines)
>> + (let ((replacement (slot-ref service 'replacement)))
>> + (define (copy-slot! slot)
>> + (slot-set! service slot (slot-ref replacement slot)))
>> + (when replacement
>> + (copy-slot! 'provides)
>> + (copy-slot! 'requires)
>> + (copy-slot! 'respawn?)
>> + (copy-slot! 'start)
>> + (copy-slot! 'stop)
>> + (copy-slot! 'actions)
>> + (copy-slot! 'running)
>> + (copy-slot! 'docstring))
>> + service))
>
> Having a hardcoded list of slots sounds error-prone—surely we’ll
> forget to update it down the road. I wonder what else could be
> done.
>
> One option would be to grab the block asyncs and atomically
> replace the service in the ‘%services’ hash table. Then we only
> need to copy the ‘last-respawns’ slot to the new service, I
> believe. (This changes the object identity of the service but I
> think its OK.)
>
> Another option would be to use GOOPS tricks to iterate over the
> list of slots and have a list of slots *not* to copy. I’m not a
> big fan of this option, though.
My favourite option for this would be to separate the <service>
object into an immutable <service> and a mutable <service-state>.
The <service-state> object would have a reference to a <service>
object in order to invoke actions on it, and it could also hold a
second <service> object as a replacement. Then the swap would be
much more straightforward. I haven't done any real work towards
this, though.
In the short term, I'd rather replace it in the %services hash
table. I did it by copying slots because I wasn't sure I would get
the details of the swap right and didn't have time to properly
work out how to do it. I'll give it a go!
Toggle quote (18 lines)
>> +(let ((service (lookup-running 'test)))
>> + (slot-set! service 'replacement
>> + (make <service>
>
> I wonder if we should let users fiddle with ‘replacement’
> directly, or if we should provide a higher-level construct.
>
> For instance, ‘register-services’ could transparently set the
> ‘replacement’ field for services already registered instead of
> doing:
>
> (assert (null? (lookup-services (canonical-name new))))
>
> Not sure if there are cases where this behavior would be
> undesirable, though.
>
> Thoughts?
With this current patch the replacement field is only checked at
the point when the service is stopped, so the field could only be
set when the service is actually running. I think it makes the
most sense to just replace the service directly if it's not
stopped.
I can't think of any undesirable cases, but having a higher-level
interface is a good idea. At the very least we need to control the
inherent race condition involved in (if running? do-x do-y) for if
the service is stopped after the running? check. At the moment I
think the only thing we have to worry about there is signals, but
if we're going to move to have more parallelism through fibers
then we might need to be even more careful.
I'll try to send through an updated patch later this week.
Carlo