Shepherd catastrophic memory leak

  • Open
Details
4 participants
  • Dr. Arne Babenhauserheide
  • Ludovic Courtès
  • Zack Weinberg
  • Tomas Volf
Owner
unassigned
Submitted by
Zack Weinberg
Severity
normal

Zack Weinberg wrote 5 days ago
(address . bug-guix@gnu.org)
16ef00f8-2083-4141-83f1-5cd084df82c4@app.fastmail.com
I left my Guix System-based web server running for 26 days and PID 1 has
ballooned to consume 75% of all available RAM. Because of this, it can
no longer fork. Which, in turn, means the system is almost but not quite
dead in the water. Daemons that are already running, such as the actual
web server, are fine, but any transient service -- like ssh -- won't
start. I could log in on the console, because getty was already
running, but `reboot` just hangs, and if I log out I expect it won't be
able to start another getty process.

Here is some relevant troubleshooting info:

# uptime
19:08:57 up 26 days 20:17, 1 user, load average: 0.01, 0.02, 0.00

# free
              total        used        free      shared  buff/cache   available
Mem:        2020468     1768960      103008        6472      307064      251508
Swap:       2094056      168268     1925788

# ps -p 1 lc
F   UID     PID    PPID PRI  NI     VSZ     RSS WCHAN  STAT TTY        TIME COMMAND
4     0       1       0  20   0 1988980 1528612 do_epo Sl   ?        175:14 shepher
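
As a rough cross-check of the 75% figure, the resident set size of PID 1 from `ps` can be divided by the total from `free`; both are in KiB. This is only a sketch: `pct_of_ram` is a hypothetical helper name, and the two numbers are copied from the output above.

```shell
# Percentage of total RAM held by one process's resident set.
# Both arguments are in KiB, as printed by `ps` (RSS) and `free` (total).
# pct_of_ram is a hypothetical helper name, not part of any tool.
pct_of_ram() {
    LC_ALL=C awk -v rss="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * rss / total }'
}

pct_of_ram 1528612 2020468   # prints 75.7
```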

# grep -v MARK messages
2025-09-14 22:00:48 localhost shepherd[1]: Rotating '/var/log/messages' to '/var/log/messages.1'.
2025-09-14 22:00:48 localhost linux: [1638517.256304] __vm_enough_memory: pid: 1, comm: shepherd, bytes: 8388608 not enough memory for the allocation
2025-09-14 22:00:48 localhost shepherd[1]: Exception caught while calling action of timer 'log-rotation': (system-error "primitive-fork" "~A" ("Cannot allocate memory") (12))
2025-09-22 19:06:33 localhost shepherd[1]: Stopping service root...
2025-09-22 19:06:33 localhost shepherd[1]: Exiting shepherd...
2025-09-22 19:06:33 localhost shepherd[1]: Service guix-ownership is not running.
2025-09-22 19:06:33 localhost shepherd[1]: Service user-homes is not running.
2025-09-22 19:06:33 localhost shepherd[1]: Stopping service swap-7cb6821e-5fbb-48b1-85f8-74b4c41e9b7f...
2025-09-22 19:06:33 localhost linux: [2319321.058327] __vm_enough_memory: pid: 1, comm: shepherd, bytes: 2144313344 not enough memory for the allocation
2025-09-22 19:06:33 localhost shepherd[1]: Ignoring error while stopping swap-7cb6821e-5fbb-48b1-85f8-74b4c41e9b7f: (system-error "swapoff" "~S: ~A" ("/dev/vda2" "Cannot allocate memory") (12))
2025-09-22 19:06:33 localhost shepherd[1]: Service swap-7cb6821e-5fbb-48b1-85f8-74b4c41e9b7f might have failed to stop.
2025-09-22 19:06:33 localhost shepherd[1]: Service swap-7cb6821e-5fbb-48b1-85f8-74b4c41e9b7f is now stopped.
2025-09-22 19:06:34 localhost shepherd[1]: Stopping service ntpd...
2025-09-22 19:06:34 localhost ntpd[134]: ntpd exiting on signal 15 (Terminated)
2025-09-22 19:06:34 localhost shepherd[1]: Service ntpd stopped.
2025-09-22 19:06:34 localhost shepherd[1]: Service ntpd is now stopped.
2025-09-22 19:06:34 localhost shepherd[1]: Stopping service ssh-daemon...
2025-09-22 19:06:34 localhost shepherd[1]: Service ssh-daemon stopped.
2025-09-22 19:06:34 localhost shepherd[1]: Service ssh-daemon is now stopped.
2025-09-22 19:06:34 localhost shepherd[1]: Stopping service certbot-certificate-renewal...
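
The byte counts in the two `__vm_enough_memory` lines are easier to compare with `free` once converted to KiB. The sketch below does only that conversion; `bytes_to_kib` is a hypothetical helper name. The first refused allocation comes out at 8192 KiB (8 MiB), and the second at 2094056 KiB, which is the same number as the Swap total reported by `free`. That is consistent with the kernel accounting for the whole swap area when `swapoff` was attempted, though the log alone does not prove it.

```shell
# Convert a byte count from the kernel log to KiB, the unit `free` prints.
# bytes_to_kib is a hypothetical helper name used only for this illustration.
bytes_to_kib() {
    echo $(( $1 / 1024 ))
}

bytes_to_kib 8388608      # refused during log rotation: prints 8192
bytes_to_kib 2144313344   # refused during swapoff: prints 2094056
```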

--

Closely related issue: For situations just such as this, reboot(8) is
supposed to have an option (conventionally `-f/--force`) which causes it
to issue the reboot system call itself, bypassing init. But the
Shepherd's version of reboot is missing this option.
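
Until such an option exists, one possible stopgap on Linux is the kernel's magic SysRq interface, which root can drive from an already running shell by writing single letters to `/proc/sysrq-trigger`, without involving PID 1. The following is a sketch that was not tried on the affected machine. `emergency_reboot` is a hypothetical helper name, and the trigger path and delay are parameters only so that the function can be exercised against a scratch file. The `b` key reboots immediately without a clean shutdown, so this is strictly a last resort.

```shell
# Emergency reboot through the kernel's magic SysRq interface, bypassing PID 1.
# emergency_reboot is a hypothetical helper; the trigger path and delay are
# parameters only so the function can be tried against a scratch file.
emergency_reboot() {
    trigger=${1:-/proc/sysrq-trigger}
    delay=${2:-2}
    # s: sync filesystems, u: remount read-only, b: reboot immediately
    for key in s u b; do
        printf '%s\n' "$key" > "$trigger" || return 1
        sleep "$delay"
    done
}
```

Run as root with no arguments, it writes to the real trigger file. Passing a scratch file and a delay of 0 shows the sequence without touching the kernel.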

--

I was already pretty frustrated with Guix System and this memory leak is
the last straw. This server is shortly going to be reformatted with
another distribution. However, I will preserve a disk image in case it
is useful to anyone.

zw
Dr. Arne Babenhauserheide wrote 5 days ago
(name . Zack Weinberg via Bug reports for GNU Guix)(address . bug-guix@gnu.org)
87jz1qf3qn.fsf@web.de
"Zack Weinberg" via Bug reports for GNU Guix <bug-guix@gnu.org> writes:

> I left my Guix System-based web server running for 26 days and PID 1 has
> ballooned to consume 75% of all available RAM. Because of this, it can
> no longer fork. Which, in turn, means the system is almost but not
> quite
> [...]
> # free
> total used free shared buff/cache available
> Mem: 2020468 1768960 103008 6472 307064 251508
> Swap: 2094056 168268 1925788


You still have almost all swap free, so you should be able to start
programs (though slowly).

What I found, though, is that SSH can get into trouble when cgroups run
out (which happens quickly if you make heavy use of docker).

I regularly delete the unused cgroups then:

find /sys/fs/cgroup/ -depth -type d -name 'c*' | xargs -I {} sudo bash -c 'if test "$(cat {}/pids.current)" -eq 0; then echo {}; cat {}/pids.current; rmdir {}; fi'
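
The one-liner above splices each path into a `bash -c` string, which can misbehave on unusual directory names. A more defensive variant, sketched here under the same assumptions (cgroup directories whose names start with `c` and which expose `pids.current`), only prints the empty candidates and leaves the `rmdir` to the caller. `list_empty_cgroups` is a hypothetical helper name.

```shell
# List cgroup directories under a root that currently contain no processes.
# list_empty_cgroups is a hypothetical helper; it only prints candidates and
# leaves removal (rmdir) to the caller.
list_empty_cgroups() {
    find "$1" -depth -type d -name 'c*' | while IFS= read -r dir; do
        # pids.current exists only where the pids controller is enabled.
        [ -r "$dir/pids.current" ] || continue
        if [ "$(cat "$dir/pids.current")" -eq 0 ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Reviewing the output of `list_empty_cgroups /sys/fs/cgroup` before removing anything gives a chance to spot directories that should stay.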

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein,
ohne es zu merken.
draketo.de

Ludovic Courtès wrote 4 days ago
(name . Zack Weinberg via Bug reports for GNU Guix)(address . bug-guix@gnu.org)
874istzn4l.fsf@gnu.org
Hi Zack,

"Zack Weinberg" via Bug reports for GNU Guix <bug-guix@gnu.org> writes:

> I left my Guix System-based web server running for 26 days and PID 1 has
> ballooned to consume 75% of all available RAM. Because of this, it can
> no longer fork.

This is being tracked at

It would seem a workaround is to use Inetutils syslogd instead of the
built-in ‘system-log’:

  (operating-system
    ;; …
    (services (append (list …
                            (service syslog-service-type))
                      (modify-services %base-services
                        (delete shepherd-system-log-service-type)))))

Ludo’.
Dr. Arne Babenhauserheide wrote 4 days ago
(name . Zack Weinberg via Bug reports for GNU Guix)(address . bug-guix@gnu.org)
87ecrxfvt5.fsf@web.de
"Dr. Arne Babenhauserheide" <arne_bab@web.de> writes:

> "Zack Weinberg" via Bug reports for GNU Guix <bug-guix@gnu.org> writes:
>> I left my Guix System-based web server running for 26 days and PID 1 has
>> ballooned to consume 75% of all available RAM. Because of this, it can
>> no longer fork. Which, in turn, means the system is almost but not

> You still have almost all swap free, so you should be able to start

I have to take back this comment: didn’t read closely enough.
(forking copies allocated memory, so 75% mem usage kills fork)

I’m sorry for the noise.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein,
ohne es zu merken.
draketo.de

Tomas Volf wrote 16 hours ago
(name . Ludovic Courtès)(address . ludo@gnu.org)
87ecrshn69.fsf@wolfsden.cz
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> It would seem a workaround is to use Inetutils syslogd instead of the
> built-in ‘system-log’:
>
>   (operating-system
>     ;; …
>     (services (append (list …
>                             (service syslog-service-type))
>                       (modify-services %base-services
>                         (delete shepherd-system-log-service-type)))))

Thank you for the suggestion. I will give that a try. It would probably
be a good idea to mention it on the Codeberg issue as well.

Have a nice day,
Tomas

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

To comment on this conversation send an email to 79492@patchwise.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 79492
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch