Knot: Linker runs very slowly and crashes during build

  • Done
Details
3 participants
  • Ludovic Courtès
  • Tobias Geerinckx-Rice
  • Simon South
Owner
unassigned
Submitted by
Simon South
Severity
normal

Simon South wrote on 4 Oct 2020 13:56
(address . bug-guix@gnu.org)
87a6x1g17f.fsf@simonsouth.net
Building Knot 3.0.0 using "guix build knot" consistently appears to hang
for me when it gets to this point during the linking stage:

CCLD knsec3hash
ar: `u' modifier ignored since `D' is the default (see `U')
CCLD kdig
CCLD khost

While it sits here the compiler is tying up 100% of a single CPU
core. On my ROCK64 with 4 GB of RAM, it eventually crashes with an
internal error:

gcc: internal compiler error: Killed (program cc1)
Please submit a full bug report,
with preprocessed source if appropriate.
See https://gcc.gnu.org/bugs/ for instructions.
make[3]: *** [Makefile:5381: libzscanner/la-scanner.lo] Error 1
make[3]: Leaving directory '/tmp/guix-build-knot-3.0.0.drv-0/knot-3.0.0/src'

dmesg shows the compiler was killed for running out of memory:

cc1 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
CPU: 2 PID: 22340 Comm: cc1 Not tainted 5.8.11-gnu #1
(...)
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=cc1,pid=22340,uid=999
Out of memory: Killed process 22340 (cc1) total-vm:2573780kB, anon-rss:2540708kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:5044kB oom_score_adj:0
oom_reaper: reaped process 22340 (cc1), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
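
To put the kernel report above in more familiar units, here is a minimal shell sketch that extracts the anon-rss figure from the "Out of memory" line and converts it to MiB. The helper name oom_rss_mib is hypothetical; the sample line is copied from the dmesg output above.

```shell
# Sample line copied from the dmesg output in this report.
line='Out of memory: Killed process 22340 (cc1) total-vm:2573780kB, anon-rss:2540708kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:5044kB oom_score_adj:0'

# oom_rss_mib LINE (hypothetical helper): print the anon-rss field of an
# OOM-killer log line in MiB, using integer division (rounds down).
oom_rss_mib() {
    kb=$(printf '%s\n' "$1" | sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p')
    [ -n "$kb" ] || return 1    # no anon-rss field found
    echo $((kb / 1024))
}

oom_rss_mib "$line"    # prints 2481
```

That is roughly 2.4 GiB resident in a single cc1 process at the moment it was killed.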

On my x86_64 machine the build eventually completes (that machine has
much more memory), but there is the same, weirdly long delay during
linking while the compiler runs.

I see no such delay however when I build the code "manually", using
"guix environment --pure knot" or even "guix environment --no-grafts
--container knot" as the manual suggests. The build then completes
quickly and successfully on either machine; the problem appears to
happen only when guix-daemon is involved.

Is there a known issue that can cause the linker to consume orders of
magnitude more resources when run by the Guix build process?

Apart from rebuilding gcc with debugging symbols (which seems to make
Guix want to rebuild every other package in the system as well) and
trying to understand what the compiler is doing, how might I go about
diagnosing this?

--
Simon South
simon@simonsouth.net
Simon South wrote on 4 Oct 2020 16:01
(address . 43802@debbugs.gnu.org)
87mu11egul.fsf@simonsouth.net
So naturally, as soon as I submit the bug report something occurs to me
that gets me unstuck.

The delay and crash are occurring while libtool is using gcc to compile
src/libzscanner/scanner.c, which appears to be generated at build time
from the file scanner.c.t0 in the same directory.

When I build Knot on my own, scanner.c has a size of 272 KB. When guix
builds it, scanner.c somehow balloons out to 1.9 MB! So naturally gcc is
going to need some time and space to make its way through all that code.

In fact the build process actually points out

NOTE: Compilation of scanner.c can take several minutes!

So perhaps all this is completely expected. Still... 1.9 MB. Of C
code. It's tempting to think something is going wrong here. (And anyway,
why the huge discrepancy in file size?)
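
One way to catch this kind of discrepancy early is to check the size of the generated source before compiling it. The following is only a sketch: size_kib and check_source_size are hypothetical helpers, and the 1024 KiB limit is an arbitrary value chosen to separate the two sizes reported above.

```shell
# size_kib FILE (hypothetical helper): print FILE's size in KiB, rounded up.
size_kib() {
    [ -f "$1" ] || return 1
    echo $(( ($(wc -c < "$1") + 1023) / 1024 ))
}

# check_source_size FILE LIMIT_KIB (hypothetical helper): warn when FILE
# is larger than LIMIT_KIB, otherwise report its size.
check_source_size() {
    kib=$(size_kib "$1") || return 1
    if [ "$kib" -gt "$2" ]; then
        echo "warning: $1 is ${kib} KiB (limit $2 KiB)"
    else
        echo "ok: $1 is ${kib} KiB"
    fi
}

# Only runs inside an unpacked Knot source tree; the 1024 KiB limit is an
# assumption, not a value taken from Knot.
if [ -f src/libzscanner/scanner.c ]; then
    check_source_size src/libzscanner/scanner.c 1024
fi
```

Run from the top of the build directory, this should print a warning for the 1.9 MB file and an "ok" line for the 272 KB one.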

I'm investigating.

--
Simon South
simon@simonsouth.net
Simon South wrote on 4 Oct 2020 17:09
(address . 43802@debbugs.gnu.org)
878scledol.fsf@simonsouth.net
Turns out this is not a bug. Knot ships with two parser implementations:
A smaller, slower one (272 KB) and a larger, faster one (1.9 MB). The
larger one is a bit too big to build reliably on systems with 4 GB or
less of available memory.

To test Knot on these machines, you can run "configure" with
"--disable-fastparser" as an argument (or edit gnu/packages/dns.scm to
do so) to force it to use the smaller parser. This also allows the build
to complete more quickly on systems that can use either.

So how was I getting the smaller implementation in my own builds without
realizing it? The configure script has some magical behaviour: It will
automatically select the faster-building implementation if it finds a
".git" folder in the current directory. This is presumably meant to help
developers, but the confusion it caused me demonstrates why I think this
sort of magical programming is bad practice.
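
The selection behaviour described above can be sketched in shell as follows. This is an illustration only, not Knot's actual configure code: choose_parser is a hypothetical function, and "--enable-fastparser" is assumed to exist as the conventional Autoconf counterpart of "--disable-fastparser". Here "fast" means the larger, faster parser and "slow" means the smaller one that compiles quickly.

```shell
# choose_parser SRCDIR [FLAG] (hypothetical): print "fast" or "slow".
# An explicit flag always wins. Without one, a ".git" directory in SRCDIR
# selects the slow parser, as in a developer checkout.
choose_parser() {
    case "${2:-}" in
        --enable-fastparser)  echo fast; return 0 ;;   # assumed flag name
        --disable-fastparser) echo slow; return 0 ;;
    esac
    if [ -d "$1/.git" ]; then
        echo slow
    else
        echo fast
    fi
}

# Demonstration with a throwaway directory.
demo=$(mktemp -d)
choose_parser "$demo"                        # release tarball: fast
mkdir "$demo/.git"
choose_parser "$demo"                        # Git checkout: slow
choose_parser "$demo" --enable-fastparser    # explicit flag overrides: fast
rm -rf "$demo"
```

Passing the flag explicitly removes the dependence on the directory's contents, which is why "--disable-fastparser" gives the same result under guix-daemon and in a manual build.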

At any rate, this bug report can be closed.

--
Simon South
simon@simonsouth.net
Simon South wrote on 4 Oct 2020 17:16
(address . control@debbugs.gnu.org)
87362tedcn.fsf@simonsouth.net
tags 43802 + notabug
close 43802
thanks

--
Simon South
simon@simonsouth.net
Ludovic Courtès wrote on 5 Oct 2020 07:15
Re: bug#43802: Knot: Linker runs very slowly and crashes during build
(name . Simon South)(address . simon@simonsouth.net)(address . 43802@debbugs.gnu.org)
87ft6swyg6.fsf@gnu.org
Hi,

Simon South <simon@simonsouth.net> writes:

> Building Knot 3.0.0 using "guix build knot" consistently appears to hang
> for me when it gets to this point during the linking stage:
>
> CCLD knsec3hash
> ar: `u' modifier ignored since `D' is the default (see `U')
> CCLD kdig
> CCLD khost
>
> While it sits here the compiler is tying up 100% of a single CPU
> core. On my ROCK64 with 4 GB of RAM, it eventually crashes with an
> internal error:
>
> gcc: internal compiler error: Killed (program cc1)
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <https://gcc.gnu.org/bugs/> for instructions.
> make[3]: *** [Makefile:5381: libzscanner/la-scanner.lo] Error 1
> make[3]: Leaving directory '/tmp/guix-build-knot-3.0.0.drv-0/knot-3.0.0/src'
>
> dmesg shows the compiler was killed for running out of memory:
>
> cc1 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> CPU: 2 PID: 22340 Comm: cc1 Not tainted 5.8.11-gnu #1
> (...)
> oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=cc1,pid=22340,uid=999
> Out of memory: Killed process 22340 (cc1) total-vm:2573780kB, anon-rss:2540708kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:5044kB oom_score_adj:0
> oom_reaper: reaped process 22340 (cc1), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> On my x86_64 machine the build eventually completes (that machine has
> much more memory), but there is the same, weirdly long delay during
> linking while the compiler runs.

Is this an LTO build (with ‘-flto’ in the compile and link flags)? That
could explain the memory requirements.

Ludo’.
Tobias Geerinckx-Rice wrote on 5 Oct 2020 08:26
(address . 43802@debbugs.gnu.org)
875z7oae3z.fsf@nckx
Simon,

Would it make sense to provide a faster-building slower-starting
Knot variant alongside the main package?

Ludovic Courtès writes:
> Is this an LTO build (with ‘-flto’ in the compile and link flags)? That
> could explain the memory requirements.

No, but good guess.

Simon South writes:
> Turns out this is not a bug.

The fast parser is written in Ragel[0], which compiles down to
almost 2 MiB of ‘C’, which is then thrown at GCC to sort out. I
know to put the kettle on before hacking on Knot locally.

What I didn't know was that these generated C files were included
in the release tarball. We have the Ragel, we can rebuild them,
and we now do so in commit
2b73e50c31a61b5dcef35a1e4b9484d9dbcb0fbc. Thanks for bringing it
to my attention.

Kind regards,

T G-R

Simon South wrote on 5 Oct 2020 08:44
(name . Tobias Geerinckx-Rice)(address . me@tobias.gr)
87o8lgbrtq.fsf@simonsouth.net
Tobias Geerinckx-Rice <me@tobias.gr> writes:
> Would it make sense to provide a faster-building slower-starting Knot
> variant alongside the main package?

I'm inclined to say "no", especially if we assume a substitute will
(nearly always) be available.

Unless someone is hacking on the scanner directly it ought to be safe to
add "--disable-fastparser" to dns.scm temporarily during testing, then
remove it before submitting a patch. If it isn't then probably _that_ is
the bug to be fixed.

> What I didn't know was that these generated C files were included in
> the release tarball. We have the Ragel, we can rebuild them, and we
> now do so in commit 2b73e50c31a61b5dcef35a1e4b9484d9dbcb0fbc.

Neat!

--
Simon South
simon@simonsouth.net
Ludovic Courtès wrote on 7 Oct 2020 15:06
(name . Simon South)(address . simon@simonsouth.net)
87o8ldsnc4.fsf@gnu.org
Simon South <simon@simonsouth.net> writes:

>> What I didn't know was that these generated C files were included in
>> the release tarball. We have the Ragel, we can rebuild them, and we
>> now do so in commit 2b73e50c31a61b5dcef35a1e4b9484d9dbcb0fbc.
>
> Neat!

+1, yay for bootstrapping!

Ludo’.