Stuck builds in Cuirass

  • Done
Details
3 participants
  • Ludovic Courtès
  • Marius Bakke
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Marius Bakke
Severity
normal

Marius Bakke wrote on 23 Nov 2022 04:50
(address . bug-guix@gnu.org)
87tu2pzvfo.fsf@gnu.org
Hi,

Cuirass has a tendency to not notice when a build is finished, leaving
it in a "running" state.

The phenomenon can be observed by going to
https://ci.guix.gnu.org/status and looking at builds that have been
running for a suspiciously long time.

Typically the build log will indicate that it has finished, yet Cuirass
is patiently waiting...and not scheduling further builds.

Restarting the builds typically gets things going again.

I wrote a nasty script to automatically restart builds that are running
for >1 hour, but it's not a sustainable solution:
#!/usr/bin/env python3

# Restart stuck builds.... TODO fix cuirass properly.

import re
import time

import requests
from bs4 import BeautifulSoup

# Assumption: the definition of builds_page was lost from the archived
# message; the status page mentioned above is the likely value.
builds_page = "https://ci.guix.gnu.org/status"

builds_html = requests.get(builds_page).text

soup = BeautifulSoup(builds_html, "html5lib")
main = soup.find('main', {'id': 'content'})
table = main.find('table')

result = {}

# Collect the running builds from the status table.
for row in table.find_all('tr'):
    data = row.find_all('td')
    if len(data) > 0:  # skip rows without <td> cells, such as a header row
        build_id = row.find('a').contents[0]
        name = data[0].contents[0]
        age = data[1].contents[0]
        system = data[2].contents[0]

        result[build_id] = {'name': name, 'age': age, 'system': system}

age_re = re.compile(r"(\d+) (\w+) ago")
restart = []

# Pick builds whose age is reported in hours, or in more than 60 minutes.
for build_id in result:
    match = age_re.match(result[build_id]['age'])
    if match is not None:  # no match for e.g. "seconds ago"
        digits = match.group(1)
        time_unit = match.group(2)
        if time_unit == "hours":
            restart.append(build_id)
        elif time_unit == "minutes" and int(digits) > 60:
            restart.append(build_id)

certificate_file = "/home/marius/tmp/mbakke.cert.pem"
certificate_key = "/home/marius/tmp/mbakke.key.pem"

print(f"Found {len(restart)} stuck builds..!")

for build_id in restart:
    print(f"Going to restart {result[build_id]['name']} ({build_id}, running since {result[build_id]['age']})...")
    # The start of the restart request was lost from the archived message;
    # only its final argument survives:
    #     cert=(certificate_file, certificate_key))
    time.sleep(3)
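The age check in the script above compares unit names as strings, so it silently ignores ages reported in other units. A minimal sketch of a more uniform check follows, assuming the status page prints ages as "<number> <unit> ago" with an optional plural "s". That format is inferred from the script's regular expression, not confirmed, and `parse_age`, `is_stuck` and `UNIT_SECONDS` are hypothetical names.

```python
import re
from datetime import timedelta

# Assumed age format: "<number> <unit> ago", e.g. "3 hours ago".
# The trailing "s" of a plural unit is dropped by the pattern.
AGE_RE = re.compile(r"(\d+) (\w+?)s? ago")

# Units this sketch understands; anything else is treated as unknown.
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_age(text):
    """Return the age as a timedelta, or None if the text is not understood."""
    match = AGE_RE.match(text.strip())
    if match is None:
        return None
    seconds = UNIT_SECONDS.get(match.group(2))
    if seconds is None:
        return None
    return timedelta(seconds=int(match.group(1)) * seconds)


def is_stuck(text, threshold=timedelta(hours=1)):
    """True if the build has been running for longer than the threshold."""
    age = parse_age(text)
    return age is not None and age > threshold
```

Converting to a `timedelta` first keeps the one-hour threshold in a single place and makes it adjustable. Unlike the original loop, this version would also flag ages given in days. Ages it cannot parse are never flagged.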

Mathieu Othacehe wrote on 23 Nov 2022 05:26
(name . Marius Bakke)(address . marius@gnu.org)(address . 59514@debbugs.gnu.org)
874jupx0md.fsf@gnu.org
Hello Marius,

> Cuirass has a tendency to not notice when a build is finished, leaving
> it in a "running" state.
>
> The phenomenon can be observed by going to
> <https://ci.guix.gnu.org/status> and looking at builds that have been
> running for a suspiciously long time.

I suspect this is caused by https://issues.guix.gnu.org/59510, which
causes the worker threads to bail out.

We can probably merge those two issues. The
/var/log/cuirass-remote-server.log file on Berlin also indicates when
the build-succeeded or build-failed message is received by the server,
and how long the fetch from the worker took.

Thanks,

Mathieu
Ludovic Courtès wrote on 14 Jul 14:43 -0700
control message for bug #59514
(address . control@debbugs.gnu.org)
87r0bvy7id.fsf@gnu.org
tags 59514 wontfix
close 59514
quit

This issue is archived.

To comment on this conversation send an email to 59514@patchwise.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 59514
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch