#24937 - "deleting unused links" GC phase is too slow


Message #22 received at 24937@debbugs.gnu.org:

From: Mark H Weaver <mhw@netris.org>
To: ludo@gnu.org (Ludovic Courtès)
Subject: Re: bug#24937: "deleting unused links" GC phase is too slow
Date: Sun, 11 Dec 2016 14:27:33 -0500

ludo@gnu.org (Ludovic Courtès) writes:

> Mark H Weaver <mhw@netris.org> skribis:
>
>> I think we should sort the entire directory using merge sort backed by
>> disk files.  If we load chunks of the directory, sort them and process
>> them individually, I expect that this will increase the amount of I/O
>> required by a non-trivial factor.  In each pass, we would load blocks of
>> inodes from disk, almost all of which are likely to be present in the
>> store and thus linked from the directory, but in this scheme we will
>> process only a small number of them and drop the rest on the floor to be
>> read again in the next pass.  Given that even my fairly optimal
>> implementation takes about 35 minutes to run on Hydra, I'd prefer to
>> avoid multiplying that by a non-trivial factor.
>
> Sure, though it’s not obvious to me how much of a difference it makes;
> my guess is that processing in large chunks is already a win, but we’d
> have to measure.

I agree, it would surely be a win.  Given that it currently takes on the
order of a day to run this phase on Hydra, if your proposed method takes
2 hours, that would be a huge win, but still not good, IMO.  Even 35
minutes is slower than I'd like.

>> Why not just use GNU sort?  It already exists, and does exactly what we
>> need.
>
> Does ‘sort’ manage to avoid reading whole files into memory?

Yes, it does.  I monitored the 'sort' process when I first ran my
optimized pipeline.  It created about 10 files in /tmp, approximately 70
megabytes each as I recall, and then read them all concurrently while
writing the sorted output.

My guess is that it reads a manageable chunk of the input, sorts it in
memory, and writes it to a temporary file.  I guess it repeats this
process, writing multiple temporary files, until the entire input is
consumed, and then reads all of those temporary files, merging them
together into the output stream.
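
The scheme guessed at above can be sketched as follows.  This is only an
illustration of the general external merge sort technique, not the code
of GNU sort or of the daemon; the names external_sort and _write_run and
the chunk_size parameter are made up for the example.  Input lines are
assumed to end in a newline.

```python
import heapq
import os
import tempfile


def _write_run(chunk, key, tmp_dir):
    """Sort one in-memory chunk and write it to a temporary run file."""
    chunk.sort(key=key)
    fd, path = tempfile.mkstemp(prefix="run-", dir=tmp_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as run:
        run.writelines(chunk)
    return path


def external_sort(lines, chunk_size=100_000, key=None, tmp_dir=None):
    """Yield newline-terminated lines in sorted order.

    At most chunk_size lines are held in memory while the runs are
    created; during the merge only one line per run is held.
    """
    run_paths = []
    try:
        # Phase 1: read a manageable chunk, sort it, write it out.
        chunk = []
        for line in lines:
            chunk.append(line)
            if len(chunk) >= chunk_size:
                run_paths.append(_write_run(chunk, key, tmp_dir))
                chunk = []
        if chunk:
            run_paths.append(_write_run(chunk, key, tmp_dir))

        # Phase 2: read all runs concurrently and merge them.
        runs = [open(p, "r", encoding="utf-8", newline="") for p in run_paths]
        try:
            yield from heapq.merge(*runs, key=key)
        finally:
            for run in runs:
                run.close()
    finally:
        for path in run_paths:
            os.remove(path)
```

With a fixed chunk_size the memory use stays bounded regardless of the
input size, at the price of writing and re-reading the data once.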

>> If you object to using an external program for some reason, I would
>> prefer to re-implement a similar algorithm in the daemon.
>
> Yeah, I’d rather avoid serializing the list of file names/inode number
> pairs just to invoke ‘sort’ on that.

Sure, I agree that it would be better to avoid that, but IMO not at the
cost of using O(N) memory instead of O(1) memory, nor at the cost of
multiplying the amount of disk I/O by a non-trivial factor.
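
For comparison, the serialization step under discussion might look
roughly like the sketch below.  It is not the pipeline I ran; the
function name, the "inode<TAB>name" line format and the 64M buffer size
are assumptions chosen for the example, and file names are assumed to
contain neither tabs nor newlines.  The options used (-t, -k, -S, -T)
are standard GNU sort options.

```python
import os
import subprocess


def sort_links_by_inode(pairs, buffer_size="64M", tmp_dir=None):
    """Yield (inode, name) pairs ordered by inode number via sort(1)."""
    cmd = ["sort", "-t", "\t", "-k1,1n", "-S", buffer_size]
    if tmp_dir is not None:
        cmd += ["-T", tmp_dir]  # where sort keeps its temporary files
    env = dict(os.environ, LC_ALL="C")  # byte-wise, locale-independent order
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, env=env)

    # sort emits nothing until it has seen end of input, so writing
    # everything before reading cannot deadlock on a full pipe.
    for inode, name in pairs:
        proc.stdin.write(f"{inode}\t{name}\n")
    proc.stdin.close()

    for line in proc.stdout:
        inode, name = line.rstrip("\n").split("\t", 1)
        yield int(inode), name
    proc.stdout.close()

    if proc.wait() != 0:
        raise RuntimeError(f"sort exited with status {proc.returncode}")
```

The caller never holds more than one pair at a time; spilling to disk is
left entirely to sort.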

> Also, what algorithm are you referring to?

The algorithm I described above, which I guess is close to what GNU sort
does.

    Thanks,
      Mark


debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Sun Sep 7 12:02:43 2025; Machine Name: wallace-server
