GNU bug report logs

#24937 "deleting unused links" GC phase is too slow

PackageSource(s)Maintainer(s)
guix PTS Buildd Popcon
Full log

Message #55 received at 24937@debbugs.gnu.org (full text, mbox, reply):

Received: (at 24937) by debbugs.gnu.org; 9 Nov 2021 14:45:07 +0000
From debbugs-submit-bounces@debbugs.gnu.org Tue Nov 09 09:45:07 2021
Received: from localhost ([127.0.0.1]:33131 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1mkSN1-0001r7-9h
	for submit@debbugs.gnu.org; Tue, 09 Nov 2021 09:45:07 -0500
Received: from eggs.gnu.org ([209.51.188.92]:45024)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <ludo@gnu.org>) id 1mkSMr-0001qB-VM
 for 24937@debbugs.gnu.org; Tue, 09 Nov 2021 09:45:05 -0500
Received: from [2001:470:142:3::e] (port=42980 helo=fencepost.gnu.org)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@gnu.org>)
 id 1mkSMm-000311-MW; Tue, 09 Nov 2021 09:44:52 -0500
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org;
 s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To:
 From; bh=NEsvPAoIboauvgGdWEv4VETmqnJYGYUInlf47meHIs4=; b=iCKj4CS3UpSRDF+ZevUi
 rGzFmFPw3/rdNonQOCY15bb3+iK0K5Zt3aRuDQ07OwuT43tXESR7TsdFZOx7KWzDkhOv3pklrxhri
 4IWHlLlyaZNsndzKGooOtU8M5evaiaLf/0PrJw1sYRhwix3jfHEHlQLobUAuHC1yKB3ZYU5v33hL1
 WflBs6R1zJgtGVW9hPiTMYf4mNk6hEtflOYV9OmIaXy/ev6uNt5Uk7P7nTvEWFgVrtGrKDKwcajhm
 Nd+B3Y+QknRXn/rcJPCVcBJJshoKvvWwsyhK31HkyAd4QTGCI3NSCfofH6Bp5BJg6o8mbuYzvnoyH
 1iaU7at4C0LFaQ==;
Received: from 91-160-117-201.subs.proxad.net ([91.160.117.201]:63044
 helo=ribbon)
 by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <ludo@gnu.org>)
 id 1mkSMm-0005Z1-7B; Tue, 09 Nov 2021 09:44:52 -0500
From: Ludovic Courtès <ludo@gnu.org>
To: 24937@debbugs.gnu.org
Subject: Re: bug#24937: "deleting unused links" GC phase is too slow
References: <87wpg7ffbm.fsf@gnu.org>
Date: Tue, 09 Nov 2021 15:44:49 +0100
In-Reply-To: <87wpg7ffbm.fsf@gnu.org> ("Ludovic Courtès"'s message of "Sun, 13 Nov 2016 18:41:01 +0100")
Message-ID: <87pmr9l76m.fsf@gnu.org>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 24937
Cc: Maxim Cournoyer <maxim.cournoyer@gmail.com>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -3.3 (---)
[Message part 1 (text/plain, inline)]
Hi!

ludo@gnu.org (Ludovic Courtès) skribis:

> ‘LocalStore::removeUnusedLinks’ traverses all the entries in
> /gnu/store/.links and calls lstat(2) on each one of them and checks
> ‘st_nlink’ to determine whether they can be deleted.
>
> There are two problems: lstat(2) can be slow on spinning disks as found
> on hydra.gnu.org, and the algorithm is proportional in the number of
> entries in /gnu/store/.links, which is a lot on hydra.gnu.org.

Taking a step back, we could perhaps mitigate this with heuristics to
reduce the number of entries in .links:

  1. Do not deduplicate files with a size lower than some threshold;

  2. Delete links with st_nlink <= 3 (instead of <= 2); that would
     prevent *further* deduplication of those files, but they’d already
     have two instances sharing the same inode;

  3. Stop deduplicating once the number of entries in .links has reached
     a certain threshold.

For #1, a key insight is that about 30% of the files actually
deduplicated (in my store, where /gnu/store/.links has 2.2M entries) are
smaller than 1 KiB:

[size-deduplicated.png (image/png, inline)]
[Message part 3 (text/plain, inline)]
… but 85% of them have at most 4 links (thus, saving up to 2 KiB by
deduplicating):

[nlink-small.png (image/png, inline)]
[Message part 5 (text/plain, inline)]
On my laptop, we’re talking about space savings of 325 MiB, a tiny
fraction of my store:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (saved-space (filter (lambda (file)
					    (< (deduplicated-file-size file) 1024))
					  l))
$40 = 325914739
--8<---------------cut here---------------end--------------->8---

Files smaller than 1 KiB represent 35% of the entries in .links:

[size.png (image/png, inline)]
[Message part 7 (text/plain, inline)]
By not deduplicating files smaller than 1 KiB, we’d reduce the number of
entries by 35%, which should already have a tangible impact on
performance.  It’d be a “mitigation” more than a “fix”, but it has a
good work/reward ratio.

We could conduct a similar analysis for #2.

#3 is more difficult to implement because you cannot know the number of
entries in .links until you’ve traversed it (note that currently
deduplication stops when link(2) returns ENOSPC in .links).

I’m attaching the script I’ve used for that, derived from an earlier
experiment¹.  You’re welcome to give it a spin!

Thoughts?

Ludo’.

¹ https://lists.gnu.org/archive/html/guix-devel/2014-09/msg00422.html

[chart-deduplication.scm (text/plain, inline)]
(use-modules (charting)
             ((guix store) #:select (%store-prefix))
             (ice-9 ftw)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-9))

(define-record-type <deduplicated-file>
  (deduplicated-file name size links)
  deduplicated-file?
  (name    deduplicated-file-name)
  (size    deduplicated-file-size)
  (links   deduplicated-file-link-count))

(define %links-directory
  (string-append (%store-prefix) "/.links"))

(define (links)
  "Return a list of <deduplicated-file>."
  (file-system-fold (const #t)
                    (lambda (file stat result)      ;leaf
                      (cons (deduplicated-file file
                                               (stat:size stat)
                                               (stat:nlink stat))
                            result))
                    (lambda (directory stat result) ;down
                      result)
                    (lambda (directory stat result) ;up
                      result)
                    (const #f)                      ;skip
                    (lambda (file stat errno result)
                      (error "i/o error" file errno))
                    '()
                    %links-directory
                    lstat))

(define KiB (expt 2 10))
(define MiB (* KiB KiB))
(define GiB (* KiB MiB))

(define (saved-space files)
  "Return the total amount of saved space given FILES, a list of
<deduplicated-file>."
  (fold (lambda (df result)
          (match df
            (($ <deduplicated-file> name size links)
             (when (< links 2)
               (error "too few links" name links))
             (+ result (* size (- links 2))))))
        0
        files))

(define (cumulative-distribution files property)
  "Return a list of (VALUE . COUNT) pairs representing the number of FILES
whose PROPERTY is VALUE or less."
  (define (file<? df1 df2)
    (< (property df1) (property df2)))

  (fold (lambda (df result)
          (match result
            (((value . count) . rest)
             (let ((current (property df)))
               (if (= value current)
                   (alist-cons value (+ count 1) rest)
                   (alist-cons current (+ count 1) result))))
            (_
             (alist-cons (property df) 1 result))))
        '()
        (sort files file<?)))

(define* (plot-distribution distribution output
                            #:key (subtitle "SUBTITLE") max-x
                            (group-name "GROUP") x-axis-label)
  "Plot DISTRIBUTION, and produce file OUTPUT."
  (define (log2 n)
    (let loop ((result 1))
      (if (zero? (ash n (- result)))
          (- result 1)
          (loop (+ 1 result)))))

  (define (format-log2-tick tick)
    ;; (string-append "2^"
    ;;                (number->string (log2 (inexact->exact tick))))
    (number->string (inexact->exact tick)))

  (define (adjust-items total)
    (lambda (x)
      (match x
        ;; XXX: Filter out the two cases that would give us a numerical
        ;; overflow.
        ((0 . _) #f)
        ((1 . _) #f)
        ((value . count)
         (and (or (not max-x) (< value max-x))
              (cons value (* 100. (/ count total))))))))

  (match distribution
    (((_ . total) . rest)
     (let ((percent (filter-map (adjust-items total) distribution)))
       (make-scatter-plot #:title (string-append "Cumulative distribution by "
                                                 subtitle)
                          #:data `((,group-name ,@percent))
                          #:x-axis-label x-axis-label
                          #:y-axis-label "%"
                          #:tick-label-formatter format-log2-tick
                          #:log-x-base 2
                          #:min-x 1
                          #:max-y 101
                          #:write-to-png output)))))

#! Examples
(define l (links))  ;this is the expensive part
(plot-distribution (cumulative-distribution l deduplicated-file-link-count)
                   "/tmp/nlink.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count" #:max-x 2048
                   #:group-name "nlinks")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                            l)
                    deduplicated-file-link-count)
                   "/tmp/nlink-small.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count for files < 1KiB" #:max-x 2048
                   #:group-name "nlinks")


(plot-distribution (cumulative-distribution l deduplicated-file-size)
                   "/tmp/size.png" #:x-axis-label "file size"
                   #:subtitle "file size" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (f)
                              (> (deduplicated-file-link-count f) 2))
                            l)
                    deduplicated-file-size)
                   "/tmp/size-deduplicated.png" #:x-axis-label "file size" #:subtitle
                   "size for files actually deduplicated" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                            l)
                    (lambda (file)
                      (* (deduplicated-file-size file)
                         (- (deduplicated-file-link-count file) 2))))
                   "/tmp/size-savings.png" #:x-axis-label "savings" #:subtitle
                   "savings for files < 1KiB" #:max-x 32768
                   #:group-name "savings (B)")
!#

Send a report that this bug log contains spam.


debbugs.gnu.org maintainers <help-debbugs@gnu.org>. Last modified: Sun Sep 7 11:44:40 2025; Machine Name: wallace-server

GNU bug tracking system

Debbugs is free software and licensed under the terms of the GNU Public License version 2. The current version can be obtained from https://bugs.debian.org/debbugs-source/.

Copyright © 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson, 2005-2017 Don Armstrong, and many other contributors.