Compute and display distribution of origins by forge
This meta-task tracks the activities related to computing and displaying the distribution of sources by forge/source code provider. This involves:
- identifying the forges/source code providers (easy for regularly crawled ones, more tricky for the save code now entries)
- finding an efficient way of maintaining a counter of sources per forge/source code provider (HyperLogLog again?)
- setting up an API entry point to get this information
- displaying this information in a nice readable way on archive.softwareheritage.org (maybe on a dedicated page); options:
- pie chart (beware, GitHub may use up all the space, so the info will be of little use)
- sorted list (from bigger to smaller), maybe in a scrollable widget 20 lines high
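As a rough illustration of the "sorted list" option, and of one way to keep a dominant forge from hiding the long tail, here is a minimal sketch. The function name `summarize_by_forge`, the `top_n` parameter, the "other" bucket and the sample counts are all hypothetical and not part of any existing swh code.

```python
from typing import Dict, List, Tuple


def summarize_by_forge(
    counts: Dict[str, int], top_n: int = 20
) -> List[Tuple[str, int, float]]:
    """Return (forge, count, share) rows sorted from biggest to smallest.

    Forges beyond the first `top_n` are merged into a single "other" row.
    """
    total = sum(counts.values())
    # Sort by descending count, then by name to get a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    head, tail = ranked[:top_n], ranked[top_n:]
    rows = [(forge, n, n / total if total else 0.0) for forge, n in head]
    if tail:
        other = sum(n for _, n in tail)
        rows.append(("other", other, other / total if total else 0.0))
    return rows


# Illustrative counts only (not real data).
example = {"forge-a": 900, "forge-b": 60, "forge-c": 30, "forge-d": 10}
for forge, n, share in summarize_by_forge(example, top_n=2):
    print(f"{forge:10} {n:6d} {share:6.1%}")
```

Showing the share next to the raw count keeps small forges readable in a list even when one forge accounts for most of the total.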
Some related work has already been done and was tracked in #1463 and #1500 (now closed, why?)
Migrated from T3127 (view on Phabricator)
- swh/infra/sysadm-environment #3402
Activity
- Roberto Di Cosmo mentioned in issue swh-lister#3442 (closed)
- Roberto Di Cosmo mentioned in merge request swh-scheduler!243 (closed)
- Roberto Di Cosmo mentioned in merge request swh-counters!19 (closed)
- Roberto Di Cosmo mentioned in merge request !592 (closed)
- Roberto Di Cosmo mentioned in merge request !593 (closed)
- Roberto Di Cosmo added Metrics/monitoring Roadmap 2021 Web app meta-task labels
- vlorentz added priority:Normal label
- vlorentz assigned to @anlambert
- Maintainer
After some analysis, the data we need to properly implement this are:
- the set of lister names and their instance names in order to organize origins by forge types (gitlab, cgit, sourceforge, ...)
- a precise or estimated count for the origins listed by a given lister instance
Getting the set of listers and their instances can be done with a simple query to the scheduler database:
softwareheritage-scheduler=> select name, instance_name from listers order by name;
CRAN | cran
GNU | GNU
bitbucket | bitbucket
cgit | alpinelinux
cgit | git.gnu.org.ua
cgit | zx2c4
cgit | tor
cgit | hdiff.luite
cgit | gnu-savannah
cgit | openembedded
cgit | git.joeyh.name
cgit | baserock
cgit | git-kernel
cgit | fedora
cgit | qt.io
cgit | eclipse
cgit | yoctoproject
debian | Debian
debian | Debian-Security
gitea | git.fsfe.org
gitea | codeberg.org
github | github
gitlab | riseup
gitlab | lip6
gitlab | inria
gitlab | freedesktop
gitlab | ow2
gitlab | common-lisp
gitlab | gnome
gitlab | gite.lirmm
gitlab | gitlab
gitlab | framagit
launchpad | launchpad
npm | npm
phabricator | swh
phabricator | wikimedia
phabricator | blender
phabricator | llvm
phabricator | kde
pypi | pypi
save-code-now | archive.softwareheritage.org
sourceforge | main
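To organize origins by forge type, the rows returned by that query can be grouped by lister name. The sketch below assumes the rows are already available as `(name, instance_name)` tuples; the helper name `group_instances_by_lister` is hypothetical, and only a few rows from the output above are reused as sample input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_instances_by_lister(
    rows: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group lister instance names by lister name (i.e. forge type)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, instance_name in rows:
        grouped[name].append(instance_name)
    # Sort instances so the result does not depend on row order.
    return {name: sorted(instances) for name, instances in grouped.items()}


# A few rows copied from the query output above.
rows = [
    ("cgit", "tor"),
    ("cgit", "eclipse"),
    ("gitlab", "inria"),
    ("gitlab", "gnome"),
    ("github", "github"),
]
for forge_type, instances in group_instances_by_lister(rows).items():
    print(forge_type, instances)
```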
To get the count of loaded origins for a given lister instance, the best solution from my point of view is to extend swh-counters by processing the URLs from the origin topic of swh-journal. This has the advantage of also processing origins submitted through the save code now service.
For the record, I made a little experiment yesterday by hacking on the swh-counters code and adding the following code to process origins:
    from collections import defaultdict
    from typing import Dict
    from urllib.parse import urlparse

    import msgpack
    # swh-counters Redis backend, not redis-py (import path assumed)
    from swh.counters.redis import Redis

    def process_origins(origins: Dict[bytes, bytes], counters: Redis):
        # Group origin URLs by network location (the forge host)
        origins_netloc = defaultdict(set)
        for origin_bytes in origins.values():
            origin = msgpack.loads(origin_bytes)
            parsed_url = urlparse(origin["url"])
            netloc = parsed_url.netloc
            # Google Code used one subdomain per project: merge them
            if netloc.endswith("googlecode.com"):
                netloc = "googlecode.com"
            origins_netloc[netloc].add(origin["url"])
        # One counter per netloc, fed with the origin URLs seen for it
        for k, v in origins_netloc.items():
            counters.add(k, v)
I used the following config for swh-counters:
counters:
  cls: redis
  host: localhost
journal:
  brokers:
    - kafka1.internal.softwareheritage.org
    - kafka2.internal.softwareheritage.org
    - kafka3.internal.softwareheritage.org
    - kafka4.internal.softwareheritage.org
  prefix: swh.journal.objects
  group_id: anlambert.origin_counts.dev4
  object_types:
    - origin
  batch_size: 1000
I then processed all origins from the production archive with the following command:
$ swh counters -C ~/.config/swh/counters.yml journal-client
These are the estimated counters (HyperLogLog based) obtained, sorted in descending order of number of origins:
b'github.com' 156394620
b'bitbucket.org' 2128683
b'www.npmjs.com' 1679889
b'gitlab.com' 1023330
b'googlecode.com' 790026
b'pypi.org' 325025
b'gitorious.org' 122014
b'git.code.sf.net' 115484
b'svn.code.sf.net' 62191
b'Debian' 38533
b'salsa.debian.org' 33665
b'snapshot.debian.org' 32911
b'git.launchpad.net' 19435
b'framagit.org' 18803
b'cran.r-project.org' 17899
b'hdiff.luite.com' 13843
b'gitlab.gnome.org' 9076
b'gitlab.freedesktop.org' 4755
b'gitlab.inria.fr' 3905
b'codeberg.org' 3632
b'git.savannah.gnu.org' 2970
b'git.baserock.org' 2920
b'anongit.kde.org' 2499
b'phabricator.wikimedia.org' 2236
b'code.google.com' 2230
b'git.kernel.org' 2067
b'fedorapeople.org' 1699
b'ftp.gnu.org' 1579
b'scm.gforge.inria.fr' 1234
b'gitlab.ow2.org' 1120
b'Debian-Security' 1031
b'phabricator.kde.org' 1021
b'0xacab.org' 1018
b'git.torproject.org' 1017
b'gitlab.common-lisp.net' 782
b'www.softwareheritage.org' 741
b'gitlab.riscosopen.org' 528
b'gite.lirmm.fr' 470
b'gricad-gitlab.univ-grenoble-alpes.fr' 447
b'git.alpinelinux.org' 378
b'forgemia.inra.fr' 364
b'git.fsfe.org' 352
b'code.qt.io' 325
b'git.zx2c4.com' 294
b'plmlab.math.cnrs.fr' 288
b'git.renater.fr' 274
b'sourcesup.renater.fr' 274
b'subversion.renater.fr' 229
b'scm.sourcesup.renater.fr' 228
b'git.unistra.fr' 223
b'forge.softwareheritage.org' 174
b'hg.tryton.org' 174
b'git.yoctoproject.org' 169
b'hal.archives-ouvertes.fr' 167
b'opendev.org' 162
b'gitlab.huma-num.fr' 143
b'git.php.net' 113
b'gitlab.irstea.fr' 112
b'doi.org' 98
b'gitlab.adullact.net' 82
b'git.ik.bme.hu' 78
b'git.gnu.org.ua' 77
b'git.joeyh.name' 76
b'forge.univ-lyon1.fr' 75
b'git.libreoffice.org' 69
b'git.eclipse.org' 65
b'forge.grandlyon.com' 63
b'git.sr.ht' 49
b'gitlab.u-psud.fr' 46
b'developer.blender.org' 38
b'git.agesic.gub.uy' 37
b'source.netsurf-browser.org' 37
b'gitbox.apache.org' 36
b'dci-gitlab.cines.fr' 30
b'git.openembedded.org' 30
b'gopkg.in' 30
b'notabug.org' 30
b'git-wip-us.apache.org' 28
b'gitub.u-bordeaux.fr' 27
b'dev.ch-poitiers.fr' 26
b'code.ill.fr' 24
b'gitlab.orfeo-toolbox.org' 24
b'git.sch.bme.hu' 21
b'gitlab.math.unistra.fr' 21
b'edugit.org' 20
b'gist.github.com' 20
b'repo.or.cz' 18
b'foss.heptapod.net' 17
b'gitlabjf.ccomptes.fr' 16
b'forge.frm2.tum.de' 15
b'gitlab.cern.ch' 15
b'git.archlinux.org' 14
b'git.bde-insa-lyon.fr' 14
b'git.neodarz.net' 14
b'sourceware.org' 14
b'edu-git.ac-versailles.fr' 13
b'evilpiepirate.org' 13
b'git.kpe.io' 12
b'git.savannah.nongnu.org' 12
b'gitlab.cerema.fr' 12
b'gitlab.lip6.fr' 11
b'gitlab.xiph.org' 11
b'code.briarproject.org' 10
b'git.ricketyspace.net' 10
b'git.sesse.net' 10
b'gitlab.developers.cam.ac.uk' 10
b'gitlab.fing.edu.uy' 10
b'gogs.librecmc.org' 10
b'hg.code.sf.net' 10
b'hg.libsdl.org' 10
b'software.intel.com' 10
b'android.googlesource.com' 9
b'forge.extranet.logilab.fr' 9
b'git.beta.pole-emploi.fr' 9
b'git.rockbox.org' 9
b'git.singpolyma.net' 9
b'git.unicaen.fr' 9
b'gitlab.oit.duke.edu' 9
b'hg.icculus.org' 9
b'git.ademe.fr' 8
b'git.elephly.net' 8
b'git.infradead.org' 8
b'gitlab.alpinelinux.org' 8
b'gitlab.redox-os.org' 8
b'sourceforge.net' 8
b'anongit.freedesktop.org' 7
b'gerrit.googlesource.com' 7
b'inria.halpreprod.archives-ouvertes.fr' 7
b'review.coreboot.org' 7
b'git.hadrons.org' 6
b'git.pleroma.social' 6
b'gitlab.dune-project.org' 6
b'gitlab.onelab.info' 6
b'gitweb.torproject.org' 6
b'pagure.io' 6
b'spivey.oriel.ox.ac.uk' 6
b'svn.blender.org' 6
b'www.happyassassin.net' 6
b'GitHub.com' 5
b'anonscm.debian.org' 5
b'art1pirat.spdns.org' 5
b'code.videolan.org' 5
b'ec.europa.eu' 5
b'git.blender.org' 5
b'git.enlightenment.org' 5
b'git.mfiano.net' 5
b'git.osmocom.org' 5
b'git.suckless.org' 5
b'go.googlesource.com' 5
b'hal-preprod.archives-ouvertes.fr' 5
b'invent.kde.org' 5
b'mainstream.inf.elte.hu' 5
b'secure.phabricator.com' 5
b'source.puri.sm' 5
b'svn.apache.org' 5
b'svn.linuxfromscratch.org' 5
b'www.home.marutan.net' 5
b'crux.nu' 4
b'git.linux-nfs.org' 4
b'git.netfilter.org' 4
b'git.progress-linux.org' 4
b'git.sv.gnu.org' 4
b'git.zap.org.au' 4
b'hg.logilab.org' 4
b'hg.sr.ht' 4
b'jff.email' 4
b'jugit.fz-juelich.de' 4
b'legacy.helldragon.eu' 4
b'svn.icculus.org' 4
b'www.github.com' 4
b'bzr.ed.am' 3
b'chromium.googlesource.com' 3
b'git.cbaines.net' 3
b'git.dthompson.us' 3
b'git.freebsd.org' 3
b'git.ghostscript.com' 3
b'git.gnu.io' 3
b'git.lepiller.eu' 3
b'git.linuxfromscratch.org' 3
b'git.loetlabor-jena.de' 3
b'git.osdn.net' 3
b'git.pengutronix.de' 3
b'git.pofilo.fr' 3
b'git.startinblox.com' 3
b'git.tuxfamily.org' 3
b'git.zrythm.org' 3
b'gitlab.aei.uni-hannover.de' 3
b'gitlab.in2p3.fr' 3
b'gitlab.linphone.org' 3
b'hg.osdn.net' 3
b'inqlab.net' 3
b'libregit.org' 3
b'source.winehq.org' 3
b'svn.jdownloader.org' 3
b'svn.wildfiregames.com' 3
b'trac.wildfiregames.com' 3
b'www.cs.unm.edu' 3
b'' 2
b'archive.bologna.enea.it' 2
b'atlassian@bitbucket.org' 2
b'bazaar.launchpad.net' 2
b'cgit.freedesktop.org' 2
b'g.iterate.ch' 2
b'gcc.gnu.org' 2
b'git.2f30.org' 2
b'git.busybox.net' 2
b'git.codesynthesis.com' 2
b'git.coolaj86.com' 2
b'git.dpkg.org' 2
b'git.easter-eggs.org' 2
b'git.ffmpeg.org' 2
b'git.gnome.org' 2
b'git.hcoop.net' 2
b'git.ikilote.net' 2
b'git.libssh.org' 2
b'git.maneage.org' 2
b'git.ngyro.com' 2
b'git.openprivacy.ca' 2
b'git.opensvc.com' 2
b'git.osgeo.org' 2
b'git.ring0.de' 2
b'git.rip' 2
b'git.sagemath.org' 2
b'git.samba.org' 2
b'git.synz.io' 2
b'git.systemreboot.net' 2
b'git.theobroma-systems.com' 2
b'git.videolan.org' 2
b'gitbio.ens-lyon.fr' 2
b'gitea.petton.fr' 2
b'gitlab.gwdg.de' 2
b'gitlab.haskell.org' 2
b'gitlab.inf.elte.hu' 2
b'gitlab.isc.org' 2
b'gitlab.mbb.univ-montp2.fr' 2
b'gitlab.mim-libre.fr' 2
b'gitlab.nic.cz' 2
b'gitlab.opengeosys.org' 2
b'gitlab.univ-lr.fr' 2
b'gvipers.imt-lille-douai.fr' 2
b'hal-test.archives-ouvertes.fr' 2
b'hg.mozilla.org' 2
b'hg.nginx.org' 2
b'hg.openjdk.java.net' 2
b'hg.reportlab.com' 2
b'jxself.org' 2
b'lab.louiz.org' 2
b'launchpad.net' 2
b'muddlers.org' 2
b'people.freedesktop.org' 2
b'plugins.svn.wordpress.org' 2
b'profs.scienze.univr.it' 2
b'public-inbox.org' 2
b'repo.hu' 2
b'reviews.llvm.org' 2
b'scm.adullact.net' 2
b'sr.ht' 2
b'sunshinegardens.org' 2
b'svn.savannah.gnu.org' 2
b'svn.thedarkmod.com' 2
b'taylorhakes@github.com' 2
b'tinc-vpn.org' 2
b'voidpoint.io' 2
b'www.cl.cam.ac.uk' 2
b'www.davidsharp.com' 2
b'84.38.177.154' 1
b'abcl.org' 1
b'aomedia.googlesource.com' 1
b'argouml-spl.tigris.org' 1
b'boringssl.googlesource.com' 1
b'bos.seul.org' 1
b'bunnyhero@bitbucket.org' 1
b'buttslol.net' 1
b'c9x.me' 1
b'cm-gitlab.stanford.edu' 1
b'code-repo.d4science.org' 1
b'code.9front.org' 1
b'code.divoplade.fr' 1
b'code.gab.com' 1
b'code.heb12.com' 1
b'code.launchpad.net' 1
b'code.librehq.com' 1
b'code.research.uts.edu.au' 1
b'code.reversed.top' 1
b'ctp2.darkdust.net' 1
b'depp.brause.cc' 1
b'dev.ds-servers.com' 1
b'dev.hostsharing.net' 1
b'dmitri.shuralyov.com' 1
b'dpdk.org' 1
b'dthompson.us' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'etckeeper.branchable.com' 1
b'filfox.info' 1
b'floppsie.comp.glam.ac.uk' 1
b'forge.clermont-universite.fr' 1
b'foundry.openuru.org' 1
b'framgit.org' 1
b'galexander.org' 1
b'genome-source.gi.ucsc.edu' 1
b'geopsy.org' 1
b'git-annex.branchable.com' 1
b'git-tails.immerda.ch' 1
b'git.0pointer.de' 1
b'git.alsa-project.org' 1
b'git.ardour.org' 1
b'git.assembla.com' 1
b'git.beyermatthi.as' 1
b'git.bouncycastle.org' 1
b'git.centos.org' 1
b'git.clfs.org' 1
b'git.dev.opencascade.org' 1
b'git.dgit.debian.org' 1
b'git.drobilla.net' 1
b'git.e2factory.org' 1
b'git.ebc.li' 1
b'git.embl.de' 1
b'git.foldling.org' 1
b'git.gnunet.org' 1
b'git.gnupg.org' 1
b'git.guilhem.org' 1
b'git.haiku-os.org' 1
b'git.hypra.fr' 1
b'git.ikiwiki.info' 1
b'git.imp.fu-berlin.de' 1
b'git.in-silico.ch' 1
b'git.in-ulm.de' 1
b'git.interior.edu.uy' 1
b'git.jami.net' 1
b'git.kernel.dk' 1
b'git.kyleam.com' 1
b'git.lekensteyn.nl' 1
b'git.ligo.org' 1
b'git.linaro.org' 1
b'git.liw.fi' 1
b'git.lysator.liu.se' 1
b'git.matrix.org' 1
b'git.meli.delivery' 1
b'git.minetest.land' 1
b'git.mpi-cbg.de' 1
b'git.musl-libc.org' 1
b'git.neil.brown.name' 1
b'git.net-core.org' 1
b'git.netsurf-browser.org' 1
b'git.nzoss.org.nz' 1
b'git.open-music-kontrollers.ch' 1
b'git.openldap.org' 1
b'git.openssl.org' 1
b'git.openstack.org' 1
b'git.parat.swiss' 1
b'git.plexbak.nl' 1
b'git.postgresql.org' 1
b'git.proxmox.com' 1
b'git.psyced.org' 1
b'git.pwmt.org' 1
b'git.qemu.org' 1
b'git.qsomula.top' 1
b'git.schottelius.org' 1
b'git.scilab.org' 1
b'git.sdaoden.eu' 1
b'git.sdf.org' 1
b'git.simple-cc.org' 1
b'git.spwhitton.name' 1
b'git.strongswan.org' 1
b'git.sv.nongnu.org' 1
b'git.teknik.io' 1
b'git.tiker.net' 1
b'git.toastfreeware.priv.at' 1
b'git.trustedfirmware.org' 1
b'git.tukaani.org' 1
b'git.vuxu.org' 1
b'git.wow.st' 1
b'git.xiph.org' 1
b'git.zvx8.com' 1
b'git@github.com' 1
b'gitea.eponym.info' 1
b'github.com.cnpmjs.org' 1
b'gitlab.caltech.edu' 1
b'gitlab.ccsd.cnrs.fr' 1
b'gitlab.denx.de' 1
b'gitlab.echothree.com' 1
b'gitlab.exascale-computing.eu' 1
b'gitlab.huawei.com' 1
b'gitlab.irap.omp.eu' 1
b'gitlab.labs.nic.cz' 1
b'gitlab.mister-muffin.de' 1
b'gitlab.mpi-sws.org' 1
b'gitlab.obspm.fr' 1
b'gitlab.omofumi.pl' 1
b'gitlab.petton.fr' 1
b'gitlab.rlp.net' 1
b'gitlab.savoirfairelinux.com' 1
b'gitlab.uni.lu' 1
b'gitweb.dragonflybsd.org' 1
b'gn.googlesource.com' 1
b'gnomint.git.sourceforge.net' 1
b'gnunet.org' 1
b'graphics.rwth-aachen.de:9000' 1
b'guix.gnu.org' 1
b'hg.lilotux.net' 1
b'hg.mozdev.org' 1
b'hg.savannah.gnu.org' 1
b'hih-git.neurologie.uni-tuebingen.de' 1
b'hub.darcs.net' 1
b'ikiwiki.branchable.com' 1
b'jelmer.uk' 1
b'joinup.ec.europa.eu' 1
b'juigitlab.esac.esa.int' 1
b'kernel.googlesource.com' 1
b'keysafe.branchable.com' 1
b'kylheku.com' 1
b'lab.jerasure.org' 1
b'lab.nexedi.com' 1
b'linux-libre.fsfla.org' 1
b'linuxtv.org' 1
b'llvm.org' 1
b'mbb-team.github.io' 1
b'mcabber.com' 1
b'myrepos.branchable.com' 1
b'nix-community.github.io' 1
b'nsz.repo.hu:49100' 1
b'opencircuitdesign.com' 1
b'opensource.ieee.org' 1
b'plugins.trac.wordpress.org' 1
b'pumpa.branchable.com' 1
b'r-36.net' 1
b'resources.oreilly.com' 1
b'review.haiku-os.org' 1
b'riscosopen.org' 1
b'schierlm@git.code.sf.net' 1
b'scm.osdn.net' 1
b'scribus.net' 1
b'software.legiasoft.com' 1
b'source.joinmastodon.org' 1
b'src.fedoraproject.org' 1
b'sre.ring0.de' 1
b'stoyokaramihalev.github.io' 1
b'svn.appwork.org' 1
b'svn.eby-sarna.com' 1
b'svn.filezilla-project.org' 1
b'svn.freebsd.org' 1
b'svn.kibibyte.se' 1
b'svn.osdn.net' 1
b'svn.r-project.org' 1
b'svn.savannah.nongnu.org' 1
b'svn.science.uu.nl' 1
b'svn.so-much-stuff.com' 1
b'svn.tuxfamily.org' 1
b'svn.xvid.org' 1
b'svn.zoy.org' 1
b'thelambdalab.xyz' 1
b'thingshare.ion.nu' 1
b'tildegit.org' 1
b'timorleste.github.io' 1
b'tomakehurst@github.com' 1
b'wenshao@github.com' 1
b'www.aleph1.co.uk' 1
b'www.codesrc.com' 1
b'www.fsfla.org' 1
b'www.gitlab.com' 1
b'www.ipol.im' 1
b'www.kermitproject.org' 1
b'www.kylheku.com' 1
b'www.mercurial-scm.org' 1
b'www.mitsuba-renderer.org' 1
b'www.octave.org' 1
b'www.riverbankcomputing.com' 1
b'www.riverbankcomputing.com' 1
b'www.xtideuniversalbios.org' 1
b'xenbits.xenproject.org' 1
b'xpra.org' 1
b'youpibouh.thefreecat.org' 1
b'zimbra-mirror@bitbucket.org' 1
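Since the counters above are HyperLogLog estimates rather than exact counts, it may help to see why adding the same origin twice does not inflate a counter. The following is a small, self-contained HyperLogLog written only for illustration: it is not the Redis implementation, and the class name, the SHA-1 based hashing and the precision `p=12` are assumptions made for this sketch.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny illustrative HyperLogLog estimator (not Redis' implementation)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p              # number of bits used to pick a register
        self.m = 1 << p         # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Derive a 64-bit hash from the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)              # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


hll = HyperLogLog()
for i in range(10000):
    url = f"https://example.org/project-{i}"  # made-up origin URLs
    hll.add(url)
    hll.add(url)  # duplicates leave the registers unchanged
print(hll.count())  # close to 10000, not exact
```

The estimate stays within a few percent of the true cardinality while using a fixed number of registers per counter, which is what makes one counter per forge affordable.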
It should be easy to map a lister instance name to one of these counters and produce an automatic display of this data in the webapp. Where there are ambiguities, we can still provide a manual mapping for the edge cases. We can also get the number of origins not linked to a lister in production (gitorious, googlecode, ...).
Any more thoughts?
- Maintainer
Great work! Awesome.
Any more thoughts? It should be easy to map a lister instance name to one of these counters and produce an automatic display of this data in the webapp.
Regarding this, to ease the mapping between a lister and an instance name, we may want to rework the instance names in the scheduler model (listers table) so that the value is actually the netloc of the origin. Re-attaching an origin to a lister will then be a simple matter of checking the netloc (apart from a few exceptions like googlecode, which you already handle in the snippet above).
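A minimal sketch of that lookup, assuming the listers table has been reworked so that `instance_name` holds the netloc. The `LISTERS` mapping, the `SUFFIX_EXCEPTIONS` tuple and the function name are hypothetical; the suffix check is slightly stricter than the `endswith("googlecode.com")` test in the snippet above, since it requires a dot before the suffix.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical listers table content: netloc (instance name) -> lister name.
LISTERS: Dict[str, str] = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "gitlab.inria.fr": "gitlab",
    "codeberg.org": "gitea",
}

# Manual exceptions: any subdomain of these suffixes is folded into the suffix.
SUFFIX_EXCEPTIONS = ("googlecode.com",)


def lister_for_origin(url: str) -> Tuple[Optional[str], str]:
    """Return (lister name or None, netloc used as counter key)."""
    netloc = urlparse(url).netloc
    for suffix in SUFFIX_EXCEPTIONS:
        if netloc == suffix or netloc.endswith("." + suffix):
            netloc = suffix
    # None means the origin is not attached to any known lister instance.
    return LISTERS.get(netloc), netloc


print(lister_for_origin("https://gitlab.inria.fr/some/project"))
print(lister_for_origin("http://someproject.googlecode.com/svn/"))
```

Origins whose lister comes back as `None` would correspond to the ones not linked to a lister in production, or to save code now submissions.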
My 2 cents.
- Maintainer
Regarding this, to ease the mapping between a lister and an instance name, we may want to rework the instance names in the scheduler model (listers table) so that the value is actually the netloc of the origin.
That would be awesome and would greatly simplify the code to write! +1
- Author Maintainer
Nice to see this moving forward!
These entries in the counter log look suspicious, though, they are not origins:
b'atlassian@bitbucket.org' 2
b'taylorhakes@github.com' 2
b'bunnyhero@bitbucket.org' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'git@github.com' 1
b'schierlm@git.code.sf.net' 1
b'tomakehurst@github.com' 1
b'wenshao@github.com' 1
b'zimbra-mirror@bitbucket.org' 1
- Maintainer
! In #3127 (closed), @rdicosmo wrote: Nice to see this moving forward!
These entries in the counter log look suspicious, though, they are not origins:
b'atlassian@bitbucket.org' 2
b'taylorhakes@github.com' 2
b'bunnyhero@bitbucket.org' 1
b'dtrebbien@bitbucket.org' 1
b'eldargab@github.com' 1
b'git@github.com' 1
b'schierlm@git.code.sf.net' 1
b'tomakehurst@github.com' 1
b'wenshao@github.com' 1
b'zimbra-mirror@bitbucket.org' 1
Those correspond to origins submitted through save code now requests; all of them ended up not found, of course.
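Keys such as `git@github.com` appear because `urlparse(...).netloc` keeps any `user@` prefix and `:port` suffix found in the submitted URL. If merging those into the plain host counter is wanted, one option is to key on `hostname` instead, which also lowercases the host. This is only a sketch: the helper name `forge_key` and the sample URLs are made up, and whether ports and case should be folded is a design choice to discuss.

```python
from urllib.parse import urlparse


def forge_key(url: str) -> str:
    """Return a normalized host name usable as a counter key."""
    # .hostname drops "user@" and ":port" and lowercases the host;
    # it is None when the URL has no network location.
    return urlparse(url).hostname or ""


# Made-up URLs shaped like the suspicious entries above.
for url in (
    "https://git@github.com/someone/project",
    "https://GitHub.com/someone/project",
    "http://nsz.repo.hu:49100/repo",
):
    print(urlparse(url).netloc, "->", forge_key(url))
```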
- Phabricator Migration user mentioned in commit swh-scheduler@7f51f274
- Phabricator Migration user mentioned in commit swh-counters@cd595e71
- Phabricator Migration user marked this issue as related to swh/infra/sysadm-environment#3402 (closed)
- Maintainer
For information: while discussing with @olasd, he reminded me that we already have a CLI entry point [1] to compute stats about what we want on the scheduler side.
What's missing implementation-wise would be to expose an endpoint to actually display said information.
So the question is: even though the swh.counter implementation has started, do we really want it there, or would this ^ on the scheduler side be enough?
- Maintainer
Sorry @anlambert, I was late at Monday's meeting and I completely missed this in your weekly plan, I would have pointed this out earlier.
The existing scheduler metrics are probably not complete enough for all we want to display (we should review them so they are), but the swh.scheduler journal client already gathers all the information needed, so we should be able to compute all that we need from the scheduler tables.
The main pain point is that we do have a bunch of origins for which we don't have a listed_origins entry (because they've been archived before the current lister version was deployed, then disappeared, or because we've never listed them in the first place, e.g. for save code now origins).
I think we should be able to "backfill" these known but disabled origins in the listed_origins tables, once (setting them as enabled=false so they don't clutter the scheduling).
- Maintainer
! In #3127 (closed), @ardumont wrote: @anlambert @rdicosmo
For information: while discussing with @olasd, he reminded me that we already have a CLI entry point [1] to compute stats about what we want on the scheduler side.
What's missing implementation-wise would be to expose an endpoint to actually display said information.
So the question is: even though the swh.counter implementation has started, do we really want it there, or would this ^ on the scheduler side be enough?
For the archive coverage widget on the webapp homepage, I think we should only display the number of origins that were processed by a loader, to reflect the current number of archived projects. Besides, some origins like gitorious or googlecode do not have lister metrics in the scheduler database.
For the lister metrics extracted from the scheduler, we could add a different widget displaying those after adding a new endpoint to the scheduler interface to easily get the data.
- Maintainer
After more thought about all those metrics, we could revamp the coverage widget into two tabs:
- one tab displaying metrics about loaded origins with detailed counts by forge and links to search interface to browse them
- one tab displaying metrics about listed origins from the data extracted from the scheduler database
The idea is to show what we have archived so far and what is planned to be archived or saved again.