swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add load-nixguix url=http://guix.gnu.org/sources.json
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks
Task 1224855
  Next run: just now (2020-07-06 14:03:59+00:00)
  Interval: 1 day, 0:00:00
  Type: load-nixguix
  Policy: recurring
  Args:
  Keyword args:
    url: 'http://guix.gnu.org/sources.json'
From IRC ping:
15:39 <zimoun> lewo: Hi, the sources.json for Guix is served at http://guix.gnu.org/sources.json so could you add to your staging for testing?
15:40 <+olasd> nice!
15:40 <+olasd> ardumont: ^
15:41 <lewo> zimoun: yeah! Cool ;)
15:42 <+ardumont> ack on this, nice!
15:42 <+ardumont> lemme finish my stuff first ;)
What you are missing is that the archive is production.
I only ran the loader on the Guix sources in staging to expose problems, if any (well, there were some, and they are fixed now ;)
Note: status uneventful with a different snapshot is kinda unexpected for me. Not something drastically problematic though. I'll dig in at some point.
@ardumont: did you load the same sources.json? Because http://guix.gnu.org/sources.json is refreshed every X hours, and some stats on the commits after 2018-12-05 (v0.16.0) give a mean of 21 and a median of 13, both per day. And since loading requires ~1h15min, you need some luck to read the same json file twice.
Speaking about stats, the first test took 31961.72341875988s, which is ~9h, and the second 4497.450984489056s, which is ~1h15min. The difference comes from the fact that the 2nd time, the loader skips a lot of sources because they are already ingested, right? However, between the 2 tests, only a few sources have changed, so I am a bit doubtful that it takes so "long". Does it mean that this ~1h+ is mostly spent trying to download non-.{tar,gz,bz2} files as .gem files, which obviously fails, like @lewo has explained?
yes.
The scheduled origin to load is the sources.json url you gave me.
So that's what the loader is regularly passing on (as expected, that file changes over time ;).
Because http://guix.gnu.org/sources.json is refreshed every X hours, and some stats on the commits after 2018-12-05 (v0.16.0) give a mean of 21 and a median of 13, both per day.
And since loading requires ~1h15min, you need some luck to read the same json file twice.
ok.
Speaking about stats, the first test took 31961.72341875988s, which is ~9h, and the second 4497.450984489056s, which is ~1h15min. The difference comes from the fact that the 2nd time, the loader skips a lot of sources because they are already ingested, right?
Yes.
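For reference, the two durations quoted above can be converted with a few lines of Python. This is only a sanity check on the arithmetic, not output from the loader; the helper name `to_hms` is made up for illustration.

```python
def to_hms(seconds):
    """Split a duration in seconds into whole (hours, minutes, seconds)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, secs


# Durations reported for the two staging runs.
first_run = 31961.72341875988
second_run = 4497.450984489056

for label, duration in (("first", first_run), ("second", second_run)):
    h, m, s = to_hms(duration)
    print(f"{label} run: {h}h{m:02d}m{s:02d}s")
# first run: 8h52m41s
# second run: 1h14m57s

# How many times faster the second run was than the first.
ratio = first_run / second_run
print(f"ratio: {ratio:.1f}x")
```

So the first run was slightly under 9h, and the second run was about 7 times faster.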
However, between the 2 tests, only a few sources have changed, so I am a bit doubtful that it takes so "long". Does it mean that this ~1h+ is mostly spent trying to download non-.{tar,gz,bz2} files as .gem files, which obviously fails, like @lewo has explained?
Most probably, yes.
There is no filtering on a per extension basis within the loader itself.
As mentioned in swh/devel/swh-loader-core#1991 (closed), there is still room for improvements ;)
loader-core 0.9.0, which includes the swh/devel/swh-loader-core#2510 (closed) improvement, got deployed on staging to see whether it improves time/performance.
(both run for guix and nix sources)
I'll check when I get back (heading for some rest now ;)
Seems to have reduced the cost (from ~4500s to ~1500s), but there might still be margin for improvements [1]
For one, the extensions to skip were not finely analyzed (off the top of my head, we could add the ".el" extension to filter out, for example).
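To illustrate the idea, here is a minimal sketch of per-extension filtering on source URLs. It is not the loader's actual code: the names `SKIP_EXTENSIONS` and `should_skip` are hypothetical, the example URLs are placeholders, and only ".el" comes from the comment above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical skip list: only ".el" is taken from the discussion;
# a real list would need the finer analysis mentioned above.
SKIP_EXTENSIONS = frozenset({".el"})


def should_skip(url, skip_extensions=SKIP_EXTENSIONS):
    """Return True if the URL path ends with an extension to filter out."""
    path = urlparse(url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in skip_extensions


# Placeholder URLs, for illustration only.
urls = [
    "https://example.org/pkg/foo-1.0.tar.gz",
    "https://example.org/elisp/bar.el",
]
to_load = [u for u in urls if not should_skip(u)]
```

Matching on the last suffix of the URL path keeps the check cheap, since nothing is downloaded before deciding. It cannot classify URLs that carry no extension at all.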