It's apparently [1] spending much of its time flushing new objects to the storage,
so a bump in the buffer configuration should help [2].
A lot of time is also spent in the to_model/from_dict [1] calls.
As most of the CPUs are fully used [3], bumping the machine's CPU count to
something higher should help.
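For reference, the buffer bump would be a change along these lines in the storage pipeline configuration (a sketch only: the exact keys and thresholds depend on the deployed swh.storage version, and the values below are illustrative, not the ones applied):

```yaml
storage:
  cls: buffer            # buffering proxy in front of the real storage
  min_batch_size:        # flush only once this many objects accumulate
    content: 10000
    directory: 25000
    revision: 100000
  storage:
    cls: remote          # the actual backend behind the buffer
    url: http://storage:5002/
```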
10:05:31 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-09 08:05:31.90589+00 | 266803 |
+------------------------------+--------+
(1 row)

Time: 173117.724 ms (02:53.118)

16:45:59 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-14 14:51:18.684494+00 | 271282 |
+-------------------------------+--------+
(1 row)

Time: 175008.966 ms (02:55.009)
Another run on a large repository (which cannot finish; the error is independent
though) [1]
Without the patches now deployed in production, each failing run would have taken around
the same amount of time. From now on, the first round takes a long time whether it
finishes or not, but since extids and revisions are stored along the way, the next
ingestion round is faster: it picks up the newly stored extid mappings.
|----------------------+--------------------|
| First round          | 2nd                |
|----------------------+--------------------|
| real 1030m32.795s    | real 12m35.640s    |
|----------------------+--------------------|
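The speedup mechanism above can be sketched as a lookup against the stored extid mappings; the helper name and the plain dict standing in for the archive's extid table are hypothetical, not the swh.storage API:

```python
# Why the second round is faster: nodes whose extid mapping was stored
# during a previous (even failed) round do not need to be re-ingested.
def filter_new_nodes(extid_map, hg_node_ids):
    """Return only the hg node ids with no stored extid mapping."""
    return [n for n in hg_node_ids if n not in extid_map]

# First round: nothing mapped yet, everything is work.
first = filter_new_nodes({}, ["a1", "b2", "c3"])

# Second round: mappings stored along the way are lifted first,
# so only the remainder is ingested.
stored = {"a1": "rev-1", "b2": "rev-2"}
second = filter_new_nodes(stored, ["a1", "b2", "c3"])
```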
Restarted the loader_oneshot, which now uses the latest v2.2.0 mercurial loader.
Datapoint:
12:24:21 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-17 10:24:22.953431+00 | 273002 |
+-------------------------------+--------+
(1 row)

Time: 217785.472 ms (03:37.785)
Now most of the time is spent reading the extid -> hg-node-id mapping [1] to filter out what we have already seen.
This does not change much for visits that already ended up in a snapshot.
However it changes a lot for visits on forks, where work already done on the forked repository can be bypassed.
Note that some from_dict calls on the mappings still happen; those come from reading from the storage.
As the filtering is still happening client side, I gather that applying the last patch, which
moves the filtering server side, would drop some more spurious work.
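The client-side vs server-side distinction can be sketched as follows; the helper names and the list standing in for the archive's extid table are hypothetical, not the swh.storage API. Both variants return the same known ids, but the server-side one transfers only the matching rows:

```python
# A toy extid table of (target type, external node id) pairs.
EXTID_TABLE = [("hg", "a1"), ("hg", "b2"), ("git", "zz")]

def known_client_side(node_ids):
    # Fetch everything, then filter locally:
    # every row crosses the wire before being discarded.
    fetched = list(EXTID_TABLE)
    wanted = set(node_ids)
    known = [n for (t, n) in fetched if t == "hg" and n in wanted]
    return known, len(fetched)  # (known ids, rows transferred)

def known_server_side(node_ids):
    # The server applies the predicate; only matching rows come back.
    wanted = set(node_ids)
    sent = [n for (t, n) in EXTID_TABLE if t == "hg" and n in wanted]
    return sent, len(sent)
```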
10:02:40 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+------------------------------+--------+
|             now              | count  |
+------------------------------+--------+
| 2021-09-20 10:04:30.89072+00 | 280995 |
+------------------------------+--------+
(1 row)

Time: 223465.755 ms (03:43.466)
fwiw, the rabbitmq queue is at 237683 messages (it was around 280k on Friday, prior to v2.2, so the ingestion of those went way faster).
Queue depth might be a better progress heuristic than counting origins (as some hg origins already existed prior to actually ingesting the bitbucket backup...).
I've patched the systemd swh-worker@loader_oneshot unit to pass --autoscale 10,20
on the celery cli. It's holding up fine, and coupled with the server-side
filtering it makes for a huge bump in speed. The archive db does not seem to mind at all.
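For reference, such a change is typically a systemd drop-in overriding the worker's command line. A sketch only: the real unit file, paths, and celery invocation on the workers may differ, and `<worker app>` is a placeholder:

```ini
# /etc/systemd/system/swh-worker@loader_oneshot.service.d/override.conf
# Hypothetical drop-in: clear the inherited ExecStart, then redefine it
# with the --autoscale flag from the comment above appended.
[Service]
ExecStart=
ExecStart=/usr/bin/celery --app=<worker app> worker --autoscale 10,20
```

After adding the drop-in, `systemctl daemon-reload` and a restart of the unit pick up the new command line.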
18:02:11 softwareheritage@belvedere:5432=> select now(), count(distinct url) from origin o inner join origin_visit ov on o.id=ov.origin where o.url like 'https://bitbucket.org/%' and ov.type='hg';
+-------------------------------+--------+
|              now              | count  |
+-------------------------------+--------+
| 2021-09-20 16:02:13.517665+00 | 282491 |
+-------------------------------+--------+
(1 row)

Time: 176714.410 ms (02:56.714)
Following the --autoscale 10,20 change above, saam's NFS server had a bit of a hard
time with so many reads (the backup is read through NFS) and writes (objstorage).
So I decreased the concurrency back to 10 as before.
Anyway, it's now as fast as it can be.
So closing this now.