Currently the next-gen scheduler does not allow limiting the number of tasks per forge.
So a plan forward would be to allow the listing but prevent the origins from getting
scheduled for ingestion by the actual scheduler cogs running. Then, trigger the
ingestion "manually" [1] with dedicated worker(s) which would consume specifically from
sourceforge, respecting the limits set in the description.
[1] i.e. some script reading from the scheduler and sending the git, svn, and hg origins
for ingestion into a queue that those worker(s) would consume from; see the sketch below.
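For illustration, a minimal sketch of such a producer script, assuming direct read
access to the scheduler database; the dsn, broker, queue name and task names are
illustrative, not the deployed configuration:

# Hypothetical producer: read the SourceForge origins from the scheduler db
# and enqueue one loader task per origin on a dedicated queue.
import celery
import psycopg2

LOADER_TASKS = {  # one loader task per visit type (illustrative names)
    "git": "swh.loader.git.tasks.UpdateGitRepository",
    "svn": "swh.loader.svn.tasks.LoadSvnRepository",
    "hg": "swh.loader.mercurial.tasks.LoadMercurial",
}

app = celery.Celery(broker="amqp://rabbitmq.example.org//")  # assumed broker

db = psycopg2.connect("service=swh-scheduler")  # assumed dsn
with db.cursor() as cur:
    cur.execute(
        """select lo.url, lo.visit_type from listed_origins lo
           inner join listers l on l.id = lo.lister_id
           where l.name = 'sourceforge'
             and lo.visit_type in ('git', 'svn', 'hg')"""
    )
    for url, visit_type in cur:
        app.send_task(
            LOADER_TASKS[visit_type],
            kwargs={"url": url},
            queue="oneshot:sourceforge",  # hypothetical dedicated queue
        )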
Another idea would be to add the SourceForge origins with enabled=false so they're not picked up by the scheduler, until we've done the first pass on them. This avoids needing to change the scheduler at all.
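If we go that route, disabling them is a one-statement flip in the scheduler db; a
sketch, assuming the listed_origins.enabled flag and the staging lister id quoted
further down:

# Sketch: mark the SourceForge listed origins as disabled so the regular
# scheduler skips them until the first pass on them is done.
import psycopg2

db = psycopg2.connect("service=swh-scheduler")  # assumed dsn
with db, db.cursor() as cur:  # the connection context manager commits for us
    cur.execute(
        "update listed_origins set enabled = false where lister_id = %s",
        ("4b19e941-5e25-4cb0-b55d-ae421d983e2f",),
    )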
A dedicated worker17 node got provisioned to make the first run on the sourceforge
origins (svn and git for now). There remains some code to actually schedule the origins
we are interested in, and some plumbing to actually consume those messages while
respecting the concurrency defined in the description (see the sketch below).
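On the consumption side, that plumbing could boil down to a dedicated celery worker
bound to the queue with a bounded pool; a sketch, with hypothetical broker and queue
names:

# Sketch of the dedicated consumer: one celery worker that only consumes the
# hypothetical "oneshot:sourceforge" queue, with a bounded pool so at most
# worker_concurrency loads run at once on worker17.
import celery
from kombu import Queue

app = celery.Celery(broker="amqp://rabbitmq.example.org//")  # assumed broker
app.conf.task_queues = [Queue("oneshot:sourceforge")]  # consume only this queue
app.conf.worker_concurrency = 4  # the per-forge limit we want to enforce

# roughly equivalent CLI override when starting the worker:
#   celery worker -Q oneshot:sourceforge --concurrency=4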
As for the actual listing, its first pass is done [1] [2]:
[1] scheduler info:
softwareheritage-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge';
 count
--------
 338281
(1 row)
[2] worker logs:
Jun 02 07:15:45 worker11 python3[1407482]: [2021-06-02 07:15:45,705: INFO/ForkPoolWorker-4] Task swh.lister.sourceforge.tasks.FullSourceForgeLister[e29c07ff-b01f-4739-a820-1d326e76ad63] succeeded in 83421.67006923165s: {'pages': 258898, 'origins': 338327}
Note: there is a small number discrepancy (46 fewer in the db) but the order of
magnitude is roughly the same, so I guess it's fine.
A small issue was found; @Alphare fixed it ^.
A notification about the ingestion starting soon was sent to the sourceforge people.
In the meantime, the existing dataset from the first listing got adapted according to
the fix (staging & prod scheduler dbs) [1] [2], and the fix got deployed (staging/prod
workers restarted).
Now on to actually deploying the dedicated loader and adapting some code to schedule
the origins correctly for the proper ingestion scheme.
[1] staging
swh-scheduler=> update listed_origins set url='https://' || url where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f';
UPDATE 338223
swh-scheduler=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' limit 10;
                          url
------------------------------------------------------
 https://git.code.sf.net/p/new-1/code
 https://git.code.sf.net/p/root2raj-test/code
 https://git.code.sf.net/p/kernel-whyred/code
 https://git.code.sf.net/p/youtuber/code
 https://git.code.sf.net/p/surnubs/code
 https://git.code.sf.net/p/podcastliam/code
 https://git.code.sf.net/p/centos-repos/code
 https://git.code.sf.net/p/psnidck/git
 https://hg.code.sf.net/p/psnidck/mercurial
 https://git.code.sf.net/p/library-software-free/code
(10 rows)
swh-scheduler=> commit;
COMMIT
[2] prod
softwareheritage-scheduler=> update listed_origins set url='https://' || url where lister_id='b678cfc3-2780-4186-9186-d78a14bd4958';
UPDATE 338281
softwareheritage-scheduler=> select url from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' limit 10;
                      url
-------------------------------------------------
 https://bzr.code.sf.net/p/abandonedlands/code
 https://bzr.code.sf.net/p/adchppgui/code
 https://bzr.code.sf.net/p/admos/bazaar
 https://bzr.code.sf.net/p/afros-update/bazaar
 https://bzr.code.sf.net/p/alternityshadow/code
 https://bzr.code.sf.net/p/amyunix2/bazaar
 https://bzr.code.sf.net/p/anamnesis/code
 https://bzr.code.sf.net/p/anubisstegano/code
 https://bzr.code.sf.net/p/apreta/code
 https://bzr.code.sf.net/p/arabicontology/bazaar
Migration of the mercurial origins dataset from https to http (scheduler prod/staging)
is in progress [1].
Incremental lister deployed
[1] staging (roughly the same was done for production)
swh-scheduler=> update listed_origins set url=replace(url, 'https://', 'http://') where lister_id='4b19e941-5e25-4cb0-b55d-ae421d983e2f' and url like 'https://%' and visit_type='hg';
UPDATE 27489
Time: 994.239 ms
swh-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'https://%';
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)
Time: 81.923 ms
swh-scheduler=> select count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' and lo.visit_type='hg' and lo.url like 'http://%';
+-------+
| count |
+-------+
| 27489 |
+-------+
(1 row)
Time: 69.252 ms
[2] The ingestion started with a concurrency of only 4 and only for the git origins.
The next day, concurrency got bumped to 6 and the svn origins got thrown into the mix...
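Such a bump can be done without restarting the worker, e.g. via celery's remote
control; a sketch, assuming remote control is enabled and a hypothetical node name:

# Sketch: grow the running dedicated worker's pool by 2 (4 -> 6) at runtime.
import celery

app = celery.Celery(broker="amqp://rabbitmq.example.org//")  # assumed broker
app.control.pool_grow(2, destination=["celery@worker17"])  # hypothetical node name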
[3]
softwareheritage-scheduler=> select visit_type, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' group by visit_type;
+------------+--------+
| visit_type | count  |
+------------+--------+
| svn        | 101624 |
| hg         |  27497 |
| git        | 180319 |
| cvs        |  28622 |
| bzr        |    290 |
+------------+--------+
(5 rows)
Time: 7032.441 ms (00:07.032)
softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:13:28.030285+00 | 73540 |
+-------------------------------+-------+
(1 row)
Time: 83736.834 ms (01:23.737)
softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-04 15:15:13.416308+00 |  2020 |
+-------------------------------+-------+
(1 row)
Time: 12233.012 ms (00:12.233)
Still running; both svn and git origins are ingested regularly.
We are up to 96k origins done now (out of ~280k svn and git origins) [1].
The worker17 node got reworked a bit to use tmpfs [2] and got a RAM increase (from 32
to 64G).
[1]
softwareheritage=> select now(), count(*) from origin where url like 'https://%.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 07:33:34.084783+00 | 96794 |
+-------------------------------+-------+
(1 row)
Time: 64354.297 ms (01:04.354)
softwareheritage=> select now(), count(*) from origin where url like 'https://svn.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:12:22.683031+00 | 15274 |
+-------------------------------+-------+
(1 row)
Time: 68655.542 ms (01:08.656)
softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf.net%';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-08 09:11:46.340185+00 | 81522 |
+-------------------------------+-------+
(1 row)
Time: 105006.727 ms (01:45.007)
[2] The disk I/O pattern is quite aggressive due to the svn loader implementation. That change was well received by the machine ;) as can be seen in the following graph [3].