Test gitea lister on staging environment
The gitea lister should be configured on the staging environment and tested with a task to list the codeberg.org forge.
(!) The task limit has to be increased to improve the listing speed (#2313 (closed))
Migrated from T2577 (view on Phabricator)
Activity
- Vincent Sellier mentioned in merge request !155 (closed)
- Vincent Sellier mentioned in merge request !156 (closed)
- Vincent Sellier mentioned in merge request !157 (closed)
- Phabricator Migration user marked this issue as related to #2313 (closed)
- Vincent Sellier added Lister label
- Vincent Sellier assigned to @vsellier
- Vincent Sellier added state:wip label
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@b51b0ceb
- Antoine R. Dumont added priority:Normal label
- Antoine R. Dumont changed the description
- Author Maintainer
- task-type registered :
swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task-type register -p lister.gitea
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
INFO:swh.scheduler.cli.task_type:Loading entrypoint for plugin lister.gitea
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-full in scheduler
INFO:swh.scheduler.cli.task_type:Create task type list-gitea-incremental in scheduler
- Author Maintainer
- The data model doesn't need to be created because it was already done in swh/infra/sysadm-environment#2358 (closed)
- The task is created :
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1263805
  Next run: just now (2020-09-09 07:25:40+00:00)
  Interval: 90 days, 0:00:00
  Type: list-gitea-full
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
swh-scheduler=# select * from task where type like '%gitea%';
-[ RECORD 1 ]----+------------------------------------------------------------------------------
id               | 1263805
type             | list-gitea-full
arguments        | {"args": [], "kwargs": {"url": "https://codeberg.org/api/v1/", "limit": 100}}
next_run         | 2020-09-09 07:25:40.025668+00
current_interval | 90 days
status           | next_run_scheduled
policy           | oneshot
retries_left     | 0
priority         |
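The `arguments` column above stores the `key=value` tokens from the command line as a JSON document with `args` and `kwargs`. As a rough illustration of that mapping only, here is a minimal sketch; `parse_task_args` and `_coerce` are hypothetical helpers written for this note and are not the parser used by `swh scheduler`, whose value coercion rules may differ.

```python
import json
from typing import Any, Dict, List


def _coerce(raw: str) -> Any:
    """Turn a numeric-looking token into an int, otherwise keep the string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def parse_task_args(tokens: List[str]) -> Dict[str, Any]:
    """Split CLI tokens into positional args and keyword args (hypothetical)."""
    args: List[Any] = []
    kwargs: Dict[str, Any] = {}
    for token in tokens:
        # Split on the first "=" only, so URLs containing "=" stay intact.
        key, sep, raw = token.partition("=")
        if sep:
            kwargs[key] = _coerce(raw)
        else:
            args.append(_coerce(token))
    return {"args": args, "kwargs": kwargs}


if __name__ == "__main__":
    parsed = parse_task_args(["url=https://codeberg.org/api/v1/", "limit=100"])
    print(json.dumps(parsed))
```

With the two tokens used in the task above, the sketch produces the same `args`/`kwargs` structure as the one stored in the `task` row.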
I'm just waiting for the validation of swh/infra/puppet/puppet-swh-site!221 (closed) to activate the tasks.
- Author Maintainer
For info, on my desktop with the docker environment, with a limit of 100, the lister takes about 3.7s to list the complete codeberg forge :
swh-lister_1 | [2020-09-08 18:33:19,259: INFO/ForkPoolWorker-1] Task swh.lister.gitea.tasks.RangeGiteaLister[363e0b30-b13a-4f62-bd31-9847dfe62450] succeeded in 3.7196799100056523s: {'status': 'eventful'}
There are 3508 repositories detected.
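To give a feel for why the `limit` value matters for listing speed, the sketch below computes how many paginated requests are needed for 3508 repositories at a few page sizes. `pages_needed` is a hypothetical helper, and it assumes one API request per page and that the forge honours the requested limit.

```python
def pages_needed(total_repos: int, limit: int) -> int:
    """Number of paginated requests needed to list total_repos repositories."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return (total_repos + limit - 1) // limit


if __name__ == "__main__":
    for limit in (10, 50, 100):
        print(f"limit={limit}: {pages_needed(3508, limit)} requests")
```

Under these assumptions, a limit of 100 needs 36 requests where a limit of 10 would need 351.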
- Author Maintainer
The configuration is deployed and the listers were restarted.
The initial listing failed due to a concurrency problem. The problem is logged in sentry here : https://sentry.softwareheritage.org/share/issue/aec9c2af347e47ea84f51ace3bfe2f25/
It looks similar to #2070 (closed)
- Author Maintainer
I tried to create a list-gitea-incremental task, but it fails too, this time with a different exception related to an unexpected "sort" parameter : https://sentry.softwareheritage.org/share/issue/b0119b56f24347bcb58ac28c68685c62/
swhscheduler@scheduler0:/etc/softwareheritage/backend$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
WARNING:swh.core.cli:Could not load subcommand storage: No module named 'swh.journal'
INFO:swh.core.config:Loading config file /etc/softwareheritage/scheduler.yml
Created 1 tasks

Task 1267302
  Next run: just now (2020-09-09 09:40:12+00:00)
  Interval: 1 day, 0:00:00
  Type: list-gitea-incremental
  Policy: oneshot
  Args:
  Keyword args:
    limit: 100
    url: 'https://codeberg.org/api/v1/'
- Author Maintainer
The concurrency issue was reproduced locally on the docker environment with a concurrency of 5.
It seems the pages are listed several times during the job execution :
swh-lister_1 | [2020-09-09 14:04:05,742: INFO/ForkPoolWorker-4] listing repos starting at 10
swh-lister_1 | [2020-09-09 14:04:06,052: INFO/ForkPoolWorker-4] listing repos starting at 11
swh-lister_1 | [2020-09-09 14:04:13,819: INFO/ForkPoolWorker-3] listing repos starting at 10
...
swh-lister_1 | [2020-09-09 14:04:05,621: INFO/ForkPoolWorker-1] listing repos starting at 30
swh-lister_1 | [2020-09-09 14:04:05,970: INFO/ForkPoolWorker-1] listing repos starting at 31
swh-lister_1 | [2020-09-09 14:04:10,282: INFO/ForkPoolWorker-2] listing repos starting at 30
swh-lister_1 | [2020-09-09 14:04:10,949: ERROR/ForkPoolWorker-2] Task swh.lister.gitea.tasks.RangeGiteaLister[f25fb95c-fbf3-4ee6-9072-f4029d2d04c1] raised unexpected: IntegrityError('(psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "gitea_repo_pkey"\nDETAIL: Key (uid)=(3567) already exists.\n')
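In the log above, pages 10 and 30 are each listed by two different workers, which is what one would expect if the page ranges handed to the range tasks overlap. The following is a minimal sketch of a split into contiguous, non-overlapping ranges. `split_pages` is a hypothetical helper used only for illustration; it is not the code of the lister, and the actual fix may work differently.

```python
from typing import List, Tuple


def split_pages(first_page: int, last_page: int, workers: int) -> List[Tuple[int, int]]:
    """Split [first_page, last_page] into at most `workers` inclusive ranges.

    The ranges are contiguous and never share a page, so two workers
    cannot insert the same repository twice.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    total = last_page - first_page + 1
    if total <= 0:
        return []
    size, extra = divmod(total, workers)
    ranges: List[Tuple[int, int]] = []
    start = first_page
    for i in range(workers):
        # The first `extra` ranges take one additional page each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break
        ranges.append((start, start + length - 1))
        start += length
    return ranges


if __name__ == "__main__":
    # 36 pages (3508 repositories at 100 per page) over 5 workers.
    print(split_pages(1, 36, 5))
```

Each page number appears in exactly one range, so a `uid` would be inserted by a single task only, assuming the forge content does not shift between pages during the run.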
- Phabricator Migration user mentioned in commit e3c856b5
- Phabricator Migration user mentioned in commit 31efda62
- Author Maintainer
The test of the new version v0.1.4, which includes the fix for the range split, the uid change and the incremental task fix, is OK.
Deployment :
- Database cleanup on db0:
swh-lister=# drop table gitea_repo;
DROP TABLE
- Update of the loaders and restart them, from pergamon :
root@pergamon:~# clush -b -w @staging-workers 'apt-get update; apt install -y python3-swh.lister'
...
root@pergamon:~# clush -b -w @staging-workers 'dpkg -l python3-swh.lister'
---------------
worker[0-2].internal.staging.swh.network (3)
---------------
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name               Version              Architecture Description
+++-==================-====================-============-=================================================================
ii  python3-swh.lister 0.1.4-1~swh1~bpo10+1 all          Software Heritage Listers (bitbucket, git(lab|hub), pypi, etc...)
# restart
root@pergamon:~# clush -w @swh-workers -b systemctl restart swh-worker@loader_svn
- Database model upgrade, from scheduler0 :
root@scheduler0:~# apt-get update && apt install python3-swh.lister
...
Unpacking python3-swh.lister (0.1.4-1~swh1~bpo10+1) over (0.1.2-1~swh1~bpo10+1) ...
Setting up python3-swh.lister (0.1.4-1~swh1~bpo10+1)
swhscheduler@scheduler0:~$ swh lister --db-url postgresql://swh-lister:*****@db0.internal.staging.swh.network:5432/swh-lister db-init
- check on db0 :
swh-lister=# \d gitea_repo
                       Table "public.gitea_repo"
   Column    |            Type             | Collation | Nullable | Default
-------------+-----------------------------+-----------+----------+---------
 name        | character varying           |           |          |
 full_name   | character varying           |           |          |
 html_url    | character varying           |           |          |
 origin_url  | character varying           |           |          |
 origin_type | character varying           |           |          |
 last_seen   | timestamp without time zone |           | not null |
 task_id     | integer                     |           |          |
 uid         | character varying           |           | not null |
 instance    | character varying           |           |          |
Indexes:
    "gitea_repo_pkey" PRIMARY KEY, btree (uid)
    "ix_gitea_repo_full_name" btree (full_name)
    "ix_gitea_repo_instance" btree (instance)
    "ix_gitea_repo_name" btree (name)
- scheduling of a new full import task, from scheduler0 :
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-full url=https://codeberg.org/api/v1/ limit=100
- all the repos are correctly imported without errors
swh-lister=# select count(*) from gitea_repo;
 count
-------
  3506
(1 row)
- test of the incremental task :
swhscheduler@scheduler0:~$ swh scheduler --config-file /etc/softwareheritage/scheduler.yml task add --policy oneshot list-gitea-incremental url=https://codeberg.org/api/v1/ limit=100
Sep 10 10:22:16 worker1 python3[273967]: [2020-09-10 10:22:16,897: INFO/MainProcess] Received task: swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584]
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,086: INFO/ForkPoolWorker-4] listing repos starting at 1
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,315: INFO/ForkPoolWorker-4] Repositories already seen, stopping
Sep 10 10:22:17 worker1 python3[273977]: [2020-09-10 10:22:17,320: INFO/ForkPoolWorker-4] Task swh.lister.gitea.tasks.IncrementalGiteaLister[023de13b-77a7-4ea3-b768-1600d20d4584] succeeded in 0.4189604769926518s: {'status': 'uneventful'}
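The incremental run above stops on the first page because every repository on it is already known. A minimal sketch of that kind of stopping rule follows. It assumes the API returns the most recently created repositories first; `list_incremental` and `fetch_page` are hypothetical names, and the real lister's stopping condition may differ.

```python
from typing import Callable, List, Set


def list_incremental(
    fetch_page: Callable[[int], List[str]],
    seen: Set[str],
    max_pages: int = 1000,
) -> List[str]:
    """Walk pages from page 1 and stop at the first page with nothing new."""
    new_repos: List[str] = []
    for page in range(1, max_pages + 1):
        repos = fetch_page(page)
        fresh = [repo for repo in repos if repo not in seen]
        if not fresh:
            # Empty page or only known repositories: nothing more to list.
            break
        new_repos.extend(fresh)
        seen.update(fresh)
    return new_repos


if __name__ == "__main__":
    # Fake forge with two pages, newest repositories on page 1.
    pages = {1: ["repo-c", "repo-d"], 2: ["repo-a", "repo-b"]}
    fetch = lambda page: pages.get(page, [])
    print(list_incremental(fetch, {"repo-a", "repo-b"}))
```

When nothing new is found, the sketch returns an empty list after a single request, which matches the 'uneventful' status in the log.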
- Vincent Sellier closed