gitlab lister: make full listing on large instances more robust to concurrent writes
Investigate and fix:
Jun 30 09:00:54 worker15 python3[23590]: [2019-06-30 09:00:54,448: ERROR/ForkPoolWorker-2] Task swh.lister.gitlab.tasks.RangeGitLabLister[474d600e-ff5c-43b0-83f9-afc29b1cfd88] raised unexpected: IntegrityError('(psycopg2.IntegrityError) duplicate key value violates unique constraint "gitlab_repo_pkey"\nDETAIL: Key (uid)=(debian/nathanruiz-guest/apt) already exists.\n',)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
context)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
cursor.execute(statement, parameters)
psycopg2.IntegrityError: duplicate key value violates unique constraint "gitlab_repo_pkey"
DETAIL: Key (uid)=(debian/nathanruiz-guest/apt) already exists.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/lib/python3/dist-packages/swh/scheduler/task.py", line 45, in __call__
return super().__call__(*args, **kwargs)
File "/usr/lib/python3/dist-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/usr/lib/python3/dist-packages/swh/lister/gitlab/tasks.py", line 36, in range_gitlab_lister
lister.run(min_bound=start, max_bound=end)
File "/usr/lib/python3/dist-packages/swh/lister/core/page_by_page_lister.py", line 123, in run
checks=check_existence)
File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 492, in ingest_data
injected = self.inject_repo_data_into_db(models_list)
File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 435, in inject_repo_data_into_db
injected_repos[m['uid']] = self.db_inject_repo(m)
File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 372, in db_inject_repo
sql_repo = self.db_query_equal('uid', model_dict['uid'])
File "/usr/lib/python3/dist-packages/swh/lister/core/lister_base.py", line 335, in db_query_equal
.filter(key == value).first()
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2659, in first
ret = list(self[0:1])
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2457, in __getitem__
return list(res)
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/query.py", line 2760, in __iter__
self.session._autoflush()
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 1303, in _autoflush
util.raise_from_cause(e)
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
raise value
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 1293, in _autoflush
self.flush()
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2019, in flush
self._flush(objects)
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2137, in _flush
transaction.rollback(_capture_exception=True)
File "/usr/lib/python3/dist-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
raise value
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/session.py", line 2101, in _flush
flush_context.execute()
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
rec.execute(self)
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
uow
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
mapper, table, insert)
File "/usr/lib/python3/dist-packages/sqlalchemy/orm/persistence.py", line 767, in _emit_insert_statements
execute(statement, multiparams)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/usr/lib/python3/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
context)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 185, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
context)
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 450, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (psycopg2.IntegrityError) duplicate key value violates unique constraint "gitlab_repo_pkey"
DETAIL: Key (uid)=(debian/nathanruiz-guest/apt) already exists.
[SQL: 'INSERT INTO gitlab_repo (name, full_name, html_url, origin_url, origin_type, last_seen, task_id, uid, instance) VALUES (%(name)s, %(full_name)s, %(html_url)s, %(origin_url)s, %(origin_type)s, %(last_seen)s, %(task_id)s, %(uid)s, %(instance)s)'] [parameters: {'instance': 'debian', 'last_seen': datetime.datetime(2019, 6, 30, 9, 0, 36, 155540), 'origin_url': 'https://salsa.debian.org/nathanruiz-guest/apt.git', 'full_name': 'nathanruiz-guest/apt', 'name': 'apt', 'html_url': 'https://salsa.debian.org/nathanruiz-guest/apt', 'task_id': None, 'origin_type': 'git', 'uid': 'debian/nathanruiz-guest/apt'}]
Jun 30 09:00:54 worker15 python3[23574]: [2019-06-30 09:00:54,518: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.RangeGitLabLister[71da1490-b1ac-4d93-bc7f-5402472e05d1]
@douardda and I may have run into this already. If my memory serves me well, it was possibly caused by overlapping range intervals.
In any case, this must be dealt with:
- either by checking the range computations to avoid overlap,
- or, as a fallback (for example if the source of the error is not found), by trapping those errors and making sure the main process continues, to avoid leaving holes in the listing.
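For discussion, here is a minimal sketch of both options. It is not the actual swh.lister code: `split_ranges` and `inject_repos` are hypothetical helpers, the table is reduced to two columns, and the standard library `sqlite3` module stands in for the PostgreSQL/SQLAlchemy setup so that the example runs on its own.

```python
import sqlite3


def split_ranges(first_page, last_page, size):
    """Split [first_page, last_page] into inclusive ranges that never overlap.

    Hypothetical helper: each range starts right after the previous one ends,
    so two range tasks never list the same page.
    """
    ranges = []
    start = first_page
    while start <= last_page:
        end = min(start + size - 1, last_page)
        ranges.append((start, end))
        start = end + 1
    return ranges


def inject_repos(conn, repos):
    """Insert repos one by one; a duplicate uid is recorded and skipped.

    Hypothetical helper: each insert runs in its own transaction, so an
    IntegrityError on one repo does not abort the rest of the batch.
    """
    inserted, skipped = [], []
    for repo in repos:
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(
                    "INSERT INTO gitlab_repo (uid, origin_url) VALUES (?, ?)",
                    (repo["uid"], repo["origin_url"]),
                )
            inserted.append(repo["uid"])
        except sqlite3.IntegrityError:
            # Another worker already wrote this uid: keep going.
            skipped.append(repo["uid"])
    return inserted, skipped


if __name__ == "__main__":
    print(split_ranges(1, 10, 4))

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gitlab_repo (uid TEXT PRIMARY KEY, origin_url TEXT)")
    # Simulate a row written by a concurrent task.
    conn.execute(
        "INSERT INTO gitlab_repo VALUES (?, ?)",
        ("debian/nathanruiz-guest/apt", "https://salsa.debian.org/nathanruiz-guest/apt.git"),
    )
    conn.commit()
    print(inject_repos(conn, [
        {"uid": "debian/nathanruiz-guest/apt",
         "origin_url": "https://salsa.debian.org/nathanruiz-guest/apt.git"},
        {"uid": "debian/example/other",  # made-up second repo for the demo
         "origin_url": "https://salsa.debian.org/example/other.git"},
    ]))
```

In the real lister the equivalent of the `except` branch would probably update the existing row (e.g. `last_seen`) rather than just skip it; that choice is left open here.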
Migrated from T1865 (view on Phabricator)