Triggering an end-to-end check results in failure [1] even though the actual loading is fine [2].
And the report UI shows a traceback [3] that is nowhere to be found in the actual loader output [2].
06:05 <+swhbot> icinga PROBLEM: service production Check save-code-now panda with type svn end-to-end on pergamon.softwareheritage.org is CRITICAL: SAVECODENOW CRITICAL - Save code now request for origin ('svn', 'https://subversion.renater.fr/anonscm/svn/panda') took 11.53s and failed.
Traceback (most recent call last):
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request
Reason: Error reading from remote server
Since the loader is fine, the issue seems to lie in the communication between the node
triggering the check and the rest of the stack. So let's try triggering things from pergamon.
From pergamon, triggering the icinga command with the hgview origin results in the
same kind of error as the report (with logs this time [1]). The loader output is still fine [2].
So it's a monitoring problem, not another kind of issue.
[1]
swhworker@pergamon:~$ /usr/bin/swh icinga_plugins --warning 300 --critical 600 check-savecodenow --swh-web-url https://archive.softwareheritage.org origin https://foss.heptapod.net/mercurial/hgview --visit-type hg
Traceback (most recent call last):
  File "/usr/bin/swh", line 11, in <module>
    load_entry_point('swh.core==1.0.0', 'console_scripts', 'swh')()
  File "/usr/lib/python3/dist-packages/swh/core/cli/__init__.py", line 185, in main
    return swh(auto_envvar_prefix="SWH")
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/cli.py", line 87, in check_scn_origin
    sys.exit(SaveCodeNowCheck(ctx.obj, origin, visit_type).main())
  File "/usr/lib/python3/dist-packages/swh/icinga_plugins/save_code_now.py", line 64, in main
    ), f"Unexpected response: {response}, {response.text}"
AssertionError: Unexpected response: <Response [502]>, <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>
Finally reproduced through curl (and subsequent cached queries are fine after that [1]).
$ curl https://webapp1.internal.softwareheritage.org/api/1/origin/save/hg/url/https://foss.heptapod.net/mercurial/hgview/
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>
</body></html>
[1] migrated/migration$1261 (it's way too much information for the icinga check btw)
The congestion seems to occur in the communication between the scheduler and the webapp:
a high number of task runs to convert back and forth...
Nope, in the end, it's more the back and forth between swh-web and swh-storage. See
below for details.
The following problem occurs for origins with a high number of visits. So far the
icinga checks run on 3 distinct origins, each with around 8k visits.
The actual webapp code executes an API call to retrieve the list of origin visits [2].
Then, for each origin visit, it calls the origin visit status API to retrieve its
associated status [2]. Those endpoints are paginated, but we do a full retrieval of
those origin visit statuses [1] and then sort them. This ends up in a high number of
round-trip requests, which exceeds the 60-second timeout [5], and then this issue
happens.
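To make that concrete, here is a minimal sketch of the access pattern in Python. The
client and method names (storage, origin_visit_get, origin_visit_status_get) and the
page size are assumptions for illustration, not the actual swh-web code.

# Illustrative sketch only: `storage` stands for an RPC client to swh-storage;
# method names and signatures here are assumptions, not the real swh-web code.
def get_all_visit_statuses(storage, origin_url, page_size=100):
    # 1. paginate over all visits of the origin (~8k for the origins above)
    visits = []
    page_token = None
    while True:
        page = storage.origin_visit_get(
            origin_url, page_token=page_token, limit=page_size
        )
        visits.extend(page.results)
        page_token = page.next_page_token
        if page_token is None:
            break

    # 2. one more paginated call per visit to fetch its statuses:
    #    thousands of extra round-trips for a busy origin
    statuses = []
    for visit in visits:
        statuses.extend(
            storage.origin_visit_status_get(origin_url, visit.visit).results
        )

    # 3. sort everything client-side only after the full transfer;
    #    with ~8k visits this easily blows past the 60s timeout [5]
    return sorted(statuses, key=lambda s: s.date, reverse=True)

With ~8k visits and a page size of 100, step 1 alone is around 80 round-trips and step 2
adds roughly 8k more, which is how the 60-second budget gets exceeded.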
We can reproduce the encountered issue just by browsing the UI [3] on origins with a
high number of visits (after a new save code now visit, for example). This problem
occurs both in staging [3] and production [4].
As another data point, from time to time one of those origin-visit API calls also ends
up timing out, resulting in yet another error [6] (see the related Sentry link in the
description). That occurs because we have a hard-coded 500ms timeout on the storage
side, and the query for those origins largely exceeds it [6]. That's not something
we'll be focusing on immediately.
15:48:15 softwareheritage@belvedere:5432=> select * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE o.url = 'https://foss.heptapod.net/mercurial/hgview' AND ovs.snapshot is not null ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC LIMIT 1;
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
|  origin   | visit |             date             | type |    id     |                    url                     |             date              | status | metadata |                  snapshot                  | type |
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
| 155049819 | 27873 | 2022-02-02 14:43:06.64555+00 | hg   | 155049819 | https://foss.heptapod.net/mercurial/hgview | 2022-02-02 14:43:20.744576+00 | full   | (null)   | \x94aae2785f73ccaec5983b9f5ce071eed323eccf | hg   |
+-----------+-------+------------------------------+------+-----------+--------------------------------------------+-------------------------------+--------+----------+--------------------------------------------+------+
(1 row)

Time: 8490.886 ms (00:08.491)   # <----- 8490ms > 500ms so timeout through swh-storage
explain select * FROM origin_visit ov INNER JOIN origin o ON o.id = ov.origin INNER JOIN origin_visit_status ovs USING (origin, visit) WHERE o.url = 'https://foss.heptapod.net/mercurial/hgview' AND ovs.snapshot is not null ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC LIMIT 1;
+---------------------------------------------------------------------------------------------------------------------------+
|                                                         QUERY PLAN                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
| Limit  (cost=658.51..658.51 rows=1 width=173)                                                                               |
|   ->  Sort  (cost=658.51..658.53 rows=6 width=173)                                                                          |
|         Sort Key: ov.date DESC, ov.visit DESC, ovs.date DESC                                                                |
|         ->  Nested Loop  (cost=1.28..658.48 rows=6 width=173)                                                               |
|               Join Filter: (o.id = ovs.origin)                                                                              |
|               ->  Nested Loop  (cost=0.58..639.03 rows=7 width=80)                                                          |
|                     ->  Index Scan using origin_url_idx1 on origin o  (cost=0.00..2.02 rows=1 width=52)                     |
|                           Index Cond: (url = 'https://foss.heptapod.net/mercurial/hgview'::text)                            |
|                     ->  Index Scan using origin_visit_pkey on origin_visit ov  (cost=0.58..630.63 rows=638 width=28)        |
|                           Index Cond: (origin = o.id)                                                                       |
|               ->  Index Scan using origin_visit_status_pkey on origin_visit_status ovs  (cost=0.70..2.75 rows=2 width=85)   |
|                     Index Cond: ((origin = ov.origin) AND (visit = ov.visit))                                               |
|                     Filter: (snapshot IS NOT NULL)                                                                          |
+---------------------------------------------------------------------------------------------------------------------------+
(13 rows)

Time: 6.089 ms
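As an illustration of the 500ms cap mentioned above, the following sketch shows how a
PostgreSQL statement timeout cancels that query; the connection string is a placeholder
and the exact mechanism used by swh-storage may differ.

# Rough illustration of a 500ms server-side statement timeout (placeholders,
# not the actual swh-storage code): the ~8.5s query above gets canceled.
import psycopg2

conn = psycopg2.connect("dbname=softwareheritage")  # placeholder connection string
try:
    with conn, conn.cursor() as cur:
        # scoped to the current transaction, like the storage-side 500ms cap
        cur.execute("SET LOCAL statement_timeout = '500ms'")
        cur.execute(
            """
            SELECT * FROM origin_visit ov
            INNER JOIN origin o ON o.id = ov.origin
            INNER JOIN origin_visit_status ovs USING (origin, visit)
            WHERE o.url = %s AND ovs.snapshot IS NOT NULL
            ORDER BY ov.date DESC, ov.visit DESC, ovs.date DESC
            LIMIT 1
            """,
            ("https://foss.heptapod.net/mercurial/hgview",),
        )
        print(cur.fetchone())
except psycopg2.errors.QueryCanceled:
    # this cancellation is what surfaces as the storage-side timeout error [6]
    print("statement_timeout (500ms) exceeded, query canceled")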
As a first workaround, D7076 improves the cache use, which should help when the cache is
populated (prior to that diff, we mostly iterated over all visits even with a populated cache).
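For context, here is a minimal sketch of the caching idea; the names, cache layout and
the fetch_statuses_since helper are hypothetical, not the actual D7076 change.

# Illustrative sketch of the caching idea, not the actual swh-web/D7076 code.
from typing import Callable, Dict, List

# origin_url -> visit-status dicts already known from previous requests
_visit_status_cache: Dict[str, List[dict]] = {}

def get_visit_statuses(
    origin_url: str,
    fetch_statuses_since: Callable[[str, int], List[dict]],
) -> List[dict]:
    """Return all visit statuses for origin_url, newest first.

    `fetch_statuses_since(origin_url, last_visit)` is a hypothetical helper
    returning only the statuses of visits newer than `last_visit`.
    """
    cached = _visit_status_cache.get(origin_url, [])
    last_known_visit = max((s["visit"] for s in cached), default=0)

    # only hit storage for visits the cache has not seen yet,
    # instead of re-walking all ~8k visits on every request
    new_statuses = fetch_statuses_since(origin_url, last_known_visit)

    merged = cached + new_statuses
    _visit_status_cache[origin_url] = merged
    return sorted(merged, key=lambda s: s["date"], reverse=True)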
Another diff, D7078, tries to improve the information retrieval, but it's proving
challenging. We may actually need a new storage endpoint to simplify the overall
status computation in the webapp (migrated/migration$1265).
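Purely as a sketch of what such an endpoint could look like (hypothetical interface,
not an existing swh-storage API):

# Hypothetical interface sketch for the endpoint idea discussed in
# migrated/migration$1265; not an existing swh-storage API.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class VisitWithStatus:
    visit: int
    date: datetime
    type: str
    status: str                # e.g. "created", "ongoing", "full", "failed"
    snapshot: Optional[bytes]  # snapshot id of the latest status, if any

def origin_visit_get_with_statuses(
    origin_url: str,
    limit: int = 100,
    page_token: Optional[str] = None,
) -> List[VisitWithStatus]:
    """Return the visits of `origin_url` together with their latest status,
    newest first, computed server-side in a single JOIN instead of one
    status call per visit from the webapp."""
    raise NotImplementedError("interface sketch only")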
Deployed, and the icinga checks are going back to green now.
18:04 <+swhbot> icinga RECOVERY: service production Check save-code-now hgview with type hg end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('hg', 'https://foss.heptapod.net/mercurial/hgview') took 10.82s and succeeded.
18:06 <+ardumont> hi ho, it's fixed in the end
18:06 <+ardumont> vsellier: ^
18:06 <+ardumont> \o/
18:06 <+ardumont> "sur le fil"
18:06 <+swhbot> icinga RECOVERY: service staging Check save-code-now hgview with type hg end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('hg', 'https://foss.heptapod.net/mercurial/hgview') took 21.40s and succeeded.
18:13 <+swhbot> icinga RECOVERY: service production Check save-code-now panda with type svn end-to-end on pergamon.softwareheritage.org is OK: SAVECODENOW OK - Save code now request for origin ('svn', 'https://subversion.renater.fr/anonscm/svn/panda') took 21.50s and succeeded.