In order to simulate the cluster behavior under a typical swh load, a test has to be prepared with an ingestion infrastructure running many loaders in parallel.
Some adaptations need to be done on the bench lab to have a central swh-storage responding to the workers.
A fake objstorage needs to be added to absorb the content added by the workers,
and of course all the ingestion services: a lister / scheduler / rabbitmq ...
The docker infrastructure will probably be used (or perhaps kubernetes, as grid5000 provides a k8s deployment environment).
The bench lab is easily deployable on g5k on several workers to distribute the load [1].
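For reference, a minimal sketch of how the central Cassandra-backed swh-storage with a fake (in-memory) objstorage could be wired using the standard swh.storage / swh.objstorage factories; the host names and keyspace are placeholders (not the actual bench configuration) and the exact parameter names may vary between swh.storage versions:

```python
from swh.storage import get_storage

# Central Cassandra-backed storage; contents are sent to a throw-away
# in-memory objstorage so that the benchmark does not depend on real
# object I/O. Hosts and keyspace are made-up placeholders.
storage = get_storage(
    cls="cassandra",
    hosts=["cassandra1.bench", "cassandra2.bench", "cassandra3.bench"],
    keyspace="swh",
    objstorage={"cls": "memory"},
)
```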
It works well when the load is not too high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd
and also an unexplained time drift:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 356 seconds. [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]
[loaders-77cdd444df-flcv9 loaders]
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 355 seconds. [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]
[loaders-77cdd444df-flcv9 loaders]
and several other issues with interrupted connections.
There are also storage and objstorage tuning issues to resolve in order to get a stable test environment and relevant metrics.
The loaders were finally stabilized; the instability was due to a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even when the concurrency is kept at one.
The solo pool seemed well suited to an environment like this POC but, for reasons that are obvious in hindsight, it was not working as expected:
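For the record, the change boils down to two stock Celery settings; a minimal sketch (the app name and broker URL are placeholders, not the actual SWH worker configuration):

```python
from celery import Celery

# Placeholder app and broker URL; only the two settings below matter here.
app = Celery("swh-bench-loaders", broker="amqp://rabbitmq:5672//")

app.conf.update(
    worker_pool="prefork",   # previously "solo", which caused the missed heartbeats and drift
    worker_concurrency=1,    # concurrency can stay at 1; the pool type is what matters
)
```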
It was possible to increase the load while keeping a stable behavior. Now it seems Cassandra is struggling with 60 git loaders.
Several timeouts are logged on the storage side, all on the revision_missing check:
ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 683, in revision_missing
    return self._cql_runner.revision_missing(revisions)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 162, in newf
    self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 609, in revision_missing
    return self._missing(statement, ids)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 337, in _missing
    rows = self._execute_with_retries(statement, [list(id_group)])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter
    return fut.result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 293, in _execute_with_retries
    return self._session.execute(statement, args, timeout=1000.0)
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
The objstorage also shows some stability issues:
File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request rv = self.dispatch_request() File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf return f.negotiator(*args, **kwargs) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__ result = self.func(*args, **kwargs) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f return obj_meth(**kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d return f(*a, **kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 77, in d r = f(*a, **kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 265, in content_add return self._content_add(list(contents), with_data=True) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 197, in _content_add c for c in contents if c.status != "absent" File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/objstorage.py", line 62, in content_add summary = self.objstorage.add_batch({cont.sha1: cont.data for cont in contents}) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/objstorage/api/client.py", line 47, in add_batch {"contents": contents, "check_presence": check_presence,}, File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 272, in post **opts, File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 256, in raw_verb raise self.api_exception(e)swh.objstorage.exc.ObjStorageAPIError: An unexpected error occurred in the api backend: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
The I/O, mostly reads, is quite high on the Cassandra servers:
Immediate actions for the next run:
as there is plenty of memory on the servers, create a memory volume on all the worker nodes and start several objstorage instances writing their content to it
check the revision_missing function for possible improvements (the general shape of this function is sketched below)
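For context, the *_missing helpers all follow the same pattern: query which of the requested ids already exist and return the set difference. A simplified sketch of that pattern (not the actual swh.storage.cassandra.cql code, which adds prepared-statement caching, tenacity retries and metrics; the group size here is arbitrary):

```python
from typing import List, Set

def revision_missing(session, ids: List[bytes], group_size: int = 20) -> Set[bytes]:
    """Return the subset of `ids` that is absent from the revision table."""
    statement = session.prepare("SELECT id FROM revision WHERE id IN ?")
    missing: Set[bytes] = set(ids)
    for i in range(0, len(ids), group_size):
        group = ids[i : i + group_size]
        rows = session.execute(statement, [group])  # one round-trip per group of ids
        missing -= {row.id for row in rows}
    return missing
```

Each group triggers a LOCAL_QUORUM read, which helps explain why these calls are the first to time out when the Cassandra nodes are saturated with reads.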
Some runs with the fix:
It globally improves the stability of the benchmark by reducing the timeouts.
The first 6 runs used different algorithms with 100 workers:
group_pk_concurrent
group_pk_concurrent
concurrent
concurrent
grouped_naive
group_pk_serial
The last one used 150 workers and the group_pk_concurrent algorithm.
There is not a big difference between the runs in number of visits/s; the differences are mostly in the number of client timeouts:
It's not clear yet where the bottleneck is located, but according to the disk I/O profile it looks like it is due to reads, probably still in the _missing methods.
This is the impact of configuring the insert batch size for the add operations. Tested values were 500, 1000 and 2000.
A full repair of a Cassandra node was in progress due to a data loss (disk erased by someone else's experiment after the disk reservation expired).
Apparently a smaller batch size gives better performance for all the scenarios.
The 1000 value shows a weird behavior for the directories; I launched the test several times for this size but it always gives the same kind of results, and I can't explain why.
This is the impact of the different algorithms used to retrieve the missing objects.
The configuration is the same as previously; the storage is patched with D6423, D6494 and D6495.
The first run, without D6423, is a version patched with D6494 and D6495 only.
The results are quite different depending on the type of object. It looks like the group_pk_concurrent implementations could lead to an overall performance improvement (revision, directory).
The counterpart is that, as the *_missing functions are faster, the add functions are more heavily stressed.
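To illustrate the general idea behind the concurrent variants (a rough sketch of the approach only, not the code from the diffs above), the per-group SELECTs can be issued with execute_async and collected afterwards instead of being run one after the other; the group_pk variants presumably also group the ids by partition key so that each query targets a single partition:

```python
from typing import List, Set

def missing_concurrent(session, statement, ids: List[bytes], group_size: int = 20) -> Set[bytes]:
    """Same lookup as the serial sketch above, but the per-group queries run concurrently."""
    futures = [
        session.execute_async(statement, [ids[i : i + group_size]])
        for i in range(0, len(ids), group_size)
    ]
    missing: Set[bytes] = set(ids)
    for future in futures:
        missing -= {row.id for row in future.result()}  # block until each group result arrives
    return missing
```

This trades lower *_missing latency for more simultaneous read pressure on the cluster, consistent with the observation above that once the *_missing calls get faster, the add path becomes the stressed part.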
The maximum number of visits/s reached was around 20 ops/s.
The limit comes from the Cassandra HDDs, which are constrained by their read bandwidth, so the results are specific to the current hardware.
It seems a potential configuration could be:
insert batch size reduced to 500 (currently hard-coded to 1000 in the code); a sketch of the idea follows
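A minimal illustration of the idea, with a hypothetical INSERT_BATCH_SIZE constant standing in for the value that is currently hard-coded (the helper names are made up, not existing swh functions):

```python
from itertools import islice

INSERT_BATCH_SIZE = 500  # hypothetical setting; currently hard-coded to 1000 in the code

def batches(iterable, size=INSERT_BATCH_SIZE):
    """Yield successive chunks of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# e.g. issue inserts of 500 rows at a time instead of 1000:
# for chunk in batches(rows):
#     insert_rows(chunk)  # hypothetical insert helper
```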
Changing the cache size doesn't seem very effective for the particular workload of the loaders:
Except for some contextual differences, probably due to the different origins loaded, the performances are quite similar.
The hit ratios between the different configurations are also very close