In order to simulate the cluster behavior under a typical swh load, a test has to be prepared with an ingestion infrastructure running many loaders in parallel.
Some adaptations need to be done on the bench lab to have a central swh-storage responding to the workers.
A fake objstorage needs to be added to absorb the content added by the workers,
and of course all the ingestion services: a lister / scheduler / rabbitmq ...
The docker infrastructure will probably be used (or perhaps kubernetes, as grid5000 provides a k8s deployment environment).
The bench lab is easily deployable on g5k on several workers to distribute the load [1].
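For reference, a minimal sketch of how the central Cassandra-backed swh-storage with a fake (in-memory) objstorage could be wired using the standard swh.storage / swh.objstorage factories; the host names and keyspace are placeholders (not the actual bench configuration) and the exact parameter names may vary between swh.storage versions:

```python
from swh.storage import get_storage

# Central Cassandra-backed storage; contents are sent to a throw-away
# in-memory objstorage so that the benchmark does not depend on real
# object I/O. Hosts and keyspace are made-up placeholders.
storage = get_storage(
    cls="cassandra",
    hosts=["cassandra1.bench", "cassandra2.bench", "cassandra3.bench"],
    keyspace="swh",
    objstorage={"cls": "memory"},
)
```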
It works well when the load is not too high. When the number of workers is increased, the workers seem to have trouble talking to rabbitmq:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-p9ds5
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-n6pvm
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-mrcjj
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-7bn4s
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,449: INFO/MainProcess] missed heartbeat from celery@loaders-77cdd444df-lg2bd
and also an unexplained time drift:
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-lxjpl may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 356 seconds. [orig: 2021-09-30 23:46:55.447181 recv: 2021-09-30 23:40:59.633444]
[loaders-77cdd444df-flcv9 loaders]
[loaders-77cdd444df-flcv9 loaders] [2021-09-30 23:46:55,447: WARNING/MainProcess] Substantial drift from celery@loaders-77cdd444df-jd6v9 may mean clocks are out of sync. Current drift is
[loaders-77cdd444df-flcv9 loaders] 355 seconds. [orig: 2021-09-30 23:46:55.447552 recv: 2021-09-30 23:41:00.723983]
[loaders-77cdd444df-flcv9 loaders]
and several other issues with interrupted connections.
There are also storage and objstorage tuning issues to resolve in order to get a stable test environment and relevant metrics.
The loaders were finally stabilized; the instability was due to a wrong celery configuration.
Changing the pool configuration from solo to prefork solved the problem, even when the concurrency is kept at one.
The solo pool seemed well suited to an environment like this POC but, for reasons that are obvious in hindsight, it was not working as expected:
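For the record, the change boils down to two stock Celery settings; a minimal sketch (the app name and broker URL are placeholders, not the actual SWH worker configuration):

```python
from celery import Celery

# Placeholder app and broker URL; only the two settings below matter here.
app = Celery("swh-bench-loaders", broker="amqp://rabbitmq:5672//")

app.conf.update(
    worker_pool="prefork",   # previously "solo", which caused the missed heartbeats and drift
    worker_concurrency=1,    # concurrency can stay at 1; the pool type is what matters
)
```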
It was possible to increase the load while keeping a stable behavior. Now it seems Cassandra is struggling with 60 git loaders.
Several timeouts are logged on the storage side, all on the revision_missing check:
ERROR:root:Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
Traceback (most recent call last):
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf
    return f.negotiator(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__
    result = self.func(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f
    return obj_meth(**kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d
    return f(*a, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 683, in revision_missing
    return self._cql_runner.revision_missing(revisions)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 162, in newf
    self, *args, **kwargs, statement=self._prepared_statements[f.__name__]
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 609, in revision_missing
    return self._missing(statement, ids)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 337, in _missing
    rows = self._execute_with_retries(statement, [list(id_group)])
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 349, in iter
    return fut.result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/local/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/cql.py", line 293, in _execute_with_retries
    return self._session.execute(statement, args, timeout=1000.0)
  File "cassandra/cluster.py", line 2618, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4894, in cassandra.cluster.ResponseFuture.result
cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'consistency': 'LOCAL_QUORUM', 'required_responses': 2, 'received_responses': 0}
The objstorage also shows some stability issues:
File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request rv = self.dispatch_request() File "/srv/softwareheritage/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 153, in newf return f.negotiator(*args, **kwargs) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/negotiation.py", line 81, in __call__ result = self.func(*args, **kwargs) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 460, in _f return obj_meth(**kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 24, in d return f(*a, **kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/metrics.py", line 77, in d r = f(*a, **kw) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 265, in content_add return self._content_add(list(contents), with_data=True) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/cassandra/storage.py", line 197, in _content_add c for c in contents if c.status != "absent" File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/storage/objstorage.py", line 62, in content_add summary = self.objstorage.add_batch({cont.sha1: cont.data for cont in contents}) File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/objstorage/api/client.py", line 47, in add_batch {"contents": contents, "check_presence": check_presence,}, File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 272, in post **opts, File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/api/__init__.py", line 256, in raw_verb raise self.api_exception(e)swh.objstorage.exc.ObjStorageAPIError: An unexpected error occurred in the api backend: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
The I/O, mostly reads, is quite high on the Cassandra servers:
Immediate actions for the next run:
as there is plenty of memory on the servers, create a memory volume on all the worker nodes and start several objstorage instances writing their content to it
check the revision_missing function for possible improvements (the general shape of this function is sketched below)
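For context, the *_missing helpers all follow the same pattern: query which of the requested ids already exist and return the set difference. A simplified sketch of that pattern (not the actual swh.storage.cassandra.cql code, which adds prepared-statement caching, tenacity retries and metrics; the group size here is arbitrary):

```python
from typing import List, Set

def revision_missing(session, ids: List[bytes], group_size: int = 20) -> Set[bytes]:
    """Return the subset of `ids` that is absent from the revision table."""
    statement = session.prepare("SELECT id FROM revision WHERE id IN ?")
    missing: Set[bytes] = set(ids)
    for i in range(0, len(ids), group_size):
        group = ids[i : i + group_size]
        rows = session.execute(statement, [group])  # one round-trip per group of ids
        missing -= {row.id for row in rows}
    return missing
```

Each group triggers a LOCAL_QUORUM read, which helps explain why these calls are the first to time out when the Cassandra nodes are saturated with reads.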
Some runs with the fix:
It globally improves the stability of the benchmark by reducing the timeouts.
The first 6 runs used different algorithms with 100 workers:
group_pk_concurrent
group_pk_concurrent
concurrent
concurrent
grouped_naive
group_pk_serial
The last one used 150 workers and the group_pk_concurrent algorithm.
There is not a big difference between the runs in number of visits/s; the differences are mostly in the number of client timeouts:
It's not clear yet where the bottleneck is located, but according to the disk I/O profile it looks like it is due to reads, probably still in the _missing methods.
This is the impact of configuring the insert batch size for the add operations. Tested values were 500, 1000 and 2000.
A full repair of a Cassandra node was in progress due to a data loss (disk erased by someone else's experiment after the disk reservation expired).
Apparently a smaller batch size gives better performance for all the scenarios.
The 1000 value shows a weird behavior for the directories; I launched the test several times for this size but it always gives the same kind of results, and I can't explain why.
This is the impact of the different algorithms used to retrieve the missing objects.
The configuration is the same as previously; the storage is patched with D6423, D6494 and D6495.
The first run, without D6423, is a version patched with D6494 and D6495 only.
The results are quite different depending on the type of object. It looks like the group_pk_concurrent implementations could lead to an overall performance improvement (revision, directory).
The counterpart is that, as the *_missing functions are faster, the add functions are more heavily stressed.
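To illustrate the general idea behind the concurrent variants (a rough sketch of the approach only, not the code from the diffs above), the per-group SELECTs can be issued with execute_async and collected afterwards instead of being run one after the other; the group_pk variants presumably also group the ids by partition key so that each query targets a single partition:

```python
from typing import List, Set

def missing_concurrent(session, statement, ids: List[bytes], group_size: int = 20) -> Set[bytes]:
    """Same lookup as the serial sketch above, but the per-group queries run concurrently."""
    futures = [
        session.execute_async(statement, [ids[i : i + group_size]])
        for i in range(0, len(ids), group_size)
    ]
    missing: Set[bytes] = set(ids)
    for future in futures:
        missing -= {row.id for row in future.result()}  # block until each group result arrives
    return missing
```

This trades lower *_missing latency for more simultaneous read pressure on the cluster, consistent with the observation above that once the *_missing calls get faster, the add path becomes the stressed part.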
The maximum number of visits/s reached was around 20 ops/s.
The limit comes from the Cassandra HDDs, which are constrained by their read bandwidth, so the results are specific to the current hardware.
It seems a potential configuration could be:
insert batch size reduced to 500 (currently hard-coded to 1000 in the code); a sketch of the idea follows
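A minimal illustration of the idea, with a hypothetical INSERT_BATCH_SIZE constant standing in for the value that is currently hard-coded (the helper names are made up, not existing swh functions):

```python
from itertools import islice

INSERT_BATCH_SIZE = 500  # hypothetical setting; currently hard-coded to 1000 in the code

def batches(iterable, size=INSERT_BATCH_SIZE):
    """Yield successive chunks of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# e.g. issue inserts of 500 rows at a time instead of 1000:
# for chunk in batches(rows):
#     insert_rows(chunk)  # hypothetical insert helper
```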
Changing the cache size doesn't seem very effective for the particular workload of the loaders:
Except for some contextual differences, probably due to the different origins loaded, the performances are quite similar.
The hit ratios between the different configurations are also very close