[cassandra] Profile the replayer cpu consumption

marked this issue as related to #4373 (closed)

added Storage manager System administration priority:Normal labels

Here is some profiling output from three of the replayers:

directory

swh@storage-replayer-directory-798dbd5b84-s648s:~$ time python -m cProfile -o /tmp/directory.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 1000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.185:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.185:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	13m36.035s
user	2m0.203s
sys	0m19.395s

directory.pyprof.gz

origin-visit

swh@storage-replayer-origin-visit-76f6bf9d75-znqfs:~$ time python -m cProfile -o /tmp/origin-visit.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 10000
WARNING:cassandra.cluster:Downgrading core protocol version from 66 to 65 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
WARNING:cassandra.cluster:Downgrading core protocol version from 65 to 5 for 192.168.100.181:9042. To avoid this, it is best practice to explicitly set Cluster(protocol_version) to the version supported by your cluster. http://datastax.github.io/python-driver/api/cassandra/cluster.html#cassandra.cluster.Cluster.protocol_version
INFO:cassandra.policies:Using datacenter 'sesi_rocquencourt' for DCAwareRoundRobinPolicy (via host '192.168.100.181:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
Done.

real	7m43.700s
user	2m42.825s
sys	0m27.594s

origin-visit.pyprof.gz

revision

swh@storage-replayer-revision-d7f4c666-prwd5:~$ time python -m cProfile -o /tmp/revision.pyprof /opt/swh/.local/bin/swh storage replay --stop-after-objects 20000

... many log lines like the following one ...
ERROR:swh.storage.replay:Object has id a1e746dc5db73c6f2a6665367d3a563181a9691e, but it should be 2af7da7563c6d41ad0d2a35c6e1aa9e01b8aee6f: Revision(message=b'Update versions in documentation\n', author=Person(fullname=b'Mark <REDACTED> <REDACTED@REDACTED.com>', name=b'Mark REDACTED', email=b'REDACTED@REDACTED.com'), ..., date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), committer_date=TimestampWithTimezone(timestamp=Timestamp(seconds=1376758767, microseconds=0), offset_bytes=b'+0000'), type=RevisionType.GIT, directory=hash_to_bytes('f24d2178f07949b36876e7749f5b392610fdb31e'), synthetic=False, metadata=None, parents=(b'\xff\x8f\xe7q!\x02\x9e\x00@/\x9fr\xf9\xaa\xf1O@\xde\x07X',), id=hash_to_bytes('a1e746dc5db73c6f2a6665367d3a563181a9691e'), extra_headers=(), raw_manifest=None)
...
real	8m8.459s
user	4m13.189s
sys	0m29.120s

revision.pyprof.gz
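
For anyone who wants to look into the attached profiles, here is a minimal sketch using the standard-library `pstats` module. It assumes each `.pyprof.gz` attachment is a plain gzip of the file written by `python -m cProfile -o ...` and has been gunzipped locally. The helper name `top_functions` is hypothetical and not part of swh.

```python
import pstats


def top_functions(profile_path, limit=10):
    """Return the `limit` most expensive functions by cumulative time.

    Each entry is (filename, line, function_name, cumulative_seconds).
    """
    stats = pstats.Stats(profile_path)
    # Stats.stats maps (filename, line, name) to
    # (primitive_calls, total_calls, tottime, cumtime, callers).
    rows = [
        (filename, line, name, cumtime)
        for (filename, line, name), (_pc, _nc, _tt, cumtime, _callers)
        in stats.stats.items()
    ]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:limit]


def format_rows(rows):
    """Render the rows as one text line per function."""
    return [
        f"{cumtime:10.3f}s  {name}  ({filename}:{line})"
        for filename, line, name, cumtime in rows
    ]
```

Usage would look like `print("\n".join(format_rows(top_functions("directory.pyprof"))))`, where the path is an assumption about where the file was extracted. Sorting by cumulative time shows which call trees dominate. Sorting by `tottime` instead would point at the individual hot functions.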