Add statsd metrics in JournalClient.process
Try to capture how long the journal client stays blocked waiting for a kafka message, how long it takes to actually retrieve the messages, and how long it takes to handle them.
Migrated from D6875 (view on Phabricator)
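For reference, the metric names introduced by the diff (quoted in the hunks further down), with an informal mapping to the durations described above:

```python
# Names taken from the diff below; the mapping to the description is informal.
JOURNAL_BLOCKED_DURATION_METRIC = "swh_journal_client_blocked_duration_seconds"  # blocked waiting for a kafka message
JOURNAL_CONSUME_DURATION_METRIC = "swh_journal_client_consume_duration_seconds"  # actually retrieving the messages
JOURNAL_HANDLE_DURATION_METRIC = "swh_journal_client_handle_duration_seconds"    # handling them
JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handle_message_number"       # how many messages were handled
```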
Build has FAILED
Patch application report for D6875 (id=24926)
Rebasing onto 4e5e0098...
Current branch diff-target is up to date.
Changes applied before test
```
commit 9d78e7857866872ec260af37ea9a4350644facc7
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting
    for kafka message, how long it takes to actually retrieve the
    messages, and how long it takes to handle them.
```
Link to build: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/ See console output for more information: https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/186/console
Build is green
Patch application report for D6875 (id=24929)
Rebasing onto 4e5e0098...
Current branch diff-target is up to date.
Changes applied before test
```
commit de55fb56ab1441a0074d66b43443bce692fa755b
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Nov 19 17:16:06 2021 +0100

    Add statsd metrics in JournalClient.process

    trying to capture how long the journal client stays blocked waiting
    for kafka message, how long it takes to actually retrieve the
    messages, and how long it takes to handle them.
```
See https://jenkins.softwareheritage.org/job/DJNL/job/tests-on-diff/187/ for more details.
```python
    KafkaError._NO_OFFSET,
]

JOURNAL_BLOCKED_DURATION_METRIC = "swh_journal_client_blocked_duration_seconds"
JOURNAL_CONSUME_DURATION_METRIC = "swh_journal_client_consume_duration_seconds"
JOURNAL_HANDLE_DURATION_METRIC = "swh_journal_client_handle_duration_seconds"
JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handle_message_number"
```

Should be plural and end with `_total` rather than "number" (ref. https://prometheus.io/docs/practices/naming/).
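A minimal sketch of that rename (hypothetical name, assuming only the string value changes):

```python
# Hypothetical rename following the Prometheus counter convention:
# plural, ending in `_total`.
JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handled_messages_total"

# The call site from the diff would stay the same:
# statsd.increment(JOURNAL_MESSAGE_NUMBER_METRIC, value=batch_processed)
```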
```
290                 JOURNAL_CONSUME_DURATION_METRIC,
291                 (time.monotonic() - start_time) - elapsed,
292             )
293
294             # report the time it takes to handle messages
295             with statsd.timed(JOURNAL_HANDLE_DURATION_METRIC):
296                 batch_processed, at_eof = self.handle_messages(messages, worker_fn)
297
298             # report the number of handled messages
299             statsd.increment(JOURNAL_MESSAGE_NUMBER_METRIC, value=batch_processed)
300             total_objects_processed += batch_processed
301             if at_eof:
302                 break
303
304             last_object_time = time.monotonic() - start_time
```

`statsd.timing`/`statsd.timed` do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:

- keeping explicit timestamps using bare calls to `time.monotonic()` at specific events
- calling `statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa)` explicitly for each duration we want to track
In !211 (closed), @olasd wrote:

> `statsd.timing`/`statsd.timed` do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

good question... need to think a bit about this.

> I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:
>
> - keeping explicit timestamps using bare calls to `time.monotonic()` at specific events
> - calling `statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa)` explicitly for each duration we want to track

I agree.
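A minimal sketch of the explicit-timestamp layout agreed on above, written against the same loop. The `ts_*` names are made up for illustration, the signature is simplified, and the split between "blocked" (empty polls) and "consume" (the poll that returned messages) is one possible reading of the two metrics, not a proposed final patch:

```python
import time

from swh.core.statsd import statsd  # assuming the statsd helper the diff already uses

# constants as introduced by the diff
JOURNAL_BLOCKED_DURATION_METRIC = "swh_journal_client_blocked_duration_seconds"
JOURNAL_CONSUME_DURATION_METRIC = "swh_journal_client_consume_duration_seconds"
JOURNAL_HANDLE_DURATION_METRIC = "swh_journal_client_handle_duration_seconds"
JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handle_message_number"


def process(self, worker_fn, batch_size, timeout=1.0):
    """Illustrative fragment of JournalClient.process with explicit timestamps."""
    total_objects_processed = 0
    while True:
        ts_before_consume = time.monotonic()
        messages = self.consumer.consume(timeout=timeout, num_messages=batch_size)
        ts_after_consume = time.monotonic()

        if not messages:
            # empty poll: time spent blocked waiting on kafka
            statsd.timing(JOURNAL_BLOCKED_DURATION_METRIC, ts_after_consume - ts_before_consume)
            continue

        # non-empty poll: time it took to actually retrieve the batch
        statsd.timing(JOURNAL_CONSUME_DURATION_METRIC, ts_after_consume - ts_before_consume)

        batch_processed, at_eof = self.handle_messages(messages, worker_fn)
        ts_after_handle = time.monotonic()

        # time spent handling the batch, and how many messages were handled
        statsd.timing(JOURNAL_HANDLE_DURATION_METRIC, ts_after_handle - ts_after_consume)
        statsd.increment(JOURNAL_MESSAGE_NUMBER_METRIC, value=batch_processed)

        total_objects_processed += batch_processed
        if at_eof:
            break

    return total_objects_processed
```

If only running totals are wanted rather than timer histograms, each `statsd.timing` call could be swapped for a `statsd.increment` on a `*_seconds_total` counter; that choice is independent of the restructuring.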
```
281             messages = self.consumer.consume(timeout=timeout, num_messages=batch_size)
282             if not messages:
283                 continue
284             # report the time we stayed blocked waiting for kafka;
285             # more or less the time it took to have a successful call to consume()
286             statsd.timing(JOURNAL_BLOCKED_DURATION_METRIC, elapsed - last_object_time)
```

I don't understand how this one works. If I'm not mistaken, this only times negligible pure-Python computation:

On the first loop run, `last_object_time` is 0 and `elapsed` is the time it took to get from line 250 to line 258, which is essentially this code:

```python
start_time = time.monotonic()
last_object_time = 0
total_objects_processed = 0

# while True:

# timeout for message poll
timeout = 1.0

elapsed = time.monotonic() - start_time
```

On other loop runs, `elapsed - last_object_time` is the time it takes from line 304 to line 258, which would be this code:

```python
last_object_time = time.monotonic() - start_time

# while True:

# timeout for message poll
timeout = 1.0

elapsed = time.monotonic() - start_time
```
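To make the point concrete, the span measured on the first iteration can be run in isolation; it is just a handful of assignments, so the reported "blocked" duration is negligible. A sketch reproducing that reading, not code from the diff:

```python
import time

# First iteration: elapsed - last_object_time only covers these assignments;
# no kafka call happens in this span.
start_time = time.monotonic()
last_object_time = 0
total_objects_processed = 0
timeout = 1.0  # timeout for message poll
elapsed = time.monotonic() - start_time

print(elapsed - last_object_time)  # typically well under a millisecond
```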