
Add statsd metrics in JournalClient.process

3 unresolved threads

Trying to capture how long the journal client stays blocked waiting for Kafka messages, how long it takes to actually retrieve the messages, and how long it takes to handle them.
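For reference, the three statsd primitives used in the diff below behave roughly as in this minimal sketch. It assumes a datadog-flavoured client such as the one in swh.core.statsd; handle_batch() is a hypothetical stand-in for the real message handler:

    import time

    from swh.core.statsd import statsd  # assumption: the client swh.journal uses

    JOURNAL_HANDLE_DURATION_METRIC = "swh_journal_client_handle_duration_seconds"
    JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handle_message_number"

    def handle_batch():
        pass  # hypothetical stand-in for handle_messages()

    # timing() sends one timer sample, with a duration computed by the caller
    t0 = time.monotonic()
    handle_batch()
    statsd.timing(JOURNAL_HANDLE_DURATION_METRIC, time.monotonic() - t0)

    # timed() is the context-manager form: it measures the wall-clock time
    # spent inside the block and sends it as a timer sample on exit
    with statsd.timed(JOURNAL_HANDLE_DURATION_METRIC):
        handle_batch()

    # increment() bumps a plain counter by `value`
    statsd.increment(JOURNAL_MESSAGE_NUMBER_METRIC, value=42)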


Migrated from D6875 (view on Phabricator)

Merge request reports

Closed by Phabricator Migration user 3 years ago (Jan 14, 2022 9:40am UTC)

Merge details

  • The changes were not merged.

Activity

     28   29       KafkaError._NO_OFFSET,
     29   30   ]
     30   31
          32  +JOURNAL_BLOCKED_DURATION_METRIC = "swh_journal_client_blocked_duration_seconds"
          33  +JOURNAL_CONSUME_DURATION_METRIC = "swh_journal_client_consume_duration_seconds"
          34  +JOURNAL_HANDLE_DURATION_METRIC = "swh_journal_client_handle_duration_seconds"
          35  +JOURNAL_MESSAGE_NUMBER_METRIC = "swh_journal_client_handle_message_number"
  • olasd @olasd started a thread on the diff

          290  +        JOURNAL_CONSUME_DURATION_METRIC,
          291  +        (time.monotonic() - start_time) - elapsed,
          292  +    )
     277  293
     278       -    batch_processed, at_eof = self.handle_messages(messages, worker_fn)
          294  +    # report the time it takes to handle messages
          295  +    with statsd.timed(JOURNAL_HANDLE_DURATION_METRIC):
          296  +        batch_processed, at_eof = self.handle_messages(messages, worker_fn)
          297  +
          298  +    # report the number of handled messages
          299  +    statsd.increment(JOURNAL_MESSAGE_NUMBER_METRIC, value=batch_processed)
     279  300       total_objects_processed += batch_processed
     280  301       if at_eof:
     281  302           break
     282  303
          304  +    last_object_time = time.monotonic() - start_time
  • statsd.timing/statsd.timed do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

    I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:

    • keeping explicit timestamps using bare calls to time.monotonic() at specific events
    • calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track (see the sketch after this list)
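
    A fragment-level sketch of that style, meant to sit inside JournalClient.process (the ts_* names are illustrative, not from the diff):

        ts_before_consume = time.monotonic()
        messages = self.consumer.consume(timeout=timeout, num_messages=batch_size)
        ts_consumed = time.monotonic()
        # one explicit timestamp per event, one timing() call per duration
        statsd.timing(JOURNAL_CONSUME_DURATION_METRIC, ts_consumed - ts_before_consume)

        batch_processed, at_eof = self.handle_messages(messages, worker_fn)
        ts_handled = time.monotonic()
        statsd.timing(JOURNAL_HANDLE_DURATION_METRIC, ts_handled - ts_consumed)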
  • Author Maintainer

    In !211 (closed), @olasd wrote:

        statsd.timing/statsd.timed do full histograms. Do we really want to keep bucketed counts for all of these values, or just a running total?

    good question... need to think a bit about this.

        I have a hard time following the timing logic with all the subtractions. I think it would be easier to follow by:

        • keeping explicit timestamps using bare calls to time.monotonic() at specific events
        • calling statsd.timing(METRIC_FOR_OPERATION_BETWEEN_A_AND_B, ts_pointb - ts_pointa) explicitly for each duration we want to track

    I agree

  • vlorentz @vlorentz started a thread on the diff
     274  281       messages = self.consumer.consume(timeout=timeout, num_messages=batch_size)
     275  282       if not messages:
     276  283           continue
          284  +    # report the time we stayed blocked waiting for kafka;
          285  +    # more or less the time it took to have a successful call to consume()
          286  +    statsd.timing(JOURNAL_BLOCKED_DURATION_METRIC, elapsed - last_object_time)
    • I don't understand how this one works. If I'm not mistaken, this only times negligible pure-Python computation:

      On the first loop run, last_object_time is 0 and elapsed is the time it took to get from line 250 to line 258, which is essentially this code:

              start_time = time.monotonic()
              last_object_time = 0
              total_objects_processed = 0
      
              # while True:
                  # timeout for message poll
                  timeout = 1.0
      
                  elapsed = time.monotonic() - start_time

      On other loop runs, elapsed - last_object_time is the time it takes from line 304 to line 258, which would be this code:

                  last_object_time = time.monotonic() - start_time
      
              # while True:
                  # timeout for message poll
                  timeout = 1.0
      
                  elapsed = time.monotonic() - start_time
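
      In other words, measuring the intended "blocked waiting for kafka" interval would mean bracketing the consume() loop directly. A sketch of one way to do it (last_handled is a hypothetical name; this is not the revised code mentioned below):

          last_handled = time.monotonic()
          while True:
              messages = self.consumer.consume(timeout=timeout, num_messages=batch_size)
              if not messages:
                  continue
              # measured from the end of the previous batch, so it also
              # covers the polls that returned no messages
              statsd.timing(JOURNAL_BLOCKED_DURATION_METRIC, time.monotonic() - last_handled)
              batch_processed, at_eof = self.handle_messages(messages, worker_fn)
              last_handled = time.monotonic()
              if at_eof:
                  break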
    • Author Maintainer

      I made a better version of this part but, in the end, it won't be merged: these are not really the metrics we want, and the produced histograms are pretty much useless.

  • Author Maintainer

    Merge request was abandoned

  • closed
