Skip to content
Snippets Groups Projects

core.loader: Log information about origin currently being ingested

Prior to this commit, we "info" logged no meaningful information [1] about the origin being loaded.

  • [1] The only information we have for now is what configuration file we loaded which i don't find satisfying... swh-core!168 (closed)

Test Plan

tox


Migrated from D4012 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Build is green

    Patch application report for D4012 (id=14154)

    Rebasing onto 452fa224...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 45189cc35882d2238be8db5a7e965fe1f414e98c
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Tue Sep 22 17:31:31 2020 +0200
    
        core.loader: Log information about origin currently being ingested
        
        Prior to this commit, we "info" logged no meaningful information about the
        origin being loaded. With the following, we currently will have more context
        about the status, type and origin being loaded.
        
        This concerns the dvcs loaders for now (git, svn, mercurial)

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/301/ for more details.

  • Better log message

  • Build is green

    Patch application report for D4012 (id=14157)

    Rebasing onto b5d607989b...

    First, rewinding head to replay your work on top of it...
    Fast-forwarded diff-target to base-revision-82-!160.
    Changes applied before test

    See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/82/ for more details.

  • Improve log message

  • Build is green

    Patch application report for D4012 (id=14158)

    Rebasing onto 452fa224...

    Current branch diff-target is up to date.
    Changes applied before test
    commit b563bd610e92564eb74d1b8106fe09a07e8acc44
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Tue Sep 22 17:31:31 2020 +0200
    
        core.loader: Log information about origin currently being ingested
        
        Prior to this commit, we "info" logged no meaningful information about the
        origin being loaded. With the following, we currently will have more context
        about the status, type and origin being loaded.
        
        This concerns the dvcs loaders for now (git, svn, mercurial)

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/302/ for more details.

  • I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.

    If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.

  • I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.

    Yes, ok.

    Except that information is not immediately seen in logs output, or is it?

    If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.

    Yes, that sounds like the right place. Although, I don't really know what code to change now. My first thought would be to look into the scheduler though (but not quite sure where in there yet).


    To be clear, my main issue today, when I try to look through our logs to investigate or plain read what's going on (after a deployment for example), I don't have any clues immediately...

    In my mind, the kibana information is not enough by itself, so i think i need to cross information with say sentry to have some more context... It's currently quite frustrating... up to an eventual point of, "oh well, I have some other urgent matters somewhere else..." (sometimes I push through but sometimes, I fail).

  • ! In !160 (closed), @olasd wrote: I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.

    If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.

    +1

  • ok, i'll adapt i would not be against a nudge in the right direction to actually improve the logging

  • Simplify to just one log statement

    (the other improvment should go deeper in the stack)

  • Build is green

    Patch application report for D4012 (id=14223)

    Rebasing onto 7ed1927b...

    First, rewinding head to replay your work on top of it...
    Applying: core.loader: Log information about origin currently being ingested
    Changes applied before test
    commit c5ce6ef3d34d234ea1cb5937a7345c6da2e5fda1
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Sep 24 16:02:46 2020 +0200
    
        core.loader: Log information about origin currently being ingested
        
        Prior to this commit, we "info" logged no meaningful information about the
        origin being loaded. With the following commit, we will be able to tell what
        origin/visit-type is happening.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/303/ for more details.

  • ! In !160 (closed), @ardumont wrote: To be clear, my main issue today, when I try to look through our logs to investigate or plain read what's going on (after a deployment for example), I don't have any clues immediately...

    In my mind, the kibana information is not enough by itself, so i think i need to cross information with say sentry to have some more context... It's currently quite frustrating... up to an eventual point of, "oh well, I have some other urgent matters somewhere else..." (sometimes I push through but sometimes, I fail).

    Yeah, I totally get it.

    Our logging handler swh.core.logger.JournalHandler already knows how to pull some metadata from the celery tasks:

    https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/logger.py;v0.3.0$44-49

    This data appears as extra fields in the logs handled by systemd-journald (you can use, for instance, journalctl -f SWH_TASK_NAME=swh.loader.git.tasks.UpdateGitRepository), and therefore end up as named fields in the messages stored in kibana (I guess the field name ends up lowercased there).

    rDCORE806f583 has dropped task arguments from the default systemd logging because:

    • some tasks have long lists of arguments that cause problems to the logging pipeline (because elasticsearch has a limit of 1000 fields per index, so flattening large argument lists can make it explode)
    • some tasks have arguments that can't be properly serialized at the bottom of the logging pipeline (elasticsearch doesn't like arbitrary bytes)

    I think it would work to hardcode a few task argument fields, now that we've normalized a lot of them:

    • task.kwargs['url'] would be useful for all loaders and some listers
    • task.kwargs['instance'] would be useful for some listers too.

    I've started doing this and documenting it a little better, I'll submit a diff ASAP.

    I would prefer to have a hook in the celery logging infrastructure to add explicit extra fields to the logging calls (because doing it at the handler level makes this only happen with the systemd journald handler, that we have in prod, but it will not happen in other contexts), but I'm not even sure that's doable. Using the "ambient" current_task is a bit ugly but it works...

  • Adapt according to suggestion

  • Build is green

    Patch application report for D4012 (id=14257)

    Rebasing onto 7ed1927b...

    Current branch diff-target is up to date.
    Changes applied before test
    commit bdd29d41e0aed612da349efd10d80c6243e43c85
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Sep 24 16:02:46 2020 +0200
    
        core.loader: Log information about origin currently being ingested
        
        Prior to this commit, we "info" logged no meaningful information about the
        origin being loaded. With the following commit, we will be able to tell what
        origin/visit-type is happening.
        
        This also adds some extra assertion failure message to explicit why the
        self.origin and self.visit should be set and where to look if not.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/304/ for more details.

  • mentioned in merge request swh-core!169 (closed)

  • Merge request was accepted

  • Nicolas Dandrimont approved this merge request

    approved this merge request

  • Our logging handler swh.core.logger.JournalHandler already knows how to pull some metadata from the celery tasks: ...

    Thanks a lot for clarifying that part!

  • Merge request was merged

Please register or sign in to reply
Loading