core.loader: Log information about origin currently being ingested
Prior to this commit, we "info" logged no meaningful information [1] about the origin being loaded.
- [1] The only information we have for now is what configuration file we loaded which i don't find satisfying... swh-core!168 (closed)
Test Plan
tox
Migrated from D4012 (view on Phabricator)
Merge request reports
Activity
Build is green
Patch application report for D4012 (id=14154)
Rebasing onto 452fa224...
Current branch diff-target is up to date.
Changes applied before test
commit 45189cc35882d2238be8db5a7e965fe1f414e98c Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Tue Sep 22 17:31:31 2020 +0200 core.loader: Log information about origin currently being ingested Prior to this commit, we "info" logged no meaningful information about the origin being loaded. With the following, we currently will have more context about the status, type and origin being loaded. This concerns the dvcs loaders for now (git, svn, mercurial)
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/301/ for more details.
Build is green
Patch application report for D4012 (id=14157)
Rebasing onto b5d607989b...
First, rewinding head to replay your work on top of it... Fast-forwarded diff-target to base-revision-82-!160.
Changes applied before test
See https://jenkins.softwareheritage.org/job/DCORE/job/tests-on-diff/82/ for more details.
Build is green
Patch application report for D4012 (id=14158)
Rebasing onto 452fa224...
Current branch diff-target is up to date.
Changes applied before test
commit b563bd610e92564eb74d1b8106fe09a07e8acc44 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Tue Sep 22 17:31:31 2020 +0200 core.loader: Log information about origin currently being ingested Prior to this commit, we "info" logged no meaningful information about the origin being loaded. With the following, we currently will have more context about the status, type and origin being loaded. This concerns the dvcs loaders for now (git, svn, mercurial)
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/302/ for more details.
I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.
If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.
I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.
Yes, ok.
Except that information is not immediately seen in logs output, or is it?
If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.
Yes, that sounds like the right place. Although, I don't really know what code to change now. My first thought would be to look into the scheduler though (but not quite sure where in there yet).
To be clear, my main issue today, when I try to look through our logs to investigate or plain read what's going on (after a deployment for example), I don't have any clues immediately...
In my mind, the kibana information is not enough by itself, so i think i need to cross information with say sentry to have some more context... It's currently quite frustrating... up to an eventual point of, "oh well, I have some other urgent matters somewhere else..." (sometimes I push through but sometimes, I fail).
! In !160 (closed), @olasd wrote: I don't think the origin url and visit type should be sent in the task result; they're arguments of the task already.
If we want them logged by the worker when the task ends (which I agree would be useful), then we should improve logging on the worker/celery side to show some of the task arguments (for instance, if there's a "url" argument) instead / in addition of the task id.
+1
Build is green
Patch application report for D4012 (id=14223)
Rebasing onto 7ed1927b...
First, rewinding head to replay your work on top of it... Applying: core.loader: Log information about origin currently being ingested
Changes applied before test
commit c5ce6ef3d34d234ea1cb5937a7345c6da2e5fda1 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Sep 24 16:02:46 2020 +0200 core.loader: Log information about origin currently being ingested Prior to this commit, we "info" logged no meaningful information about the origin being loaded. With the following commit, we will be able to tell what origin/visit-type is happening.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/303/ for more details.
! In !160 (closed), @ardumont wrote: To be clear, my main issue today, when I try to look through our logs to investigate or plain read what's going on (after a deployment for example), I don't have any clues immediately...
In my mind, the kibana information is not enough by itself, so i think i need to cross information with say sentry to have some more context... It's currently quite frustrating... up to an eventual point of, "oh well, I have some other urgent matters somewhere else..." (sometimes I push through but sometimes, I fail).
Yeah, I totally get it.
Our logging handler
swh.core.logger.JournalHandler
already knows how to pull some metadata from the celery tasks:https://forge.softwareheritage.org/source/swh-core/browse/master/swh/core/logger.py;v0.3.0$44-49
This data appears as extra fields in the logs handled by systemd-journald (you can use, for instance,
journalctl -f SWH_TASK_NAME=swh.loader.git.tasks.UpdateGitRepository
), and therefore end up as named fields in the messages stored in kibana (I guess the field name ends up lowercased there).rDCORE806f583 has dropped task arguments from the default systemd logging because:
- some tasks have long lists of arguments that cause problems to the logging pipeline (because elasticsearch has a limit of 1000 fields per index, so flattening large argument lists can make it explode)
- some tasks have arguments that can't be properly serialized at the bottom of the logging pipeline (elasticsearch doesn't like arbitrary bytes)
I think it would work to hardcode a few task argument fields, now that we've normalized a lot of them:
-
task.kwargs['url']
would be useful for all loaders and some listers -
task.kwargs['instance']
would be useful for some listers too.
I've started doing this and documenting it a little better, I'll submit a diff ASAP.
I would prefer to have a hook in the celery logging infrastructure to add explicit extra fields to the logging calls (because doing it at the handler level makes this only happen with the systemd journald handler, that we have in prod, but it will not happen in other contexts), but I'm not even sure that's doable. Using the "ambient"
current_task
is a bit ugly but it works...Build is green
Patch application report for D4012 (id=14257)
Rebasing onto 7ed1927b...
Current branch diff-target is up to date.
Changes applied before test
commit bdd29d41e0aed612da349efd10d80c6243e43c85 Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org> Date: Thu Sep 24 16:02:46 2020 +0200 core.loader: Log information about origin currently being ingested Prior to this commit, we "info" logged no meaningful information about the origin being loaded. With the following commit, we will be able to tell what origin/visit-type is happening. This also adds some extra assertion failure message to explicit why the self.origin and self.visit should be set and where to look if not.
See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/304/ for more details.
mentioned in merge request swh-core!169 (closed)