
Make container-based services push their logs to the swh log infrastructure

Static systemd services are currently configured with Environment=SWH_LOG_TARGET=journal. When logs are routed to the journald handler, it adds some custom swh-prefixed metadata [1].

New container-based services do not use systemd, so we need a solution that behaves similarly.
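To illustrate what "behaving similarly" could mean without journald, here is a minimal sketch using only the Python standard library: a formatter that writes each record as a JSON line with prefixed metadata keys, so that a log collector could index them. The class name, the `swh_` key names and the choice of JSON on a stream are all assumptions made for illustration; they are not taken from the swh code base or from snippet [1].

```python
import json
import logging
import sys


class PrefixedJsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, prefixed keys."""

    def __init__(self, prefix="swh_"):
        super().__init__()
        self.prefix = prefix

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            # Key names below are illustrative, not the real swh fields.
            self.prefix + "logging_levelname": record.levelname,
            self.prefix + "logging_name": record.name,
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream):
    """Return a logger writing prefixed JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(PrefixedJsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers to avoid duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    # In a container, stdout is what the collector would typically read.
    build_logger("swh.example", sys.stdout).info("task %s done", "load-git")
```

The point of the sketch is only that the metadata travels with each log line instead of being attached by journald; the actual field set would have to match what [1] describes.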

Current work:

  • Adapt swh tools to allow some more log customization through configuration (see previous merge requests)
  • 1. Investigate the opentelemetry solution, as it unifies the management of logs, traces and (prometheus) metrics.
    • Try the solution locally (minikube)
    • Try an equivalent solution in the staging cluster
      • Install a staging elasticsearch instance to avoid polluting the main elasticsearch instance (+ no persistence)
      • Make some workers push their logs to that elasticsearch instance through opentelemetry
        • conclusion: no dynamic index; only a single index is possible, which is limiting (our current setup uses dynamic daily indices)
  • 2. Try using opentelemetry, fluentbit and elasticsearch
    • opentelemetry -> fluentbit: ok
    • blocked: fluentbit -> elasticsearch: not yet working
  • 1. Fall back to solution 1 and let it run to determine whether that is enough
    • swh service: indexed data can be seen in kibana [2]
    • system service: indexed data can be searched in kibana [3]
    • Iterate on the log patterns to parse
  • Determine how to deploy this
    • in each cluster (enable a chart per cluster, which allows index parameters to be set)
    • or in the admin cluster (unclear to me, vince knows more)
  • Use opentelemetry-helm-chart [5]
    • Make it work
    • Adapt the configuration so that pods are not killed for exceeding their memory limit
    • Activate metrics and let it run to determine whether it is worth continuing to use it (it currently uses a lot of memory)
    • Analyze metrics
  • swh/infra/ci-cd/swh-charts!49 (merged): Templatize the iteration work as a cluster-configuration chart (deployment per cluster)
  • Deploy to the production cluster
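
The conclusion of the first experiment mentions that the current setup relies on dynamic daily indices. As a rough sketch of what that means, the function below derives one index name per UTC day from a prefix and a timestamp. The `<prefix>-YYYY.MM.DD` pattern is an assumption (a common Elasticsearch convention), not necessarily the naming used on the swh instance, and `daily_index_name` is a hypothetical helper.

```python
from datetime import datetime, timezone


def daily_index_name(prefix, when):
    """Return the index name for the UTC day containing `when`.

    `when` must be a timezone-aware datetime. The naming pattern is an
    assumed convention, used here for illustration only.
    """
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    day = when.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return prefix + "-" + day


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(daily_index_name("staging-logs", now))
```

With a single static index, by contrast, every document lands in the same place regardless of its timestamp, which makes per-day retention or cleanup harder.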

[1] https://gitlab.softwareheritage.org/swh/meta/-/snippets/1436

[2] (idx: staging-logs) http://kibana0.internal.softwareheritage.org:5601/goto/655500e1078b6b56e4f1be24968780ae

[3] (idx: staging-system-logs) http://kibana0.internal.softwareheritage.org:5601/goto/3067b92e9bbeec627fa2e3b906dc4ea3

[4]

$ date; curl -s ${ES_SERVER}/_cat/indices | grep staging
Fri 21 Apr 2023 03:55:56 PM CEST
green open  staging-system-logs                 xLvGyopTSaqjuwh5C0iFoQ 1 1 11957486      0   9.7gb   4.4gb
green open  staging-logs                        AYiBFvYmQtCN8UdnA5KexQ 1 1   115878      0 154.7mb  59.6mb
$ date; curl -s ${ES_SERVER}/_cat/indices | grep staging
Fri 12 May 2023 11:49:43 AM CEST
green open  staging-system-logs                 xLvGyopTSaqjuwh5C0iFoQ 1 1 50884569      0   36.1gb  16.7gb  # system logs
green open  staging-swh-logs                    GAAlhiCFR5OU1LDTlUc4mw 1 1   259133      0  164.8mb    84mb  # hit and miss setup
green open  staging-logs                        AYiBFvYmQtCN8UdnA5KexQ 1 1 12110295      0    6.9gb   3.5gb  # swh logs
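
To compare the two snapshots above, a small sketch can parse the `_cat/indices` lines and compute how many documents each index gained. It assumes the default column order of `_cat/indices` (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size); the helper names are hypothetical and the sample lines are copied from the output above.

```python
def parse_cat_indices(text):
    """Map index name -> document count from `_cat/indices` output lines."""
    counts = {}
    for line in text.splitlines():
        # Drop trailing "# ..." annotations, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) != 10:  # skip anything that is not an index row
            continue
        counts[fields[2]] = int(fields[6])  # index name, docs.count
    return counts


def docs_growth(before, after):
    """Document-count delta for indices present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}


APRIL_21 = """\
green open staging-system-logs xLvGyopTSaqjuwh5C0iFoQ 1 1 11957486 0 9.7gb 4.4gb
green open staging-logs AYiBFvYmQtCN8UdnA5KexQ 1 1 115878 0 154.7mb 59.6mb
"""

MAY_12 = """\
green open staging-system-logs xLvGyopTSaqjuwh5C0iFoQ 1 1 50884569 0 36.1gb 16.7gb  # system logs
green open staging-swh-logs GAAlhiCFR5OU1LDTlUc4mw 1 1 259133 0 164.8mb 84mb  # hit and miss setup
green open staging-logs AYiBFvYmQtCN8UdnA5KexQ 1 1 12110295 0 6.9gb 3.5gb  # swh logs
"""

growth = docs_growth(parse_cat_indices(APRIL_21), parse_cat_indices(MAY_12))

if __name__ == "__main__":
    for name, delta in sorted(growth.items()):
        print(name, delta)
```

Over the three weeks between the snapshots, this gives about 38.9 million new documents for staging-system-logs and about 12.0 million for staging-logs; staging-swh-logs is ignored because it only appears in the second snapshot.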

[5] https://github.com/open-telemetry/opentelemetry-helm-charts


Migrated from T4524 (view on Phabricator)

Edited by Antoine R. Dumont