Milestone: Dynamic infrastructure [Roadmap - Tooling and infrastructure]
Status: Open
- Lead: vsellier
- Priority: high
- Effort: 2 PM
Description:
Set up a dynamically scalable infrastructure for Software Heritage services.
Includes work:
- Set up an elastic workers infrastructure
- Configure Kubernetes clusters
- Monitoring/alerting solution for container-based services
- Ingest the logs of the dynamic components into the current ELK infrastructure
KPIs:
- Dashboard displaying the status of the dynamic components:
  - Number of listers running
  - Number of loaders running
  - RPC services status
- Logs ingested and correctly parsed in Kibana
- Clusters fully backed up
Unstarted Issues (open and unassigned): 4
- sysadm-environment · Federate prometheus instances through thanos
- sysadm-environment · [dynamic infra] Implement the loader watchdogs to restart the services when they get stuck
- sysadm-environment · [dynamic infra] Correctly manage Save Code Now workers autoscaling
- Meta · Setup alerting/monitoring tools for the elastic stack
Ongoing Issues (open and assigned): 2
Completed Issues (closed): 44
- sysadm-environment · [dynamic infra] Some git loaders get stuck during the repository cloning step
- sysadm-environment · [rancher] Add 2 management nodes for the archive-production-rke2 cluster
- sysadm-environment · [rancher] Add 2 management nodes for the archive-staging-rke2 cluster
- sysadm-environment · Back up Rancher's k8s etcd to MinIO
- sysadm-environment · Dynamic infrastructure
- sysadm-environment · Make container-based services push their logs to the swh log infrastructure
- sysadm-environment · Upgrade ArgoCD to 2.10.1
- sysadm-environment · Provision persistent ceph storage for the rke2 clusters
- sysadm-environment · AddForgeNow emails not handled by the webapp
- sysadm-environment · Migrate banco static objstorage to a dynamic objstorage instance
- sysadm-environment · Production is (very) close to the current default kubelet limit of 110 pods per node
- sysadm-environment · Deploy storage read-write rpc to dynamic infrastructure
- sysadm-environment · Adapt read-only objstorage instance to run on saam as a rancher agent
- sysadm-environment · production: Migrate vault workload to dynamic infrastructure
- sysadm-environment · Publicly expose the dynamic production webapp (aka switch from the moma webapp to the dynamic infra webapp)
- sysadm-environment · production: Deploy deposit instance in elastic infra
- sysadm-environment · production: Deploy webapp (& dependent read-only services) to dynamic infra
- sysadm-environment · staging: Migrate swh.counter rpc to dynamic infra
- Helm charts for swh packages · staging: Migrate swh.counter rpc to dynamic infra
- sysadm-environment · Migrate remaining *storage services to staging dynamic infra
- sysadm-environment · staging: Deploy deposit rpc service
- sysadm-environment · production: Deploy rpc services to dynamic infrastructure
- sysadm-environment · Migrate azure workload (indexer, cooker) to aks (kubernetes in azure)
- sysadm-environment · [dynamic infra] Increase number of local path provisioner workers
- Helm charts for swh packages · Adapt checksum computation to limit the impact of the updated object
- swh-environment · [swh-charts] Adapt checksum computation to limit the impact of the updated object
- sysadm-environment · [alertmanager] [Blocked] Update the configuration to be able to receive alerts from namespaces other than cattle-monitoring-system
- sysadm-environment · [Hardware] Install the metal03 production compute node
- sysadm-environment · [alertmanager] Route InfoInhibitor alert to a null receiver
- sysadm-environment · [terraform/rancher] Migrate monitoring app deployment from terraform to swh-chart
- sysadm-environment · [archive-staging-rke2] Unstable prometheus
- sysadm-environment · [dynamic Infra] Add an upper memory limit on loader deployments
- sysadm-environment · [rancher] staging and production clusters admin regularly crash
- sysadm-environment · [dynamic infra] archive-production-rke2 prometheus regularly restarts
- sysadm-environment · [Dynamic infra] Configure pods priority
- puppet-swh-site · [k8s staging] Some pods are regularly evicted or in error
- sysadm-environment · [thanos] archive-staging-rke2 datastore store is not available
- sysadm-environment · staging: Deploy rpc services to dynamic infrastructure
- sysadm-environment · Deploy django applications (deposit, webapp) with swh-charts
- sysadm-environment · Specify image cleanup threshold on rancher rke2 cluster
- sysadm-environment · Migrate rke "admin" cluster to rke2 cluster
- sysadm-environment · Rancher has been unstable for a couple of days
- Meta · Specify and setup proper log management
- sysadm-environment · Test RKE2 cluster for new nodes