Upgrade Rancher to 2.7.x
We are quite behind on the Rancher version.
We should upgrade to a more recent release, which will allow us to upgrade the admin cluster to Kubernetes 1.24+ and then test the same upgrade on the other clusters.
The changelogs should be checked carefully to detect any issue we could hit.
(The current version of Rancher is 2.6.9.)
- Vincent Sellier changed milestone to %MRO 2023
- Vincent Sellier added kubernetes rancher upgrade labels
- Vincent Sellier assigned to @vsellier
- Vincent Sellier mentioned in commit vsellier/swh-docs@217c0fd0
- Vincent Sellier marked this issue as related to #4460 (closed)
- Vincent Sellier mentioned in merge request swh/devel/swh-docs!374 (merged)
- Vincent Sellier (Author, Owner)
Nothing blocking seems to be tracked in the changelogs:
- v2.6.10: https://github.com/rancher/rancher/releases/tag/v2.6.10
- ok
- cert-manager must be > 0.9.1 (1.10.0)
- v2.6.11: https://github.com/rancher/rancher/releases/tag/v2.6.11
- ok
- Fixes the CPU consumption issue on RKE1 / 1.24 clusters
- v2.6.12: https://github.com/rancher/rancher/releases/tag/v2.6.12
- ok
- v2.6.13: https://github.com/rancher/rancher/releases/tag/v2.6.13
- ok
- v2.7.0: https://github.com/rancher/rancher/releases/tag/v2.7.0
- ok
- v2.7.1: https://github.com/rancher/rancher/releases/tag/v2.7.1
- ok
- v2.7.2: https://github.com/rancher/rancher/releases/tag/v2.7.2
- ok
- One warning applies if a rollback to a version < v2.7.2 is done
- v2.7.3: https://github.com/rancher/rancher/releases/tag/v2.7.3
- ok (security fix release)
- v2.7.4: https://github.com/rancher/rancher/releases/tag/v2.7.4
- ok (security fix release)
- v2.7.5: https://github.com/rancher/rancher/releases/tag/v2.7.5
- ok
- Autodetects and adapts according to the Kubernetes version of the cluster Rancher is running on (< 1.25 or >= 1.25)
It seems we could perform a direct upgrade to v2.7.5.
The Rancher backup is up-to-date and uploaded to MinIO. I will manually retrieve the last backups of each Rancher cluster and test the restore in a local cluster.
It also seems the Argo CD auto-heal (self-heal) option should be disabled. It's useful to know if something changed in the cluster, but it's not usable to perform the upgrade through Argo CD, as it uses a Rancher URL which will (probably) stop replying during the upgrade.
Let's rehearse this locally :)
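A minimal sketch of how the automated sync/self-heal could be switched off before the upgrade, assuming the Rancher deployment is driven by an Argo CD `Application` named `rancher` in the `argocd` namespace (both names are assumptions):

```
# Temporarily drop the automated sync policy (and thus self-heal) on the
# hypothetical "rancher" Application so Argo CD does not fight the upgrade
kubectl --namespace argocd patch application rancher \
  --type merge \
  --patch '{"spec":{"syncPolicy":{"automated":null}}}'
```

It can be re-enabled the same way after the upgrade by restoring the original `syncPolicy` block.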
- Vincent Sellier (Author, Owner)
Before starting the Rancher upgrade, an upgrade of the node image was launched.
- Vincent Sellier (Author, Owner)
Upgrade of the node image done.
- Vincent Sellier (Author, Owner)
For the record, the etcd snapshots of the RKE2 clusters are stored in /var/lib/rancher/rke2/server/db/snapshots:

```
root@rancher-node-staging-rke2-mgmt1:/var/lib# ls -al /var/lib/rancher/rke2/server/db/snapshots/
total 97038
drwx------ 2 root root       11 Aug  9 16:01 .
drwx------ 4 root root        4 Aug  8 13:05 ..
-rw------- 1 root root 33447968 Aug  8 20:00 etcd-snapshot-rancher-node-staging-rke2-mgmt1-1691524800
-rw------- 1 root root 33665056 Aug  9 00:00 etcd-snapshot-rancher-node-staging-rke2-mgmt1-1691539200
-rw------- 1 root root 33828896 Aug  9 05:00 etcd-snapshot-rancher-node-staging-rke2-mgmt1-1691557200
-rw------- 1 root root 42254368 Aug  9 10:00 etcd-snapshot-rancher-node-staging-rke2-mgmt1-1691575200
-rw------- 1 root root 42254368 Aug  9 15:00 etcd-snapshot-rancher-node-staging-rke2-mgmt1-1691593200
-rw------- 1 root root 29945888 Jun  6 08:22 on-demand-rancher-node-staging-rke2-mgmt1-1686039753
-rw------- 1 root root 29917216 Jun  6 13:19 on-demand-rancher-node-staging-rke2-mgmt1-1686057540
-rw------- 1 root root 29974560 Jun  6 14:58 on-demand-rancher-node-staging-rke2-mgmt1-1686063488
-rw------- 1 root root 42254368 Aug  9 16:01 on-demand-rancher-node-staging-rke2-mgmt1-1691596905
```

and in /opt/rke/etcd-snapshots for the RKE cluster:

```
root@rancher-node-admin-mgmt1:/opt/rke/etcd-snapshots# ls -al
total 164108
drwxr-xr-x 2 root root     4096 Aug  9 16:01 .
drwxr-xr-x 3 root root     4096 Nov  7  2022 ..
-rw------- 1 root root 10111498 Nov 27  2022 2022-11-27T22:18:57Z_etcd.zip
-rw------- 1 root root 10066351 Nov 28  2022 2022-11-28T10:18:57Z_etcd.zip
-rw------- 1 root root 10135317 Nov 28  2022 2022-11-28T22:18:57Z_etcd.zip
-rw------- 1 root root 10168034 Nov 29  2022 2022-11-29T10:18:57Z_etcd.zip
-rw------- 1 root root 10376599 Nov 29  2022 2022-11-29T22:18:57Z_etcd.zip
-rw------- 1 root root 10305302 Nov 30  2022 2022-11-30T10:18:57Z_etcd.zip
-rw------- 1 root root 14854502 Aug  9 16:01 c-q2wd4-ml-69vhw_2023-08-09T16:01:29Z.zip
-rw------- 1 root root 15387567 Aug  6 20:20 c-q2wd4-rl-2k8jv_2023-08-06T20:20:04Z.zip
-rw------- 1 root root 15083289 Aug  8 08:35 c-q2wd4-rl-2tqqp_2023-08-08T08:35:04Z.zip
-rw------- 1 root root 14893920 Aug  9 08:45 c-q2wd4-rl-pfwvx_2023-08-09T08:45:04Z.zip
-rw------- 1 root root 15860287 Aug  7 08:25 c-q2wd4-rl-vp57m_2023-08-07T08:25:04Z.zip
-rw------- 1 root root 15085677 Aug  8 20:40 c-q2wd4-rl-vrb4d_2023-08-08T20:40:04Z.zip
-rw------- 1 root root 15682621 Aug  7 20:30 c-q2wd4-rl-wnbkg_2023-08-07T20:30:04Z.zip
```
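For an extra safety net right before the upgrade, an on-demand snapshot can also be taken on an RKE2 server node; a sketch (the snapshot name is arbitrary):

```
# Run on an RKE2 server node; the snapshot lands in
# /var/lib/rancher/rke2/server/db/snapshots by default
rke2 etcd-snapshot save --name pre-rancher-2.7-upgrade
```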
- Vincent Sellier (Author, Owner)
Results of the local tests:
- The upgrade is performed as a rolling upgrade, so it should be OK to do it via Argo CD.
- A local restore of a backup is not stable, because a couple of checks are done against the downstream clusters. As the local installation can't be reached by the clusters, they are removed from the interface.
- The local migration tests done:
  - Install Rancher on a Kubernetes 1.24.6 cluster
  - Create an RKE cluster and an RKE2 cluster (with no nodes)
  - Install the monitoring application
  - Migrate to Rancher 2.7.5
  - Upgrade the monitoring application
It's not a 100% bulletproof test, but at least it seems to work.
For the record, Rancher 2.6.9 refuses to be installed on a cluster >= 1.25.0. The recommended Kubernetes versions for 2.7.5 are > 1.23 and <= 1.26: https://www.suse.com/suse-rancher/support-matrix/all-supported-versions/rancher-v2-7-5/
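For reference, the direct upgrade boils down to a Helm chart bump; a sketch, assuming Rancher was installed from the rancher-stable repository into the cattle-system namespace with a release named rancher (in our setup the same change goes through the Argo CD application instead):

```
# Refresh the chart index and bump the Rancher release to 2.7.5,
# keeping the values used for the current install
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm repo update
helm upgrade rancher rancher-stable/rancher \
  --namespace cattle-system \
  --version 2.7.5 \
  --reuse-values
```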
- Vincent Sellier (Author, Owner)
After fighting with the backup, I finally succeeded in restoring it locally.
The problem was https://github.com/rancher/backup-restore-operator/pull/367. After forging a new backup archive from the last backup, the restore runs to the end.
```
# Extract the last backup, drop the problematic cluster-fleet-default-* objects
# (see the PR above), and rebuild a clean archive
$ tar xvzf ../recurring-to-minio-df22cac6-f1a2-4bc9-9a89-068b733a848a-2023-08-09T22-00-00Z.tar.gz
$ find . -name "cluster-fleet-default-*" | xargs rm -rf
$ tar cvzf ../fixed.tar.gz *
```
The migration was retested with the full data and everything looks correct (up to the communication with the clusters).
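For the record, the restore itself is triggered through the backup-restore-operator's Restore resource; a sketch, assuming the fixed archive was uploaded as fixed.tar.gz to the storage location configured for the operator (file name and prune setting are assumptions):

```
# Ask the rancher backup-restore-operator to restore from the fixed archive
kubectl apply -f - <<'EOF'
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-fixed-backup
spec:
  backupFilename: fixed.tar.gz
  prune: false
EOF
```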
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@f5d5d93b
- Vincent Sellier (Author, Owner)
The server was upgraded correctly; the admin, staging and production Rancher components were automatically updated too.
Everything looks stable so far.
Let's upgrade the monitoring app now to remove the version compatibility alert:
```
cattle-system/rancher-7b7d55569f-gqlvf[rancher]: W0810 13:16:55.477343 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
```
- Vincent Sellier (Author, Owner)
The fix was done in https://github.com/prometheus-community/helm-charts/commit/efe34752da5e60438d2bace7ef84a0e8a7b852f1 and released in prometheus-stack 42.0.0 (the currently deployed version is 41.4.0).
- Vincent Sellier (Author, Owner)
The last available version of the prometheus-stack helm chart is 48.3.1. Unfortunately, starting from 45.11.0, a seccompProfile is used in the chart configuration; it is only supported by default on 1.25+ clusters (SeccompDefault feature gate: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/).
The simplest solution is to update to 45.10.1 (the removal of the PodSecurityPolicy was done in 41.9.1). It will be possible to upgrade to a more recent version once the clusters are updated to 1.25+.
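A sketch of what the pin to 45.10.1 looks like when done directly with Helm (release and namespace names are assumptions; in practice the version is bumped in the Argo CD application definition in k8s-clusters-conf):

```
# Bump the kube-prometheus-stack release to the last version that still
# works on a < 1.25 cluster, keeping the existing values
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version 45.10.1 \
  --reuse-values
```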
- Vincent Sellier (Author, Owner)
The CRDs were manually updated as explained in the kube-prometheus-stack upgrade notes:

```
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagerconfigs.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_probes.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply --force-conflicts --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.63.0/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
```
- The Argo CD application was manually synced with force and replace activated because, due to the CRD change, the diff between 41.x and 45.x was not working and the sync was failing.
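A sketch of that sync with the Argo CD CLI, assuming the application is named kube-prometheus-stack (hypothetical name; the same options exist in the web UI):

```
# Force the sync and use replace instead of apply so the large CRDs
# do not hit the apply/diff limitations
argocd app sync kube-prometheus-stack --force --replace
```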
- Vincent Sellier mentioned in commit swh/infra/ci-cd/k8s-clusters-conf@a39f1518
- Vincent Sellier (Author, Owner)
The pod security template related to the prometheus operator is not present anymore.
Old Calico and Tigera operator PSPs still remain, but they seem to be unused:
```
kubectl --context euwest-rancher get pods \
  --all-namespaces \
  --output jsonpath='{.items[*].metadata.annotations.kubernetes\.io\/psp}' \
  | tr " " "\n" | sort -u | wc -l
0
```
On a fresh install, these PSPs are not created, so they look like leftovers from an old installation. Let's try to remove them (a backup is attached to this comment).
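A sketch of how the leftover PSPs can be exported before deleting them (the Tigera PSP names are not listed in this issue, so the delete line only shows the Calico ones from the listing below):

```
# Keep a YAML copy of every remaining PSP, then delete the leftover ones
kubectl --context euwest-rancher get podsecuritypolicies -o yaml > psp-backup.yaml
kubectl --context euwest-rancher delete podsecuritypolicy \
  calico-kube-controllers calico-node calico-typha
```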
- Vincent Sellier (Author, Owner)
The Tigera PSPs were correctly removed, but the Calico ones are automatically recreated:
```
kubectl --context euwest-rancher get podsecuritypolicies
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME                      PRIV    CAPS   SELINUX    RUNASUSER          FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
calico-kube-controllers   false          RunAsAny   MustRunAsNonRoot   MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
calico-node               true           RunAsAny   RunAsAny           MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim,hostPath
calico-typha              false          RunAsAny   MustRunAsNonRoot   MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
```
Calico is managed by AKS; let's dig into this.
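A quick way to confirm that Calico is indeed the AKS-managed network policy addon (a sketch; resource group and cluster names are placeholders):

```
# Shows "calico" when the AKS-managed Calico network policy is enabled
az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query networkProfile.networkPolicy \
  --output tsv
```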
- Vincent Sellier (Author, Owner)
It looks like this case is correctly handled by the Azure migration: https://github.com/Azure/AKS/issues/3394
I guess the only way to test is to try the upgrade to 1.25...
- Vincent Sellier (Author, Owner)
Interesting page on migrating away from PSPs: https://kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/
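The replacement described there is the Pod Security Admission controller, configured per namespace with labels; a minimal sketch with a hypothetical namespace name:

```
# Enforce the "baseline" Pod Security Standard and warn about "restricted"
# violations in a given namespace (replaces what the PSPs used to enforce)
kubectl label namespace my-namespace \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted
```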
- Vincent Sellier (Author, Owner)
Another interesting message during the migration preparation in the Azure UI:
"Kubernetes removed objects from some API groups between versions 1.24.6 and 1.25.11. If you have resources using the API groups below, consider migrating them to the new API groups ahead of time to avoid any conflict. Learn more."
- CronJob - batch/v1beta1
- EndpointSlice - discovery.k8s.io/v1beta1
- Event - events.k8s.io/v1beta1
- HorizontalPodAutoscaler - autoscaling/v2beta1
- PodDisruptionBudget - policy/v1beta1
- PodSecurityPolicy - policy/v1beta1
- RuntimeClass - node.k8s.io/v1beta1
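A rough CLI check for this (a sketch; it only confirms whether objects of the affected kinds exist at all, not which API version their manifests request):

```
# List objects of the kinds whose beta APIs disappear in 1.25;
# anything found should have its manifests reviewed before the upgrade
for kind in cronjobs horizontalpodautoscalers poddisruptionbudgets podsecuritypolicies runtimeclasses; do
  echo "== ${kind}"
  kubectl --context euwest-rancher get "${kind}" --all-namespaces
done
```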
- Vincent Sellier (Author, Owner)
None of these are used in the cluster (except the remaining Calico PSPs).
- Vincent Sellier (Author, Owner)
Upgrade to 1.25 done correctly.
Next step: 1.25.11 -> 1.26.6
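The upgrades are driven from the Azure UI; for reference, a CLI equivalent sketch with placeholder resource group and cluster names:

```
# Check the versions the cluster can move to, then trigger the upgrade
az aks get-upgrades --resource-group <resource-group> --name <cluster-name> --output table
az aks upgrade \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --kubernetes-version 1.26.6
```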
- Vincent Sellier (Author, Owner)
Upgrade to 1.26 done.
Not upgrading to 1.27, as the recommended Kubernetes version for Rancher is 1.26.
- Vincent Sellier closed
- Vincent Sellier mentioned in issue #4999 (closed)
- Vincent Sellier marked this issue as related to #4999 (closed)
- Vincent Sellier added 4h of time spent at 2023-08-09