Upgrade k8s on AKS clusters
Here are the current kubernetes versions of our AKS clusters:
ᐅ az aks list | jq -r '.[]| "\(.name) \(.kubernetesVersion) \(.resourceGroup)"' | \
awk 'BEGIN{format="%-25s %-20s %-15s\n";
printf format,"Cluster Name","Kubernetes Version", "Resource Group";
printf format,"---","---","---"}
{printf format,$1,$2,$3}'
Cluster Name              Kubernetes Version   Resource Group
---                       ---                  ---
euwest-gitlab-staging     1.26.10              euwest-gitlab-staging
euwest-rancher            1.26.6               euwest-rancher
euwest-gitlab-production  1.26.10              euwest-gitlab-production
Version 1.26 has reached its end of life; see the AKS Kubernetes release calendar.
I'm not sure of the target version.
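A possible way to narrow it down is to ask AKS which upgrade targets it currently offers for a given cluster; a sketch against the staging cluster (the output depends on the AKS release calendar for the region):
ᐅ az aks get-upgrades --resource-group euwest-gitlab-staging \
     --name euwest-gitlab-staging --output table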
Activity
- Guillaume Samson added kubernetes label
- Guillaume Samson assigned to @guillaume
- Guillaume Samson changed the description
- Owner (thread resolved by Guillaume Samson)
Rancher 2.8.x supports up to kubernetes 1.28.
Our GitLab operator version (1.1.1) claims support for up to kubernetes 1.29.
So the target is 1.28.x for the Rancher cluster and 1.29.x for the GitLab clusters.
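To double-check which of those releases AKS actually ships, the versions available per region can be listed; a quick sketch (the westeurope location is an assumption based on the euwest-* cluster naming):
ᐅ az aks get-versions --location westeurope --output table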
- Author Owner
Rancher cluster upgraded to 1.28.9:
ᐅ az aks nodepool get-upgrades --resource-group euwest-rancher \
     --nodepool-name default \
     --cluster-name euwest-rancher --output table
KubernetesVersion    LatestNodeImageVersion                      Name     OsType    ResourceGroup
-------------------  ----------------------------------------  -------  --------  ---------------
1.28.9               AKSUbuntu-2204gen2containerd-202406.07.0  default  Linux     euwest-rancher
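For the record, the upgrade command itself is not captured in this log; on AKS it presumably looked something like the following, which upgrades the control plane and the default node pool together:
ᐅ az aks upgrade --resource-group euwest-rancher \
     --name euwest-rancher --kubernetes-version 1.28.9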
- Author Owner
Staging GitLab cluster upgraded to 1.29.4:
ᐅ az aks nodepool get-upgrades --resource-group euwest-gitlab-staging \
     --nodepool-name default \
     --cluster-name euwest-gitlab-staging --output table
KubernetesVersion    LatestNodeImageVersion                      Name     OsType    ResourceGroup
-------------------  ----------------------------------------  -------  --------  ---------------------
1.29.4               AKSUbuntu-2204gen2containerd-202406.07.0  default  Linux     euwest-gitlab-staging
The kubernetes role is no longer shown by kubectl:
ᐅ kb --context euwest-gitlab-staging get nodes
NAME                              STATUS   ROLES    AGE     VERSION
aks-default-31796401-vmss0000as   Ready    <none>   17m     v1.29.4
aks-default-31796401-vmss0000fl   Ready    <none>   15m     v1.29.4
aks-default-31796401-vmss0000gk   Ready    <none>   12m     v1.29.4
aks-default-31796401-vmss0000gw   Ready    <none>   3m17s   v1.29.4
ᐅ kb --context euwest-gitlab-staging describe nodes aks-default-31796401-vmss0000as | \
  awk '/.*role=.*agent.*/'
kubernetes.azure.com/role=agent
ᐅ kb --context euwest-gitlab-production describe nodes aks-default-31036454-vmss00007m | \
  awk '/.*role=.*agent.*/'
kubernetes.azure.com/role=agent
kubernetes.io/role=agent
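kubectl derives the ROLES column from labels prefixed with node-role.kubernetes.io/ (plus the legacy kubernetes.io/role label); the not-yet-upgraded production node still carries kubernetes.io/role=agent, while the recreated staging nodes only have the AKS-specific kubernetes.azure.com/role=agent, which kubectl ignores. If the column matters, a role label can be re-added by hand; this is purely cosmetic, not managed by AKS, and will disappear whenever the node is recreated:
ᐅ kb --context euwest-gitlab-staging label node aks-default-31796401-vmss0000as \
     node-role.kubernetes.io/agent=agent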
There will be service interruptions when upgrading k8s on the production GitLab cluster; during the staging upgrade:
- every time the postgresql, redis, ... pods were drained and recreated on new nodes, pulling their images took some time;
- there were several scheduling errors due to insufficient memory, and the Azure node pool autoscaler took a long time to add nodes (a sketch for watching this during the production upgrade follows the event below).
Warning FailedScheduling 7s default-scheduler 0/3 nodes are available: 2 Insufficient memory, 3 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
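A minimal sketch for keeping an eye on this while the production upgrade runs (kb is assumed to be an alias for kubectl, as in the outputs above):
ᐅ kb --context euwest-gitlab-production get pods -A --field-selector status.phase=Pending
ᐅ kb --context euwest-gitlab-production get events -A \
     --field-selector reason=FailedScheduling --sort-by .lastTimestamp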
- Guillaume Samson added 2h of time spent
- Owner
Feel free to bump the size of the production node pool before launching the upgrade(s) to avoid the memory issue.
The disruptiveness is kind of unavoidable with statefulsets.
- Author Owner
This should avoid the memory issue on GitLab production during the k8s upgrades:
ᐅ az aks show --resource-group euwest-gitlab-production \
     --name euwest-gitlab-production --query agentPoolProfiles | \
  jq -r '.[] | "count: \(.count)\nenableAutoScaling: \(.enableAutoScaling)\nmaxCount: \(.maxCount)\nminCount: \(.minCount)"'
count: 5
enableAutoScaling: true
maxCount: 6
minCount: 5
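The bump itself is not shown above; with the AKS CLI it would presumably be done along these lines (pool name and bounds taken from the output above, the exact invocation is an assumption):
ᐅ az aks nodepool update --resource-group euwest-gitlab-production \
     --cluster-name euwest-gitlab-production --name default \
     --update-cluster-autoscaler --min-count 5 --max-count 6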
- Author Owner
Production GitLab cluster upgraded to 1.29.4:
ᐅ az aks nodepool get-upgrades --resource-group euwest-gitlab-production \
     --nodepool-name default \
     --cluster-name euwest-gitlab-production --output table
KubernetesVersion    LatestNodeImageVersion                      Name     OsType    ResourceGroup
-------------------  ----------------------------------------  -------  --------  ------------------------
1.29.4               AKSUbuntu-2204gen2containerd-202406.07.0  default  Linux     euwest-gitlab-production
ᐅ kb --context euwest-gitlab-production get nodes
NAME                              STATUS   ROLES    AGE   VERSION
aks-default-31036454-vmss00007m   Ready    <none>   46m   v1.29.4
aks-default-31036454-vmss0000av   Ready    <none>   39m   v1.29.4
aks-default-31036454-vmss0000b6   Ready    <none>   33m   v1.29.4
aks-default-31036454-vmss0000c5   Ready    <none>   30m   v1.29.4
aks-default-31036454-vmss0000c8   Ready    <none>   28m   v1.29.4
Autoscaling parameters:
ᐅ az aks show --resource-group euwest-gitlab-production \
     --name euwest-gitlab-production --query agentPoolProfiles | \
  jq -r '.[] | "count: \(.count)\nenableAutoScaling: \(.enableAutoScaling)\nmaxCount: \(.maxCount)\nminCount: \(.minCount)"'
count: 5
enableAutoScaling: true
maxCount: 6
minCount: 4
- Guillaume Samson added 2h of time spent
- Guillaume Samson mentioned in commit swh-sysadmin-provisioning@52cb1100
- Author Owner
There is a deprecation warning on terraform plan:
No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.
╷
│ Warning: Argument is deprecated
│
│   with module.gitlab-staging.module.gitlab_aks_cluster.azurerm_monitor_diagnostic_setting.diagnostic,
│   on modules/kubernetes/main.tf line 54, in resource "azurerm_monitor_diagnostic_setting" "diagnostic":
│   54: resource "azurerm_monitor_diagnostic_setting" "diagnostic" {
│
│ `retention_policy` has been deprecated in favor of `azurerm_storage_management_policy` resource - to learn more https://aka.ms/diagnostic_settings_log_retention
│
│ (and 17 more similar warnings elsewhere)
╵
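Before cleaning this up, the deprecated retention_policy blocks can be located across the terraform modules (the modules/ path is taken from the warning above):
ᐅ grep -rn retention_policy modules/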
- Guillaume Samson closed