OpenShift (OKD) cluster recovery when master nodes are down due to expired apiserver certificates
This guide helps you recover a cluster when the kubelet is down and/or the apiserver DaemonSets are failing because of expired certificates. The main symptom is that all master nodes are NotReady and the cluster is down.
Detect the symptom
- Log in to one of the master nodes
- View pod status
    crictl pods
- View the pod logs
    crictl logs -f ${POD_ID} 2>&1
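The checks above can be combined into a small helper. This is only a sketch: the function name `pod_logs` and the choice of 20 log lines are made up for illustration, and it assumes that `crictl logs` is given a container ID, which is looked up per pod with `crictl ps --pod`.

```shell
# Hypothetical helper (not part of the original guide): print the last
# log lines of every container that belongs to pods matching a name pattern.
pod_logs() {
  pattern="$1"
  # List the IDs of all pods whose name matches the pattern.
  for pod_id in $(crictl pods --name "$pattern" -q); do
    # List all containers (running or exited) of that pod.
    for ctr_id in $(crictl ps -a --pod "$pod_id" -q); do
      echo "== pod ${pod_id} container ${ctr_id}"
      crictl logs --tail 20 "$ctr_id" 2>&1
    done
  done
}

# Example call on a master node:
#   pod_logs kube-apiserver | grep -i x509
```

Piping the result through `grep -i x509` is a quick way to see whether certificate errors are the cause.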
Symptom: Login to the cluster not possible via oauth-openshift
It may happen that even oauth-openshift is not operational and responds with 500 Internal Server Error. In that case, every credential-based login is impossible.
- Log in to one of the master nodes
- Export the fallback kubecontext
    export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
- Test cluster access
    oc get nodes
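A minimal sketch of the same two steps as one function, which fails early when the recovery kubeconfig is missing. The function name `use_recovery_kubeconfig` is hypothetical; the path is the one given in the step above.

```shell
# Path taken from the step above.
RECOVERY_KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig

# Hypothetical helper: export the given kubeconfig and test cluster access.
use_recovery_kubeconfig() {
  kubeconfig="$1"
  if [ ! -r "$kubeconfig" ]; then
    echo "kubeconfig not readable: $kubeconfig" >&2
    return 1
  fi
  export KUBECONFIG="$kubeconfig"
  oc get nodes
}

# Example call on a master node:
#   use_recovery_kubeconfig "$RECOVERY_KUBECONFIG"
```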
Symptom: apiserver: Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid
You can usually resolve this by approving the pending certificate signing requests (CSRs).
- Log in to the cluster context
- View open certificate requests
    oc get csr
- Approve all outstanding certificates
    oc adm certificate approve ${CSR_NAME}
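Approving the requests one by one gets tedious when many are pending. The following is a sketch, assuming that a CSR without a `status` field is still pending; the function name `approve_pending_csrs` is made up for illustration.

```shell
# Hypothetical helper: approve every CSR that has no status yet (pending).
approve_pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' |
  while read -r csr; do
    # Skip empty lines, approve everything else by name.
    if [ -n "$csr" ]; then
      oc adm certificate approve "$csr"
    fi
  done
}

# Example call:
#   approve_pending_csrs
```

New requests may appear after the first batch has been approved, so it can be worth checking `oc get csr` again and repeating the call.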
Symptom: kubelet certificates outdated
You can recover the kubelet certificates as follows.
- Log in to one of the master nodes
- Locate the kubelet certificates
    ls /var/lib/kubelet/pki
- Locate the kubelet CA and copy it to /var/lib/kubelet/pki/signer.crt
    ls /etc/kubernetes/kubelet-ca.crt
- Locate the kubelet signer private key in etcd and copy the content of the value to a file named /var/lib/kubelet/pki/signer.key
    # Get the etcd pod id
    crictl ps | grep etcd
    # Enter the etcd pod
    crictl exec -it ${ETCD_POD_ID} bash
    export ETCDCTL_API=3
    # View all keys
    etcdctl get --keys-only --prefix=true "/kubernetes.io/secrets/openshift-kube-apiserver-operator"
    # Get key content
    etcdctl get "/kubernetes.io/secrets/openshift-kube-apiserver-operator/kube-apiserver-to-kubelet-signer"
- Use the following shell script to generate new server and client certificates: okd-renew-kubelet-cert.sh
  If you don't want to use the script, make sure that you generate a certificate with the node's FQDN and its IP as SANs.
    okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/current-client.pem
    okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/current-server.pem
- Review the newly generated PEM files
  You should see something like this:
    openssl x509 -in kubelet-server-current.pem-new -text
    Issuer: CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1703925636
    Validity
        Not Before: Aug 8 11:26:28 2024 GMT
        Not After : Nov 8 11:26:28 2042 GMT
    Subject: O = system:nodes, CN = system:node:master-node.example.com
    ...
    X509v3 extensions:
        X509v3 Subject Alternative Name:
            DNS:master-node.example.com, IP Address:x.x.x.x
- Replace the certificate file symlinks; the kubelet should pick them up immediately.
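If the helper script is not at hand, the manual route could look roughly like the sketch below. It is not the content of okd-renew-kubelet-cert.sh. The function name `renew_kubelet_cert`, the RSA key size, the key usages, the 365-day default and the `CA_CRT` / `CA_KEY` / `DAYS` variables are all assumptions made for illustration. It assumes the signer certificate and key have been placed at the paths from the steps above, and that the kubelet expects certificate and private key concatenated in one PEM file.

```shell
# Hypothetical helper: issue a kubelet certificate signed by the kubelet signer.
# Arguments: node FQDN, node IP, usage (clientAuth or serverAuth), output file.
renew_kubelet_cert() {
  fqdn="$1"; ip="$2"; usage="$3"; out="$4"
  ca_crt="${CA_CRT:-/var/lib/kubelet/pki/signer.crt}"
  ca_key="${CA_KEY:-/var/lib/kubelet/pki/signer.key}"
  days="${DAYS:-365}"   # assumed validity, adjust as needed
  tmp=$(mktemp -d) || return 1

  # New private key and CSR with the subject shown in the expected output.
  openssl req -new -newkey rsa:2048 -nodes \
    -keyout "$tmp/key.pem" -out "$tmp/req.csr" \
    -subj "/O=system:nodes/CN=system:node:${fqdn}" 2>/dev/null || return 1

  # Extensions: the node's FQDN and IP as SANs, plus assumed key usages.
  printf 'subjectAltName=DNS:%s,IP:%s\n' "$fqdn" "$ip" > "$tmp/ext.cnf"
  printf 'keyUsage=critical,digitalSignature,keyEncipherment\n' >> "$tmp/ext.cnf"
  printf 'extendedKeyUsage=%s\n' "$usage" >> "$tmp/ext.cnf"

  # Sign the CSR with the kubelet signer.
  openssl x509 -req -in "$tmp/req.csr" -sha256 -days "$days" \
    -CA "$ca_crt" -CAkey "$ca_key" \
    -CAserial "$tmp/ca.srl" -CAcreateserial \
    -extfile "$tmp/ext.cnf" -out "$tmp/cert.pem" 2>/dev/null || return 1

  # Certificate first, then the private key, in a single PEM file.
  cat "$tmp/cert.pem" "$tmp/key.pem" > "$out"
  rm -rf "$tmp"
}

# Example call (values are placeholders):
#   renew_kubelet_cert master-node.example.com x.x.x.x serverAuth kubelet-server-current.pem-new
```

Writing to a `-new` file first keeps the existing certificate untouched until the result has been reviewed with `openssl x509 -text`.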