OpenShift (OKD) cluster recovery when master nodes are down due to expired apiserver certificates
This guide helps you recover a cluster when kubelet is down and/or the apiserver DaemonSets are failing because of expired certificates. The main symptom is that all master nodes are NotReady and the cluster is down.
Detect the symptom
- Log in to one of the master nodes
- View pod status
crictl pods
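- View the logs of a container in the pod (crictl logs expects a container ID, which crictl ps -a lists)
crictl ps -a
crictl logs -f ${CONTAINER_ID} 2>&1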
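To confirm that expired certificates are the cause, the validity of a certificate file can also be checked directly on the node with openssl. The following is a minimal sketch, not part of the original procedure: the helper name cert_status is hypothetical, and the path in the usage comment assumes the kubelet PKI directory used later in this guide.

```shell
# Hypothetical helper: classify a PEM certificate as expired, expiring, or valid.
# Usage: cert_status <file.pem> [seconds]   (default window: 7 days)
cert_status() {
  cert=$1
  window=${2:-604800}
  if [ ! -r "$cert" ]; then
    echo "unreadable"
    return 1
  fi
  # `openssl x509 -checkend N` exits non-zero if the certificate
  # expires within the next N seconds.
  if ! openssl x509 -in "$cert" -noout -checkend 0 >/dev/null 2>&1; then
    echo "expired"
  elif ! openssl x509 -in "$cert" -noout -checkend "$window" >/dev/null 2>&1; then
    echo "expiring"
  else
    echo "valid"
  fi
}

# Example (run on a master node):
#   for f in /var/lib/kubelet/pki/*.pem; do echo "$f: $(cert_status "$f")"; done
```

A result of "expired" for the kubelet certificates points to the kubelet recovery steps further down.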
Symptom: Login to the cluster not possible via oauth-openshift
It may happen that even oauth-openshift is not operational and responds with 500 Internal Server Error. In that case, every credential-based login is impossible.
- Log in to one of the master nodes
- Export the fallback kubecontext
export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
- Test cluster access
oc get nodes
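The export step above can be wrapped in a small guard so that a missing or unreadable file is noticed before any oc command runs. This is only a sketch: the function name use_recovery_kubeconfig is hypothetical, and the default path is simply the one given in this guide.

```shell
# Hypothetical helper: export KUBECONFIG only if the recovery kubeconfig is readable.
# The default path is the one used in this guide; pass another path to override it.
use_recovery_kubeconfig() {
  kubeconfig=${1:-/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig}
  if [ ! -r "$kubeconfig" ]; then
    echo "recovery kubeconfig not readable: $kubeconfig" >&2
    return 1
  fi
  export KUBECONFIG="$kubeconfig"
}

# Example:
#   use_recovery_kubeconfig && oc get nodes
```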
Symptom: apiserver: Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid
This error usually means that certificate signing requests (CSRs) are waiting for approval. You can usually approve these requests.
- Log in to the cluster context
- View open certificate requests
oc get csr
- Approve all outstanding certificates
oc get csr -o name | xargs oc adm certificate approve
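If only the requests that are still pending should be approved, the list can be filtered first. The sketch below assumes that the last column of `oc get csr --no-headers` is the CONDITION column; the function name pending_csr_names is hypothetical.

```shell
# Hypothetical helper: read `oc get csr --no-headers` output on stdin and
# print the names of requests whose last column (CONDITION) is "Pending".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Example (assumes a working cluster context; `xargs -r` is a GNU extension
# that skips the command when there is no input):
#   oc get csr --no-headers | pending_csr_names | xargs -r oc adm certificate approve
```

New requests can appear after the first round of approvals, so it may be necessary to repeat the command until `oc get csr` shows no pending entries.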
Symptom: kubelet certificates outdated
You can recover the kubelet certificates in the following way.
- Log in to one of the master nodes
- Locate the kubelet certificates
ls /var/lib/kubelet/pki
- Locate the kubelet CA and copy it to /var/lib/kubelet/pki/signer.crt
ls /etc/kubernetes/kubelet-ca.crt
- Locate the kubelet signer private key in etcd and copy the content of the value to a file named /var/lib/kubelet/pki/signer.key
# Get the etcd pod id
crictl ps | grep etcd
# Enter the etcd pod
crictl exec -it ${ETCD_POD_ID} bash
export ETCDCTL_API=3
# View all keys
etcdctl get --keys-only --prefix=true "/kubernetes.io/secrets/openshift-kube-apiserver-operator"
# Get key content
etcdctl get "/kubernetes.io/secrets/openshift-kube-apiserver-operator/kube-apiserver-to-kubelet-signer"
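The value returned by etcdctl is the stored secret object, not a clean key file, so the private key block has to be cut out of it. The sketch below assumes that the secret is stored unencrypted and that the PEM text of the key appears verbatim inside the value; if etcd encryption is enabled, this does not work. The function name extract_private_key and the file name signer.raw are hypothetical.

```shell
# Hypothetical helper: read a raw etcd value on stdin and print only the
# PEM private key block, stripping any bytes before BEGIN and after END.
extract_private_key() {
  LC_ALL=C sed -n '/-----BEGIN [A-Z ]*PRIVATE KEY-----/,/-----END [A-Z ]*PRIVATE KEY-----/p' |
    LC_ALL=C sed -e '1s/^.*\(-----BEGIN [A-Z ]*PRIVATE KEY-----\)/\1/' \
                 -e '$s/\(-----END [A-Z ]*PRIVATE KEY-----\).*$/\1/'
}

# Example (signer.raw holds the output of `etcdctl get --print-value-only <key>`):
#   extract_private_key < signer.raw > /var/lib/kubelet/pki/signer.key
#   openssl pkey -in /var/lib/kubelet/pki/signer.key -noout   # fails if the key is not parseable
```

LC_ALL=C makes sed treat the input as plain bytes, which matters because the surrounding data is binary.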
- Use the following shell script to generate new server and client certificates: okd-renew-kubelet-cert.sh
If you don't want to use the script, make sure that you generate a certificate with the node's FQDN and its IP address as SANs.
okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/current-client.pem
okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/current-server.pem
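For the manual route, the certificate can be created with plain openssl. The following is a rough sketch under stated assumptions and is not the okd-renew-kubelet-cert.sh script: the function name make_node_cert, the 365-day default validity, and the choice to write certificate and key into one PEM file are assumptions. The subject follows the O = system:nodes, CN = system:node:<FQDN> pattern shown in the review step of this guide. No extended key usage is set, so adjust the extension file if separate client and server usages are required.

```shell
# Hypothetical helper: issue a node certificate signed by the kubelet signer.
# Usage: make_node_cert <signer.crt> <signer.key> <node-fqdn> <node-ip> <out.pem> [days]
make_node_cert() {
  ca_crt=$1; ca_key=$2; fqdn=$3; ip=$4; out=$5; days=${6:-365}
  work=$(mktemp -d) || return 1
  # SANs: the node's FQDN and its IP address.
  printf 'subjectAltName=DNS:%s,IP:%s\n' "$fqdn" "$ip" > "$work/san.ext"
  # 1) new key and CSR with the node subject
  # 2) sign the CSR with the signer CA and attach the SANs
  # 3) write certificate and key into a single PEM file (assumption)
  openssl req -new -newkey rsa:2048 -nodes \
      -subj "/O=system:nodes/CN=system:node:${fqdn}" \
      -keyout "$work/node.key" -out "$work/node.csr" &&
  openssl x509 -req -in "$work/node.csr" -sha256 -days "$days" \
      -CA "$ca_crt" -CAkey "$ca_key" \
      -CAserial "$work/ca.srl" -CAcreateserial \
      -extfile "$work/san.ext" -out "$work/node.crt" &&
  cat "$work/node.crt" "$work/node.key" > "$out"
  rc=$?
  rm -rf "$work"
  return $rc
}

# Example (FQDN and IP are placeholders):
#   make_node_cert /var/lib/kubelet/pki/signer.crt /var/lib/kubelet/pki/signer.key \
#     master-node.example.com x.x.x.x /var/lib/kubelet/pki/kubelet-server-current.pem-new
```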
- Review the newly generated PEM files
openssl x509 -in kubelet-server-current.pem-new -text
You should see something like this:
Issuer: CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1703925636
Validity
    Not Before: Aug 8 11:26:28 2024 GMT
    Not After : Nov 8 11:26:28 2042 GMT
Subject: O = system:nodes, CN = system:node:master-node.example.com
...
X509v3 extensions:
    X509v3 Subject Alternative Name:
        DNS:master-node.example.com, IP Address:x.x.x.x
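The SAN check can also be done without reading the full text dump. The sketch below parses the line that follows "X509v3 Subject Alternative Name" in the openssl output and compares each entry exactly; the function name cert_has_san is hypothetical.

```shell
# Hypothetical helper: succeed if the certificate lists the given
# DNS name or IP address as a Subject Alternative Name.
# Usage: cert_has_san <file.pem> <dns-name-or-ip>
cert_has_san() {
  openssl x509 -in "$1" -noout -text |
    grep -A1 'Subject Alternative Name' | tail -n 1 |
    tr ',' '\n' | sed 's/^ *//' |
    grep -qFx -e "DNS:$2" -e "IP Address:$2"
}

# Example:
#   cert_has_san kubelet-server-current.pem-new master-node.example.com && echo "SAN ok"
```

Entries are matched as whole strings, so a name that is only a substring of a listed SAN does not count as a match.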
- Replace the certificate file symlinks so that they point to the new files. kubelet should pick up the new certificates immediately.
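The symlink replacement can be sketched as follows. This assumes that the current certificate paths are symlinks and that the new file is the *-new file reviewed above; the function name replace_symlink is hypothetical. The previous target is printed so that the change can be reverted by hand.

```shell
# Hypothetical helper: point an existing symlink at a new target file.
# Usage: replace_symlink <new-target> <link>
replace_symlink() {
  target=$1; link=$2
  if [ ! -f "$target" ]; then
    echo "missing target: $target" >&2
    return 1
  fi
  if [ -L "$link" ]; then
    echo "previous target: $(readlink "$link")"
  fi
  # -s symbolic, -f replace an existing link, -n do not follow the old link
  ln -sfn "$target" "$link"
}

# Example (file names are assumptions based on the review step):
#   replace_symlink /var/lib/kubelet/pki/kubelet-server-current.pem-new \
#     /var/lib/kubelet/pki/kubelet-server-current.pem
```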