This guide helps you recover a cluster when the kubelet is down and/or the apiserver DaemonSets are failing because of expired certificates. The main symptom is that all master nodes are NotReady and the cluster is down.

Detect the symptom

  • Log in to one of the master nodes
  • View pod status
    crictl pods
    
  • View the container logs (crictl logs expects a container ID, which crictl ps -a lists)
    crictl logs -f ${CONTAINER_ID} 2>&1
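
Checking every container by hand gets tedious, so the log inspection above can be wrapped in a small loop. This is a minimal sketch, not part of the original procedure: the function name find_cert_errors and the default search pattern are assumptions, and it expects crictl to be available to the current user on the node.

```shell
# Print the ID of every container whose recent log lines mention
# certificate problems. The function name and the default pattern are
# illustrative assumptions, not part of the original guide.
find_cert_errors() {
  local pattern="${1:-x509|certificate}"
  local id
  # "crictl ps -a -q" lists the IDs of all containers, including exited ones.
  for id in $(crictl ps -a -q); do
    # Only look at the last 50 log lines to keep the scan fast.
    if crictl logs --tail 50 "$id" 2>&1 | grep -Eiq "$pattern"; then
      echo "$id"
    fi
  done
}
```

Calling find_cert_errors without arguments prints one container ID per line; each ID can then be passed to crictl logs for a closer look.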
    

Symptom: Login to the cluster via oauth-openshift is not possible

It may happen that even oauth-openshift is not operational and responds with 500 Internal Server Error. In that case, every credential-based login is impossible.

  • Log in to one of the master nodes
  • Export the fallback kubeconfig
    export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
    
  • Test cluster access
    oc get nodes
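
The export step fails silently if the recovery kubeconfig is missing on the node you picked. As a hedged sketch, the helper below checks that the file is readable before exporting it. The function name use_recovery_kubeconfig is an assumption; the default path is the one used above.

```shell
# Export the recovery kubeconfig only if it exists and is readable.
# The function name is an illustrative assumption.
use_recovery_kubeconfig() {
  local cfg="${1:-/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig}"
  if [ ! -r "$cfg" ]; then
    echo "recovery kubeconfig not readable: $cfg" >&2
    return 1
  fi
  export KUBECONFIG="$cfg"
}
```

A typical call would be use_recovery_kubeconfig && oc get nodes, so that the access test only runs when the kubeconfig was actually found.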
    

Symptom: apiserver: Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid

In this case there are usually pending certificate signing requests (CSRs) that you can approve.

  • Log in to the cluster context
  • View the open certificate signing requests
    oc get csr
    
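  • Approve all outstanding requests (oc adm certificate approve needs the CSR names as arguments)
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve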
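
Approving once is often not enough: after the client requests are approved, nodes typically submit further serving-certificate requests. The following sketch repeats the approval for a few rounds until nothing is pending. The function names, the number of rounds, and the delay are assumptions chosen for illustration.

```shell
# List the names of CSRs that have no status yet (that is, still pending).
pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Approve pending CSRs in several rounds, because new requests can appear
# after the first batch is approved. Arguments (both assumed defaults):
#   $1 = maximum number of rounds, $2 = seconds to wait between rounds.
approve_pending_csrs() {
  local rounds="${1:-5}" delay="${2:-10}"
  local names name i=0
  while [ "$i" -lt "$rounds" ]; do
    names="$(pending_csrs)"
    if [ -z "$names" ]; then
      return 0
    fi
    for name in $names; do
      oc adm certificate approve "$name"
    done
    sleep "$delay"
    i=$((i + 1))
  done
  # Report failure if requests are still pending after the last round.
  [ -z "$(pending_csrs)" ]
}
```

Running approve_pending_csrs 5 10 would try up to five rounds with ten seconds between them. Review the output of oc get csr first if you are unsure whether every pending request is legitimate.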
    

Symptom: kubelet certificates outdated

You can recover the kubelet certificates as follows.

  • Log in to one of the master nodes
  • Locate the kubelet certificates
    ls /var/lib/kubelet/pki
    
  • Locate the kubelet CA and copy it to /var/lib/kubelet/pki/signer.crt
    ls /etc/kubernetes/kubelet-ca.crt
    
  • Retrieve the kubelet signer private key from etcd and copy the value into a file named /var/lib/kubelet/pki/signer.key
    # Get the etcd container ID
    crictl ps | grep etcd
    # Open a shell in the etcd container
    crictl exec -it ${ETCD_CONTAINER_ID} bash
    export ETCDCTL_API=3
    # View all keys
    etcdctl get --keys-only --prefix=true "/kubernetes.io/secrets/openshift-kube-apiserver-operator"
    # Get key content
    etcdctl get "/kubernetes.io/secrets/openshift-kube-apiserver-operator/kube-apiserver-to-kubelet-signer"
    
  • Use the shell script okd-renew-kubelet-cert.sh to generate new client and server certificates
    okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/kubelet-client-current.pem
    okd-renew-kubelet-cert.sh /var/lib/kubelet/pki/kubelet-server-current.pem
    
    If you don't want to use the script, make sure that the certificate you generate contains the node's FQDN and its IP address as SANs.
  • Review the newly generated PEM files
    openssl x509 -in kubelet-server-current.pem-new -text -noout
    
    You should see something like this:
    Issuer: CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1703925636
    Validity
      Not Before: Aug  8 11:26:28 2024 GMT
      Not After : Nov  8 11:26:28 2042 GMT
    Subject: O = system:nodes, CN = system:node:master-node.example.com
    ...
    X509v3 extensions:
      X509v3 Subject Alternative Name:
          DNS:master-node.example.com, IP Address:x.x.x.x
    
  • Replace the certificate file symlinks so that they point to the new files. The kubelet should pick up the new certificates immediately.
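
The last step has no command in the original, so here is one way it could look. This is a sketch under assumptions: the function name swap_cert_link is hypothetical, and it assumes the new PEM file sits next to the existing symlink, as the -new file name in the review step suggests. A regular file found in place of the symlink is kept as a .bak copy.

```shell
# Point a certificate symlink at a newly generated PEM file.
#   $1 = path of the existing link, $2 = path of the new PEM file.
# The function name is an illustrative assumption.
swap_cert_link() {
  local link="$1" new="$2"
  # Refuse to continue if the new certificate is missing or empty.
  if [ ! -s "$new" ]; then
    echo "new certificate missing or empty: $new" >&2
    return 1
  fi
  if [ -L "$link" ]; then
    # Print the old target so it can be restored by hand if needed.
    echo "previous target: $(readlink "$link")"
  elif [ -e "$link" ]; then
    # A regular file is in the way: keep it as a backup.
    mv "$link" "$link.bak"
  fi
  # -f replaces the existing link, -n avoids following it.
  ln -sfn "$new" "$link"
}
```

For example, swap_cert_link /var/lib/kubelet/pki/kubelet-server-current.pem /var/lib/kubelet/pki/kubelet-server-current.pem-new would repoint the server certificate link; the old target file stays on disk untouched.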

References