Etcd service status: check for certs error

The script /etc/init.d/etcd is used by the service manager to manage the etcd service. The call '/etc/init.d/etcd status' uses the etcdctl health API to determine whether the service is running properly. If the etcd certs are replaced with new ones but the service has not yet been restarted to use them, the status call fails even though the service is running, and the service manager treats the service as failed.

'sm-audit' (which runs periodically) uses the '/etc/init.d/etcd status' call to determine and maintain service health. A false service status reported to the service manager can introduce many bugs. One such scenario is that 'sm' ignores the 'service restart' call if it thinks the service is disabled. As a result, etcd is not restarted with the new certs during upgrade activate and is not reachable by the kube-apiserver (which may have started using new client certs).

This change modifies the '/etc/init.d/etcd status' call so that it no longer relies solely on the etcd health API to determine whether the etcd service is running. If the health API fails with the 'bad certificate' error, the call checks for the existence of etcd runtime information instead.

Test Plan:
PASS: Replace old certs with new certs at /etc/etcd/ and do not restart the service. Check that the '/etc/init.d/etcd status' is 'running'.
PASS: Replace old certs with new certs at /etc/etcd/ and restart the service. Check that the '/etc/init.d/etcd status' is 'running'.

Closes-Bug: 2033942
Change-Id: Id30a262ca1bde6d8acb85de10882ca9bd4b59bdd
Signed-off-by: kaustubh.dhokte <kaustubh.dhokte@windriver.com>
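The fallback described in the message can be sketched in isolation as below. This is only an illustration of the decision order, not the script's actual code: the function names `pid_alive` and `classify_status` are hypothetical, the health text is passed in as an argument instead of being read from etcdctl, and the final "not running" branch is an assumption about the case the message does not describe.

```shell
#!/bin/bash
# Sketch only: helper names and output strings are hypothetical.

# Succeeds when the PID file holds the PID of a live process
# (Linux-specific: relies on the /proc/<pid> directory).
pid_alive() {
    local pidfile="$1" pid
    [ -r "$pidfile" ] || return 1
    pid="$(cat "$pidfile")"
    [[ "$pid" =~ ^[0-9]+$ ]] || return 1
    [ -d "/proc/$pid" ]
}

# Maps health-check output plus PID-file state to a status.
# $1: text captured from the health check, $2: path to the PID file.
classify_status() {
    local health="$1" pidfile="$2"
    if [[ "$health" == *"bad certificate"* ]]; then
        # The health check cannot be trusted, so fall back to the PID file.
        if pid_alive "$pidfile"; then
            echo "running (invalid certificates detected)"
            return 0
        fi
        echo "not running (invalid certificates detected)"
        return 1
    elif [[ "$health" == *"is healthy"* ]]; then
        echo "running"
        return 0
    fi
    # Assumed default for any other health output.
    echo "not running"
    return 1
}
```

For example, `classify_status "$(cat health.txt)" /var/run/etcd.pid` (both paths are placeholders) prints one status line and returns 0 or 1, which a caller could store in a variable such as `RETVAL`.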
2023-09-02 01:48:14 +00:00 · 2023-09-02 01:48:14 +00:00 · 3ffe8b7e1e
commit 3ffe8b7e1e
parent 8722928985
1 changed file with 20 additions and 2 deletions
--- a/puppet-manifests/src/modules/platform/files/etcd
+++ b/puppet-manifests/src/modules/platform/files/etcd
@@ -44,12 +44,30 @@ ETCD_LISTEN_CLIENT_URL="${URLS[-1]}"
 status()
 {
    if [[ $ETCD_LISTEN_CLIENT_URL =~ "https" ]]; then
-        etcd_health="$(etcdctl --timeout 5s --ca-file /etc/etcd/ca.crt -cert-file /etc/etcd/etcd-server.crt --key-file /etc/etcd/etcd-server.key --endpoints="$ETCD_LISTEN_CLIENT_URL" cluster-health 2>&1 | head -n 1)"
+        etcd_health="$(etcdctl --timeout 5s --ca-file /etc/etcd/ca.crt -cert-file /etc/etcd/etcd-server.crt --key-file /etc/etcd/etcd-server.key --endpoints="$ETCD_LISTEN_CLIENT_URL" cluster-health 2>&1)"
    else
        etcd_health="$(etcdctl --timeout 5s --endpoints="$ETCD_LISTEN_CLIENT_URL" cluster-health 2>&1 | head -n 1)"
    fi

-    if [[ $etcd_health =~ "is healthy" ]]; then
+    # LP: 2033942. In case if the status method is called in between
+    # certs are replaced and etcd service is restarted, etcd health call
+    # will result negative even though service is running fine.
+    # In this case we rely on PID file for the status of the service.
+    if [[ $etcd_health =~ "bad certificate"  ]]; then
+        if [ -e $PIDFILE ]; then
+            PIDDIR=/proc/$(cat $PIDFILE)
+            if [ -d $PIDDIR ]; then
+                RETVAL=0
+                echo "$DESC is running but invalid certificates detected."
+                return
+            fi
+            echo "$DESC is Not running. Also, invalid certificates detected."
+            RETVAL=1
+        else
+            echo "$DESC is Not running. Also, invalid certificates detected."
+            RETVAL=1
+        fi
+    elif [[ $etcd_health =~ "is healthy" ]]; then
        RETVAL=0
        echo "$DESC is running"
        return