
Pausing Container Restarts

Problem

In certain cases a pod will restart continuously. This can happen for various reasons, for example:

  • corrupted file system

  • inability to start services due to service dependency not being satisfied

  • Liveness/Readiness probe failure

Sometimes the pod must be kept running so that its file system can be accessed to correct configuration or apply fixes. The following process describes how to do this.

NOTE: Be VERY careful performing these actions. They can have critical impact on the infrastructure and the potential impact should not be underestimated. If you do not feel comfortable performing these changes please log a ticket with the SIEMonster support team (costs apply). Please also note the following critical considerations:

  1. Only code-safe editors may be used, e.g. Notepad++, VS Code or similar. WordPad, Windows Notepad and the like must never be used, not even in emergencies.

  2. The following instructions deal with editing YAML configuration files. A YAML file is considered corrupted if it contains a TAB character, whether entered accidentally or on purpose. Such a file should be discarded and a fresh file used. Only spaces are allowed.

Solution

Prerequisites:

The following are required for this solution:

  • kubectl access to the cluster that hosts the failing pod

  • Linux/Vim/YAML knowledge

To pause the pod the following steps need to be taken:

  1. First establish which pod is constantly restarting

    1. It can be in several different states, e.g. ContainerCreating, CrashLoopBackOff, or even Running for short periods of time.

    2. Check the restart counter to see if it is incrementing.
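The restart check in step 1 can be sketched as a small shell helper. This is an illustration rather than part of the original procedure: the function name restarting_pods is hypothetical, and the sketch assumes the default kubectl get pods column layout, in which RESTARTS is the fourth column.

```shell
# Hypothetical helper: print "<pod> <restart count>" for every pod in a
# namespace whose restart counter is above zero.
# Assumes the default column order NAME READY STATUS RESTARTS AGE.
restarting_pods() {
  namespace="$1"
  kubectl -n "$namespace" get pods --no-headers |
    awk '$4 > 0 { print $1, $4 }'
}

# Usage (run it twice, a minute or so apart, to see whether a counter grows):
#   restarting_pods test75-qa
```

Comparing two runs shows whether the counter is still incrementing, which is the signal step 1 asks for.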

  2. Once the pod has been identified, look at the end of its name to determine whether it belongs to a statefulset or a deployment. Statefulset pods always end with a single integer, e.g. -0 or -1. Deployment pods end with a random character sequence, as per the screenshot below.
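If the name suffix is ambiguous, the pod's owner reference can serve as a second check. The sketch below is an addition to the original steps, and pod_owner_kind is a hypothetical name. It reads .metadata.ownerReferences from the pod, which normally reports StatefulSet for statefulset pods and ReplicaSet for pods created by a deployment.

```shell
# Hypothetical helper: print the kind of the object that owns a pod.
# "StatefulSet" points to a statefulset; "ReplicaSet" normally points to a
# deployment, because a deployment manages its pods through a replica set.
pod_owner_kind() {
  namespace="$1"
  pod="$2"
  kubectl -n "$namespace" get pod "$pod" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}'
}

# Usage:
#   pod_owner_kind test75-qa test75-qa-misp-0
```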

  3. To get the statefulset name, remove the hyphen and the integer at the end of the pod name. To get the deployment name, remove the last two hyphen-separated segments, which are randomly generated (please refer to the screenshot above for the example). Examples below.

    BASH
    #For statefulsets
    test75-qa-misp-0 becomes test75-qa-misp
    
    #For deployments
    test75-qa-opencti-connector-amitt-6b6d486d9-fr4mj becomes test75-qa-opencti-connector-amitt
    

    We shall refer to the output of the above as pod_variable for simplicity.
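The name trimming in step 3 can be sketched in shell as follows. The function name workload_name is hypothetical, and the patterns assume the naming shown in the examples above: a trailing integer for statefulset pods, and two trailing generated segments for deployment pods.

```shell
# Hypothetical helper: derive the statefulset or deployment name
# (pod_variable) from a pod name.
#   sts        : strip the trailing "-<integer>"
#   deployment : strip the two trailing generated segments
workload_name() {
  kind="$1"
  pod="$2"
  case "$kind" in
    sts)
      printf '%s\n' "$pod" | sed -E 's/-[0-9]+$//'
      ;;
    deployment)
      printf '%s\n' "$pod" | sed -E 's/-[a-z0-9]+-[a-z0-9]+$//'
      ;;
    *)
      echo "unknown kind: $kind" >&2
      return 1
      ;;
  esac
}

# Usage:
#   pod_variable="$(workload_name sts test75-qa-misp-0)"
#   pod_variable="$(workload_name deployment test75-qa-opencti-connector-amitt-6b6d486d9-fr4mj)"
```

The result should still be verified against the cluster, as the next step does, because the trimming is purely textual.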

  4. The next step is to verify the name of the deployment or sts. Continuing with the examples above you can tie that information in with the following examples.

    BASH
    #To verify the STS (Statefulset) name
    kubectl -n test75-qa get sts |grep test75-qa-misp
    #The above will provide you with one or more results. Such as below
    test75-qa-misp         1/1     35d
    test75-qa-misp-db      1/1     35d
    
    #To verify the deployment name
    kubectl -n test75-qa get deployment |grep test75-qa-opencti-connector-amitt
    #The above will provide you with output such as the below
    test75-qa-opencti-connector-amitt            1/1     1            1           35d
  5. With the above information verified, a backup of the item that will be edited now needs to be created. This is to ensure that the service can be restored should mistakes be made in the steps below. Based on the above pod examples, the following commands will generate the required backups

    BASH
    #For STS backups
    kubectl -n test75-qa get sts test75-qa-misp -o yaml > <path_of_backup>/test75-qa-misp-bak.yaml-<date>-<user>
    #e.g.
    kubectl -n test75-qa get sts test75-qa-misp -o yaml > /home/ec-user/test75-qa-misp-bak.yaml-20231213-SM 
    #Perform a cat command to ensure the file contents are correct.
    cat <specified_path>/test75-qa-misp-bak.yaml-20231213-SM
    #NOTE: The -o in "-o yaml" is the letter o, not the digit zero.
    
    #For Deployment backups
    kubectl -n test75-qa get deployment test75-qa-opencti-connector-amitt -o yaml > <path_of_backup>/test75-qa-opencti-connector-amitt.yaml-<date>-<user>
    #e.g.
    kubectl -n test75-qa get deployment test75-qa-opencti-connector-amitt -o yaml > /home/ec-user/test75-qa-opencti-connector-amitt.yaml-20231213-SM 
    #Perform a cat command to ensure the file contents are correct.
    cat <specified_path>/test75-qa-opencti-connector-amitt.yaml-20231213-SM
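The backup step can also be wrapped so that an empty or failed backup stops the process before any edit is made. This is a sketch under assumptions: backup_workload is a hypothetical name, the file name pattern follows the statefulset example above (<name>-bak.yaml-<date>-<user>), and the date is taken from the local clock.

```shell
# Hypothetical helper: back up a statefulset or deployment to
# <dir>/<name>-bak.yaml-<YYYYMMDD>-<user> and refuse to continue if the
# resulting file is empty. Prints the backup path on success.
backup_workload() {
  namespace="$1"
  kind="$2"      # sts or deployment
  name="$3"
  dir="$4"
  user="$5"
  file="$dir/$name-bak.yaml-$(date +%Y%m%d)-$user"

  kubectl -n "$namespace" get "$kind" "$name" -o yaml > "$file" || return 1

  if [ ! -s "$file" ]; then
    echo "backup is empty: $file" >&2
    return 1
  fi
  printf '%s\n' "$file"
}

# Usage:
#   backup_workload test75-qa sts test75-qa-misp /home/ec-user SM
```

The size check does not replace reading the file with cat. It only catches the case where nothing was written.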
  6. Now that the backups have been made, the production item can be edited. Apart from the kubectl command used to open the editor, the changes inside the YAML are identical for statefulsets and deployments, so to avoid confusion the edit is shown only once.

BASH
#Edit the sts
kubectl -n test75-qa edit sts test75-qa-misp

#or

#Edit the deployment
kubectl -n test75-qa edit deployment test75-qa-opencti-connector-amitt
  1. When the editor opens, move to the line that starts with imagePullPolicy

  2. Press [Insert] and then press [End]; this moves you to the end of the line. Then press [ENTER]. It should look like below

  3. Now move the cursor so that it is in line with imagePullPolicy and paste the following

    YAML
    command: ["/bin/bash","-c","--"]
    args: ["while true; do sleep 30; done;"]
  4. The configuration should now look like the following

  5. Search for the line named initialDelaySeconds and append a 0 to the end of its value, so that, for example, 30 becomes 300.

  6. Press [ESC] then type :wq and Press [ENTER]

    1. If no mistakes were present and the system accepts the change, you will drop out of the editor completely without error. If errors or issues are present, the editor will reopen the configuration. If you are unsure what error was made, it is safest to press [ESC] and then type :q! which will quit without saving. Then restart the edit process above.

  7. If the changes were successful, the pod may automatically restart and enter a running state. If it doesn’t, please delete the pod and wait for the replacement to enter the running state. You can now enter the pod with the usual commands.
    🔖 NOTE: The above prevents the normal services from starting in the pod. You are able to make configuration changes inside the pod and repair file structure issues etc.
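Before entering the pod, it can help to confirm that the pause override was actually stored. The sketch below is an illustration with hypothetical function names. It assumes the edited container is the first one in the pod template and that the image provides /bin/bash, which the pause command above also relies on.

```shell
# Hypothetical helper: print the command override stored on the first
# container of a statefulset or deployment.
show_command_override() {
  namespace="$1"
  kind="$2"      # sts or deployment
  name="$3"
  kubectl -n "$namespace" get "$kind" "$name" \
    -o jsonpath='{.spec.template.spec.containers[0].command}'
}

# Hypothetical helper: open an interactive shell in the paused pod.
enter_pod() {
  namespace="$1"
  pod="$2"
  kubectl -n "$namespace" exec -it "$pod" -- /bin/bash
}

# Usage:
#   show_command_override test75-qa sts test75-qa-misp
#   enter_pod test75-qa test75-qa-misp-0
```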

  8. Once your changes or repairs are complete you can perform the following steps to roll back the pause.

  9. Edit the relevant statefulset or deployment

  10. Remove the lines below containers: and above env:, as per the screenshot below. These should be the command and args entries added when pausing the pod. Pressing d twice removes one line at a time and is the safest option.

  11. Modify the env line so that it lines up with the containers line, then add a preceding hyphen and space, as per the screenshot below

  12. Press [ESC] and then type :wq and Press [ENTER]

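  13. Repeat step 7 above: confirm that the pod restarts and enters a running state, and delete the pod if it does not restart on its own.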
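The rollback can be checked with a short sketch like the one below. It is not part of the original procedure: confirm_rollback is a hypothetical name, and the check assumes the workload had no command entry before the pause and that the edited container is the first one in the pod template. It uses kubectl rollout status to wait for the restarted pods.

```shell
# Hypothetical helper: fail if a command override is still stored on the
# first container, otherwise wait for the rollout to complete.
confirm_rollback() {
  namespace="$1"
  kind="$2"      # sts or deployment
  name="$3"

  override="$(kubectl -n "$namespace" get "$kind" "$name" \
    -o jsonpath='{.spec.template.spec.containers[0].command}')" || return 1

  if [ -n "$override" ]; then
    echo "command override still present: $override" >&2
    return 1
  fi
  kubectl -n "$namespace" rollout status "$kind/$name"
}

# Usage:
#   confirm_rollback test75-qa sts test75-qa-misp
```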
