Issue with Pending States in AWS Cluster
TLDR: Jatin reported that Kubernetes pods get stuck in a pending state when nodes go down in their AWS cluster. Although Jatin provided kubectl describe output and logs, Prashant could not pinpoint the cause and said a deeper investigation into the cluster and its k8s resources would be required.
Aug 07, 2023 (4 months ago)
Jatin
05:26 AM
Prashant
12:55 PM
Can you share the kubectl describe output of the pods and the associated PVCs, if any?
Aug 08, 2023 (4 months ago)
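(A minimal sketch of how the kubectl describe output requested above could be gathered — the apm namespace is taken from the event output later in this thread, and the <pod>/<pvc> names are placeholders, not commands from the thread:)
# namespace "apm" comes from the events below; substitute the actual pending resources
kubectl -n apm get pods
kubectl -n apm describe pod <pod>
kubectl -n apm get pvc
kubectl -n apm describe pvc <pvc>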
Jatin
05:13 AM
Installation with the Helm chart works fine, but once a few nodes in our cluster go down, these pods get stuck in a pending state:
Signoz-frontend
Signoz-alertmanager
Signoz-query-service
Signoz-otel-collector
Signoz-otel-collector-metrics
All of the above pods get stuck in the init condition (pod initialization):
Signoz-frontend -> waits for query service to come up
Signoz-alertmanager -> waits for query service to come up
Signoz-query-service
Signoz-otel-collector
Signoz-otel-collector-metrics
-> these wait for the DB to come up
PVCs are created
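(A minimal sketch for narrowing down where the init chain described above is stuck — the apm namespace comes from the events below; pod names are placeholders:)
# Check which init container each pending pod is waiting on (see the "Init Containers" section)
kubectl -n apm describe pod <signoz-query-service-pod>
# Check whether the DB the collectors and query service wait on came back after the node loss
kubectl -n apm get pods | grep -i clickhouse
# Pending or Lost PVCs here would point at a volume/scheduling problem rather than the apps themselves
kubectl -n apm get pvc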
Jatin
05:14 AM
Jatin
05:15 AM
Prashant
05:33 AM
Prashant
05:33 AM
Jatin
05:37 AM
LAST SEEN TYPE REASON OBJECT MESSAGE
6m26s Warning Unhealthy pod/test-zookeeper-0 Readiness probe failed:
49m Warning NodeNotReady pod/test-k8s-infra-otel-agent-qxtml Node is not ready
47m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-106-81.ap-south-1.compute.internal" not found
47m Normal Scheduled pod/test-k8s-infra-otel-agent-rzz4x Successfully assigned apm/test-k8s-infra-otel-agent-rzz4x to ip-10-221-104-71.ap-south-1.compute.internal
47m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rzz4x
47m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rzz4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rzz4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rzz4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
47m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rzz4x Pod sandbox changed, it will be killed and re-created.
47m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/istio/proxyv2:1.11.8"
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/istio/proxyv2:1.11.8" in 12.660076873s (12.660104413s including waiting)
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-init
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-init
46m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 6.76949216s (6.76950013s including waiting)
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-proxy
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-proxy
46m Warning Unhealthy pod/test-k8s-infra-otel-agent-rzz4x Readiness probe failed: Get "http://10.221.104.65:15021/healthz/ready": dial tcp 10.221.104.65:15021: connect: connection refused
31m Warning NodeNotReady pod/test-k8s-infra-otel-agent-pnvqw Node is not ready
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-tlr4x
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rrntf
29m Normal Scheduled pod/test-k8s-infra-otel-agent-rrntf Successfully assigned apm/test-k8s-infra-otel-agent-rrntf to ip-10-221-107-222.ap-south-1.compute.internal
29m Normal Scheduled pod/test-k8s-infra-otel-agent-tlr4x Successfully assigned apm/test-k8s-infra-otel-agent-tlr4x to ip-10-221-104-219.ap-south-1.compute.internal
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rrntf Pod sandbox changed, it will be killed and re-created.
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-tlr4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-tlr4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-tlr4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rrntf Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rrntf_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rrntf_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-tlr4x Pod sandbox changed, it will be killed and re-created.
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/istio/proxyv2:1.11.8"
29m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-init
Jatin
05:37 AM
29m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-init
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
28m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-108-137.ap-south-1.compute.internal" not found
29m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
29m Normal Pulling pod/test-k8s-infra-otel-agent-rrntf Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
29m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-init
29m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-init
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-proxy
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 8.689811974s (8.689822774s including waiting)
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 21.345092739s (21.34510044s including waiting)
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container test-k8s-infra-otel-agent
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-proxy
28m Warning Unhealthy pod/test-k8s-infra-otel-agent-rrntf Readiness probe failed: Get "http://10.221.106.154:15021/healthz/ready": dial tcp 10.221.106.154:15021: connect: connection refused
22m Warning NodeNotReady pod/test-k8s-infra-otel-agent-5b55w Node is not ready
21m Normal SuccessfulCreate replicaset/test-signoz-frontend-564577c7d8 Created pod: test-signoz-frontend-564577c7d8-zxtz5
21m Normal Scheduled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully assigned apm/test-signoz-frontend-564577c7d8-zxtz5 to ip-10-221-107-222.ap-south-1.compute.internal
21m Normal Pulling pod/test-signoz-frontend-564577c7d8-zxtz5 Pulling image "docker.io/busybox:1.35"
21m Normal Pulled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully pulled image "docker.io/busybox:1.35" in 3.987910995s (3.987919355s including waiting)
20m Normal Started pod/test-signoz-frontend-564577c7d8-zxtz5 Started container test-signoz-frontend-init
20m Normal Created pod/test-signoz-frontend-564577c7d8-zxtz5 Created container test-signoz-frontend-init
20m Warning FailedToUpdateEndpointSlices service/test-signoz-frontend Error updating Endpoint Slices for Service apm/test-signoz-frontend: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-89pmm
20m Normal Scheduled pod/test-k8s-infra-otel-agent-89pmm Successfully assigned apm/test-k8s-infra-otel-agent-89pmm to ip-10-221-107-37.ap-south-1.compute.internal
20m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
19m Warning FailedScheduling pod/test-signoz-alertmanager-0 0/12 nodes are available: 4 node(s) had volume node affinity conflict, 8 Insufficient cpu.
20m Normal TaintManagerEviction pod/test-signoz-alertmanager-0 Cancelling deletion of Pod apm/test-signoz-alertmanager-0
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager Error updating Endpoint Slices for Service apm/test-signoz-alertmanager: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager-headless Error updating Endpoint Slices for Service apm/test-signoz-alertmanager-headless: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal TaintManagerEviction pod/test-signoz-frontend-564577c7d8-n5ww8 Cancelling deletion of Pod apm/test-signoz-frontend-564577c7d8-n5ww8
20m Normal SuccessfulCreate statefulset/test-signoz-alertmanager create Pod test-signoz-alertmanager-0 in StatefulSet test-signoz-alertmanager successful
20m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-89pmm Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-89pmm_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-89pmm_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
19m Normal SandboxChanged pod/test-k8s-infra-otel-agent-89pmm Pod sandbox changed, it will be killed and re-created.
20m Normal NotTriggerScaleUp pod/test-signoz-alertmanager-0 pod didn't trigger scale-up: 1 max node group size reached, 1 node(s) had volume node affinity conflict
19m Normal Pulling pod/test-k8s-infra-otel-agent-89pmm Pulling image "docker.io/istio/proxyv2:1.11.8"
19m Normal TriggeredScaleUp pod/test-signoz-alertmanager-0 pod triggered scale-up: [{eks-ng-spot-3ac33d20-e8b5-4bc4-c587-b474d18bf00a 12->13 (max: 25)}]
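(Two patterns stand out in the events above. The repeated "dial tcp 127.0.0.1:50051: connect: connection refused" sandbox failures usually mean the node-local AWS VPC CNI agent (aws-node) was not yet ready on the replacement nodes, since its ipamd serves CNI requests on that port. The "volume node affinity conflict" plus "Insufficient cpu" scheduling failure for test-signoz-alertmanager-0 suggests its EBS-backed volume is pinned to an availability zone where no node currently has room. A hedged sketch for confirming both, with placeholder names:)
# Is the VPC CNI daemonset ready on the freshly added nodes?
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs <aws-node-pod> -c aws-node --tail=100
# Which zone is the alertmanager volume pinned to, and is there capacity in that zone?
kubectl -n apm get pvc | grep alertmanager
kubectl describe pv <pv-bound-to-that-pvc> | grep -A 5 "Node Affinity"
kubectl get nodes -L topology.kubernetes.io/zone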
Jatin
05:39 AM
---------------------------
zookeeper 02:13:55.55
zookeeper 02:13:55.56 Welcome to the Bitnami zookeeper container
zookeeper 02:13:55.56 Subscribe to project updates by watching https://github.com/bitnami/containers
zookeeper 02:13:55.56 Submit issues and feature requests at https://github.com/bitnami/containers/issues
zookeeper 02:13:55.57
zookeeper 02:13:55.57 INFO ==> * Starting ZooKeeper setup *
zookeeper 02:13:55.62 WARN ==> You have set the environment variable ALLOW_ANONYMOUS_LOGIN=yes. For safety reasons, do not use this flag in a production environment.
zookeeper 02:13:55.64 INFO ==> Initializing ZooKeeper...
zookeeper 02:13:55.64 INFO ==> No injected configuration file found, creating default config files...
zookeeper 02:13:55.71 INFO ==> No additional servers were specified. ZooKeeper will run in standalone mode...
zookeeper 02:13:55.72 INFO ==> Deploying ZooKeeper with persisted data...
zookeeper 02:13:55.73 INFO ==> * ZooKeeper setup finished! *
zookeeper 02:13:55.75 INFO ==> * Starting ZooKeeper *
/opt/bitnami/java/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
Removing file: Aug 6, 2023, 2:38:34 AM /bitnami/zookeeper/data/version-2/log.4f7
Removing file: Aug 7, 2023, 4:06:43 AM /bitnami/zookeeper/data/version-2/snapshot.4fb
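(The ZooKeeper log above shows a clean standalone restart with persisted data, so the earlier readiness-probe warning for test-zookeeper-0 may just be the window before the server starts answering. A sketch to verify, using the pod name from the events above; the script path assumes the Bitnami image layout shown in the log:)
# Ask the Bitnami ZooKeeper container whether the server is serving requests
kubectl -n apm exec test-zookeeper-0 -- /opt/bitnami/zookeeper/bin/zkServer.sh status
# Inspect the readiness probe the kubelet runs against it
kubectl -n apm get pod test-zookeeper-0 -o jsonpath='{.spec.containers[0].readinessProbe}'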
Aug 10, 2023 (3 months ago)
Jatin
08:41 AM
Similar Threads
Issue with Helm Installation in GKE Autopilot Cluster
Kalman faced issues with helm installation with pods stuck in init state, and some crashing in a GKE autopilot cluster. Mayur provided suggestions to diagnose the issue, including checking IAM permissions and storage classes, and adjusting resource limits in the helm values. The thread is unresolved.
Issues with SigNoz Setup and Data Persistence in AKS
Vaibhavi experienced issues setting up SigNoz in AKS, and faced data persistence issues after installation. Srikanth provided guidance on ClickHouse version compatibility and resource requirements, helping Vaibhavi troubleshoot and resolve the issue.
Issues with Signoz on k3s Cluster Using Helm
Nilanjan encountered issues with Signoz on a k3s cluster using Helm, with some pods not running. Srikanth and Prashant suggested using `kubectl describe` to diagnose the issue, but the problem remains unresolved.
Issues with SigNoz Install through Helm Chart
Romain experienced a delay in SigNoz installation through Helm Chart, with pods in init state. Prashant identified the issue as insufficient resources in the K8s cluster and suggested specifying a storage class for PVCs, resolving the problem.
SigNoz crashing in k8s due to ClickHouse OOM
Travis reported SigNoz crashing in k8s due to ClickHouse OOM. The team suggested increasing resources for ClickHouse, and other troubleshooting steps, but the issue remains unresolved.