#general

Issue with Pending States in AWS Cluster

TLDR Jatin reported that Kubernetes pods get stuck in the Pending state when nodes go down in their AWS cluster. Although Jatin shared kubectl output, events, and logs, Prashant could not pinpoint the cause and said a deeper investigation into the cluster and k8s resources would be required.

Aug 07, 2023 (4 months ago)
Jatin
05:26 AM
Hi, we set up SigNoz on Kubernetes in AWS and everything worked fine, but once a few nodes went down in our cluster, the signoz-collector and query-service pods get stuck in the Pending state when new pods come up.
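A quick way to see which pods are Pending and why the scheduler is holding them (a sketch; the apm namespace is taken from the events shared later in this thread, and the pod name is a placeholder):

# List pods stuck in Pending in the SigNoz namespace
kubectl get pods -n apm --field-selector status.phase=Pending
# The Events section at the bottom usually carries the scheduler's reason
kubectl describe pod <pending-pod> -n apm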
Prashant
12:55 PM
It will be difficult to know why it is stuck in the Pending state without taking a deeper look into the cluster and k8s resources.

Can you share kubectl describe of the pods and the associated PVCs, if any?
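A minimal set of commands that would cover this request (pod and PVC names are placeholders):

# Describe the stuck pods and any PVCs they reference
kubectl describe pod <signoz-query-service-pod> -n apm
kubectl get pvc -n apm
kubectl describe pvc <pvc-name> -n apm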
Aug 08, 2023 (4 months ago)
Jatin
05:13 AM
Installed SigNoz on AWS following https://signoz.io/docs/install/kubernetes/aws/


Installation with the Helm chart works fine, but once a few nodes in our cluster go down, these pods get stuck in the Pending state:

Signoz-frontend
Signoz-alertmanager
Signoz-query-service
Signoz-otel-collector
Signoz-otel-collector-metrics

All of the above pods get stuck in pod initialization (their init containers never complete):

Signoz-frontend -> waits for the query service to come up
Signoz-alertmanager -> waits for the query service to come up
Signoz-query-service, Signoz-otel-collector, Signoz-otel-collector-metrics -> these wait for the ClickHouse DB to come up

PVs created.
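One way to confirm which init container a pod is blocked on and why (a sketch; pod and init-container names are placeholders, following the `...-init` naming visible in the events later in the thread):

# Init container status, including wait reasons and exit codes
kubectl get pod <signoz-frontend-pod> -n apm -o jsonpath='{.status.initContainerStatuses}'
# Logs of the init container that is waiting on its dependency
kubectl logs <signoz-frontend-pod> -n apm -c <signoz-frontend-init>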
Jatin
05:14 AM
and I am not able to figure out what the problem is with the ClickHouse DB
Jatin
05:15 AM
kubectl describe does not show anything helpful
Prashant
05:33 AM
Logs or events from the ClickHouse/ZooKeeper pods or the associated PVCs and PVs usually help.
Prashant
05:33 AM
sometimes even the exit codes
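Commands along those lines (a sketch; test-zookeeper-0 is the ZooKeeper pod name that appears in the events shared below, while the ClickHouse pod and PV names are placeholders):

# Logs from the ZooKeeper and ClickHouse pods (current and previous container runs)
kubectl logs test-zookeeper-0 -n apm --previous
kubectl logs <clickhouse-pod> -n apm
# Events scoped to a single pod
kubectl get events -n apm --field-selector involvedObject.name=test-zookeeper-0
# PVC/PV status plus exit codes of the last terminated containers
kubectl get pvc -n apm
kubectl get pv
kubectl get pod test-zookeeper-0 -n apm -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'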
Jatin
05:37 AM
kubectl get events --sort-by=.metadata.creationTimestamp -n apm
LAST SEEN TYPE REASON OBJECT MESSAGE
6m26s Warning Unhealthy pod/test-zookeeper-0 Readiness probe failed:
49m Warning NodeNotReady pod/test-k8s-infra-otel-agent-qxtml Node is not ready
47m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-106-81.ap-south-1.compute.internal" not found
47m Normal Scheduled pod/test-k8s-infra-otel-agent-rzz4x Successfully assigned apm/test-k8s-infra-otel-agent-rzz4x to ip-10-221-104-71.ap-south-1.compute.internal
47m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rzz4x
47m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rzz4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rzz4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "5178434579de3efe72deba3f8b9af43fcc67cd4dbf58a425924db91ef1737fd7" network for pod "test-k8s-infra-otel-agent-rzz4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rzz4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
47m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rzz4x Pod sandbox changed, it will be killed and re-created.
47m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/istio/proxyv2:1.11.8"
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/istio/proxyv2:1.11.8" in 12.660076873s (12.660104413s including waiting)
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-init
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-init
46m Normal Pulling pod/test-k8s-infra-otel-agent-rzz4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 6.76949216s (6.76950013s including waiting)
46m Normal Created pod/test-k8s-infra-otel-agent-rzz4x Created container istio-proxy
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container test-k8s-infra-otel-agent
46m Normal Pulled pod/test-k8s-infra-otel-agent-rzz4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
46m Normal Started pod/test-k8s-infra-otel-agent-rzz4x Started container istio-proxy
46m Warning Unhealthy pod/test-k8s-infra-otel-agent-rzz4x Readiness probe failed: Get "http://10.221.104.65:15021/healthz/ready": dial tcp 10.221.104.65:15021: connect: connection refused
31m Warning NodeNotReady pod/test-k8s-infra-otel-agent-pnvqw Node is not ready
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-tlr4x
29m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-rrntf
29m Normal Scheduled pod/test-k8s-infra-otel-agent-rrntf Successfully assigned apm/test-k8s-infra-otel-agent-rrntf to ip-10-221-107-222.ap-south-1.compute.internal
29m Normal Scheduled pod/test-k8s-infra-otel-agent-tlr4x Successfully assigned apm/test-k8s-infra-otel-agent-tlr4x to ip-10-221-104-219.ap-south-1.compute.internal
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-rrntf Pod sandbox changed, it will be killed and re-created.
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-tlr4x Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-tlr4x_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "edda6b015573522c3f9ddd33b46176ecce94a19bb6ef906b6a8aadd43f577450" network for pod "test-k8s-infra-otel-agent-tlr4x": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-tlr4x_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
29m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-rrntf Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-rrntf_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "5127920af716862f5fa1ee269878bde716737f2520fe5ceec4060541b1a274e5" network for pod "test-k8s-infra-otel-agent-rrntf": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-rrntf_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
29m Normal SandboxChanged pod/test-k8s-infra-otel-agent-tlr4x Pod sandbox changed, it will be killed and re-created.
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/istio/proxyv2:1.11.8"
29m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-init
Jatin
05:37 AM
29m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Successfully pulled image "docker.io/istio/proxyv2:1.11.8" in 6.313893048s (6.313900888s including waiting)
29m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-init
29m Normal Pulling pod/test-k8s-infra-otel-agent-tlr4x Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
28m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-108-137.ap-south-1.compute.internal" not found
29m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
29m Normal Pulling pod/test-k8s-infra-otel-agent-rrntf Pulling image "docker.io/otel/opentelemetry-collector-contrib:0.79.0"
29m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-init
29m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-init
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-tlr4x Started container istio-proxy
28m Normal Created pod/test-k8s-infra-otel-agent-tlr4x Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-tlr4x Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 8.689811974s (8.689822774s including waiting)
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Successfully pulled image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" in 21.345092739s (21.34510044s including waiting)
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container test-k8s-infra-otel-agent
28m Normal Pulled pod/test-k8s-infra-otel-agent-rrntf Container image "docker.io/istio/proxyv2:1.11.8" already present on machine
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container test-k8s-infra-otel-agent
28m Normal Created pod/test-k8s-infra-otel-agent-rrntf Created container istio-proxy
28m Normal Started pod/test-k8s-infra-otel-agent-rrntf Started container istio-proxy
28m Warning Unhealthy pod/test-k8s-infra-otel-agent-rrntf Readiness probe failed: Get "http://10.221.106.154:15021/healthz/ready": dial tcp 10.221.106.154:15021: connect: connection refused
22m Warning NodeNotReady pod/test-k8s-infra-otel-agent-5b55w Node is not ready
21m Normal SuccessfulCreate replicaset/test-signoz-frontend-564577c7d8 Created pod: test-signoz-frontend-564577c7d8-zxtz5
21m Normal Scheduled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully assigned apm/test-signoz-frontend-564577c7d8-zxtz5 to ip-10-221-107-222.ap-south-1.compute.internal
21m Normal Pulling pod/test-signoz-frontend-564577c7d8-zxtz5 Pulling image "docker.io/busybox:1.35"
21m Normal Pulled pod/test-signoz-frontend-564577c7d8-zxtz5 Successfully pulled image "docker.io/busybox:1.35" in 3.987910995s (3.987919355s including waiting)
20m Normal Started pod/test-signoz-frontend-564577c7d8-zxtz5 Started container test-signoz-frontend-init
20m Normal Created pod/test-signoz-frontend-564577c7d8-zxtz5 Created container test-signoz-frontend-init
20m Warning FailedToUpdateEndpointSlices service/test-signoz-frontend Error updating Endpoint Slices for Service apm/test-signoz-frontend: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal SuccessfulCreate daemonset/test-k8s-infra-otel-agent Created pod: test-k8s-infra-otel-agent-89pmm
20m Normal Scheduled pod/test-k8s-infra-otel-agent-89pmm Successfully assigned apm/test-k8s-infra-otel-agent-89pmm to ip-10-221-107-37.ap-south-1.compute.internal
20m Warning FailedToUpdateEndpointSlices service/test-k8s-infra-otel-agent Error updating Endpoint Slices for Service apm/test-k8s-infra-otel-agent: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
19m Warning FailedScheduling pod/test-signoz-alertmanager-0 0/12 nodes are available: 4 node(s) had volume node affinity conflict, 8 Insufficient cpu.
20m Normal TaintManagerEviction pod/test-signoz-alertmanager-0 Cancelling deletion of Pod apm/test-signoz-alertmanager-0
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager Error updating Endpoint Slices for Service apm/test-signoz-alertmanager: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Warning FailedToUpdateEndpointSlices service/test-signoz-alertmanager-headless Error updating Endpoint Slices for Service apm/test-signoz-alertmanager-headless: node "ip-10-221-110-220.ap-south-1.compute.internal" not found
20m Normal TaintManagerEviction pod/test-signoz-frontend-564577c7d8-n5ww8 Cancelling deletion of Pod apm/test-signoz-frontend-564577c7d8-n5ww8
20m Normal SuccessfulCreate statefulset/test-signoz-alertmanager create Pod test-signoz-alertmanager-0 in StatefulSet test-signoz-alertmanager successful
20m Warning FailedCreatePodSandBox pod/test-k8s-infra-otel-agent-89pmm Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to set up pod "test-k8s-infra-otel-agent-89pmm_apm" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "7e94b9b0ba257ab51db4a7460b9249da8119829b9163f20a92888c593652b9cd" network for pod "test-k8s-infra-otel-agent-89pmm": networkPlugin cni failed to teardown pod "test-k8s-infra-otel-agent-89pmm_apm" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
19m Normal SandboxChanged pod/test-k8s-infra-otel-agent-89pmm Pod sandbox changed, it will be killed and re-created.
20m Normal NotTriggerScaleUp pod/test-signoz-alertmanager-0 pod didn't trigger scale-up: 1 max node group size reached, 1 node(s) had volume node affinity conflict
19m Normal Pulling pod/test-k8s-infra-otel-agent-89pmm Pulling image "docker.io/istio/proxyv2:1.11.8"
19m Normal TriggeredScaleUp pod/test-signoz-alertmanager-0 pod triggered scale-up: [{eks-ng-spot-3ac33d20-e8b5-4bc4-c587-b474d18bf00a 12->13 (max: 25)}]
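The FailedScheduling message above ("4 node(s) had volume node affinity conflict, 8 Insufficient cpu") is the usual signature of an EBS-backed PV pinned to an availability zone where no schedulable node is left. One way to check which zone and nodes each PV is bound to (PV name is a placeholder):

# Which PVC each PV backs, and the node/zone affinity it carries
kubectl get pv
kubectl describe pv <pv-name>
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'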
Jatin
05:39 AM
ZOOKEEPER LOGS

---------------------------

zookeeper 02:13:55.55
zookeeper 02:13:55.56 Welcome to the Bitnami zookeeper container
zookeeper 02:13:55.56 Subscribe to project updates by watching https://github.com/bitnami/containers
zookeeper 02:13:55.56 Submit issues and feature requests at https://github.com/bitnami/containers/issues
zookeeper 02:13:55.57
zookeeper 02:13:55.57 INFO ==> * Starting ZooKeeper setup *
zookeeper 02:13:55.62 WARN ==> You have set the environment variable ALLOW_ANONYMOUS_LOGIN=yes. For safety reasons, do not use this flag in a production environment.
zookeeper 02:13:55.64 INFO ==> Initializing ZooKeeper...
zookeeper 02:13:55.64 INFO ==> No injected configuration file found, creating default config files...
zookeeper 02:13:55.71 INFO ==> No additional servers were specified. ZooKeeper will run in standalone mode...

zookeeper 02:13:55.72 INFO ==> Deploying ZooKeeper with persisted data...
zookeeper 02:13:55.73 INFO ==> * ZooKeeper setup finished! *
zookeeper 02:13:55.75 INFO ==> * Starting ZooKeeper *
/opt/bitnami/java/bin/java
ZooKeeper JMX enabled by default
Using config: /opt/bitnami/zookeeper/bin/../conf/zoo.cfg
Removing file: Aug 6, 2023, 2:38:34 AM /bitnami/zookeeper/data/version-2/log.4f7
Removing file: Aug 7, 2023, 4:06:43 AM /bitnami/zookeeper/data/version-2/snapshot.4fb
Aug 10, 2023 (3 months ago)
Jatin
08:41 AM
Prashant, does this help? The overall issue is with the ClickHouse DB.
