Issue with Helm Installation in GKE Autopilot Cluster
TLDR Kalman's Helm installation on a GKE Autopilot cluster left some pods stuck in the Init state and others crash-looping. Mayur suggested ways to diagnose the issue, including checking IAM permissions and storage classes and adjusting resource limits in the Helm values. The thread is unresolved.
Oct 31, 2023 (1 month ago)
Kalman
10:40 AM
NAME                                     READY   STATUS             RESTARTS      AGE
signoz-alertmanager-0                    0/1     Init:0/1           0             16m
signoz-frontend-8f8bfc6-j9cfh            0/1     Init:0/1           0             16m
signoz-k8s-infra-otel-agent-drlv2        0/1     CrashLoopBackOff   7 (47s ago)   16m
signoz-otel-collector-67949fc956-5tjx2   0/1     CrashLoopBackOff   6 (36s ago)   16m
signoz-query-service-0                   0/1     Pending            0             16m
followed instructions here: https://signoz.io/docs/install/kubernetes/gcp/#gke-autopilot
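A listing like the one above can be narrowed to the pods that are not fully ready. The following is a minimal sketch, not part of the original thread: `not_ready` is a hypothetical helper name, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
# Read `kubectl get pods` output on stdin and print the name and status
# of every pod whose READY column (ready/total) is not complete.
# Assumes the default column order: NAME READY STATUS ...
not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}

# Usage against a live cluster (the namespace "signoz" is taken from the
# pod events quoted further down in the thread):
#   kubectl -n signoz get pods | not_ready
```

Pods reported this way can then be inspected one at a time with `kubectl describe pod` and `kubectl logs`.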
Mayur
10:51 AM
Mayur
10:52 AM
Kalman
10:53 AM
Mayur
10:53 AM
Kalman
10:53 AM
Mayur
10:53 AM
Kalman
10:54 AM
Kalman
10:54 AM
Mayur
10:54 AM
Mayur
10:54 AM
Kalman
10:55 AM
Type     Reason     Age                  From                                   Message
----     ------     ----                 ----                                   -------
Normal   Scheduled  20m                  gke.io/optimize-utilization-scheduler  Successfully assigned signoz/signoz-k8s-infra-otel-agent-drlv2 to gk3-gke-europe-west6-pool-3-2b2246ae-6n2b
Warning  Unhealthy  19m                  kubelet                                Readiness probe failed: Get " ": read tcp 10.0.65.129:47512->10.0.65.147:13133: read: connection reset by peer
Warning  Unhealthy  19m                  kubelet                                Liveness probe failed: Get " ": read tcp 10.0.65.129:47500->10.0.65.147:13133: read: connection reset by peer
Normal   Pulled     17m (x4 over 20m)    kubelet                                Container image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" already present on machine
Normal   Created    17m (x4 over 20m)    kubelet                                Created container signoz-k8s-infra-otel-agent
Normal   Started    17m (x4 over 20m)    kubelet                                Started container signoz-k8s-infra-otel-agent
Warning  Unhealthy  17m                  kubelet                                Readiness probe failed: Get " ": read tcp 10.0.65.129:33194->10.0.65.147:13133: read: connection reset by peer
Warning  Unhealthy  17m                  kubelet                                Liveness probe failed: Get " ": read tcp 10.0.65.129:33186->10.0.65.147:13133: read: connection reset by peer
Warning  BackOff    3s (x79 over 19m)    kubelet                                Back-off restarting failed container signoz-k8s-infra-otel-agent in pod signoz-k8s-infra-otel-agent-drlv2_signoz(0b38256e-d87e-4475-82b5-cc822da1eb7a)
Kalman
10:56 AM
How can I check this?
Mayur
10:56 AM
Kalman
10:57 AM
{"level":"error","timestamp":"2023-10-31T10:52:36.882Z","caller":"client/wsclient.go:170","msg":"Connection failed (dial tcp 10.0.31.235:4320: i/o timeout), will retry.","component":"opamp-server-client","stacktrace":"\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:170\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:202\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:265\ngithub.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/internal/clientcommon.go:197"}
Mayur
10:57 AM
Kalman
10:57 AM
Mayur
10:57 AM
Kalman
10:57 AM
Kalman
10:58 AM
premium-rwo
runs fine
Kalman
10:59 AM
Kalman
10:59 AM
❯ kubectl get statefulset
NAME                                READY   AGE
chi-signoz-clickhouse-cluster-0-0   0/1     23m
signoz-alertmanager                 0/1     25m
signoz-query-service                0/1     25m
signoz-zookeeper                    1/1     25m
Mayur
10:59 AM
Mayur
11:00 AM
Kalman
11:00 AM
Kalman
11:01 AM
2023.10.31 10:58:32.965776 [ 194 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {bec2cc52-3957-4964-a30a-4e8ee0cc582b::202310_1_95_19}: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 501.56 MiB (attempt to allocate chunk of 4582439 bytes), maximum: 460.80 MiB. OvercommitTracker decision: Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):
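The `maximum: 460.80 MiB` in the error above is consistent with ClickHouse capping itself at 90% of a 512 MiB container, while the background merge wanted 501.56 MiB. The arithmetic can be checked with the sketch below. It is not from the thread: `detected_mib` is a hypothetical helper, it assumes the ClickHouse setting `max_server_memory_usage_to_ram_ratio` is at its default of 0.9, and the thread does not confirm that the pod actually has a 512Mi limit.

```shell
# Back out the memory ClickHouse detected from the "maximum" in the log line.
# $1 = maximum in MiB, $2 = ratio (assumed default 0.9 if omitted).
detected_mib() {
  awk -v max="$1" -v ratio="${2:-0.9}" 'BEGIN { printf "%.0f\n", max / ratio }'
}

detected_mib 460.80   # prints 512
```

If that inference holds, the ClickHouse container would need a memory limit comfortably above 512Mi for merges of this size to succeed.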
Kalman
11:01 AM
Mayur
11:02 AM
Kalman
11:03 AM
Kalman
11:04 AM
❯ kubectl top nodes
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gk3-gke-europe-west6-nap-bkvssbza-d20a0b49-vwj4   67m          1%     1853Mi          14%
gk3-gke-europe-west6-nap-qvub459m-12f48f1a-rp6g   148m         3%     2161Mi          16%
gk3-gke-europe-west6-nap-qvub459m-95c42732-nkgq   181m         4%     5119Mi          38%
gk3-gke-europe-west6-pool-3-2b2246ae-6n2b         202m         5%     3581Mi          27%
gk3-gke-europe-west6-pool-3-7b2a27f1-gk2p         215m         5%     2079Mi          15%
gk3-gke-europe-west6-pool-3-7b2a27f1-n6v4         201m         5%     2567Mi          19%
Kalman
11:04 AM
Kalman
11:06 AM
Mayur
11:06 AM
Maybe try removing the limits or scaling up your nodes
Mayur
11:06 AM
Kalman
11:07 AM
Mayur
11:07 AM
Kalman
11:09 AM
Nocnica
03:25 PM