#general

Issue with Helm Installation in GKE Autopilot Cluster

TLDR: After a Helm install of SigNoz on a GKE Autopilot cluster, Kalman's pods were stuck in the Init state and some were in CrashLoopBackOff. Mayur suggested describing the pods, checking IAM permissions and storage classes, and adjusting resource limits in the Helm values. ClickHouse turned out not to be running and was logging memory-limit errors, so Kalman planned to raise the resource requests. The thread is unresolved.

Oct 31, 2023 (1 month ago)
Kalman
10:40 AM
after helm install, pods are stuck in init state and some are in a crash loop:
NAME                                                READY   STATUS             RESTARTS      AGE
signoz-alertmanager-0                               0/1     Init:0/1           0             16m
signoz-frontend-8f8bfc6-j9cfh                       0/1     Init:0/1           0             16m
signoz-k8s-infra-otel-agent-drlv2                   0/1     CrashLoopBackOff   7 (47s ago)   16m
signoz-otel-collector-67949fc956-5tjx2              0/1     CrashLoopBackOff   6 (36s ago)   16m
signoz-query-service-0                              0/1     Pending            0             16m

followed instructions here: https://signoz.io/docs/install/kubernetes/gcp/#gke-autopilot
Mayur
10:51 AM
Could you describe the pods to see the error?
Mayur
10:52 AM
Are you deploying it in GKE?
Kalman
10:53 AM
yes, gke autopilot cluster
Mayur
10:53 AM
Just describe the pods for the error
Kalman
10:53 AM
which one?
Mayur
10:53 AM
The one that's crashing
Kalman
10:54 AM
ok
Kalman
10:54 AM
and what part of the describe output is interesting?
Mayur
10:54 AM
Does your cluster have the necessary IAM permissions for accessing the storage?
Mayur
10:54 AM
The last part, where the events are listed, is enough
Kalman
10:55 AM
ok. for example:
Events:
  Type     Reason     Age                From                                   Message
  ----     ------     ----               ----                                   -------
  Normal   Scheduled  20m                gke.io/optimize-utilization-scheduler  Successfully assigned signoz/signoz-k8s-infra-otel-agent-drlv2 to gk3-gke-europe-west6-pool-3-2b2246ae-6n2b
  Warning  Unhealthy  19m                kubelet                                Readiness probe failed: Get "": read tcp 10.0.65.129:47512->10.0.65.147:13133: read: connection reset by peer
  Warning  Unhealthy  19m                kubelet                                Liveness probe failed: Get "": read tcp 10.0.65.129:47500->10.0.65.147:13133: read: connection reset by peer
  Normal   Pulled     17m (x4 over 20m)  kubelet                                Container image "docker.io/otel/opentelemetry-collector-contrib:0.79.0" already present on machine
  Normal   Created    17m (x4 over 20m)  kubelet                                Created container signoz-k8s-infra-otel-agent
  Normal   Started    17m (x4 over 20m)  kubelet                                Started container signoz-k8s-infra-otel-agent
  Warning  Unhealthy  17m                kubelet                                Readiness probe failed: Get "": read tcp 10.0.65.129:33194->10.0.65.147:13133: read: connection reset by peer
  Warning  Unhealthy  17m                kubelet                                Liveness probe failed: Get "": read tcp 10.0.65.129:33186->10.0.65.147:13133: read: connection reset by peer
  Warning  BackOff    3s (x79 over 19m)  kubelet                                Back-off restarting failed container signoz-k8s-infra-otel-agent in pod signoz-k8s-infra-otel-agent-drlv2_signoz(0b38256e-d87e-4475-82b5-cc822da1eb7a

Kalman
10:56 AM
> necessary IAM permissions for accessing the storage?
how can i check this?
Mayur
10:56 AM
I don't see any error here. Any clues from the logs?
Kalman
10:57 AM
{"level":"error","timestamp":"2023-10-31T10:52:36.882Z","caller":"client/wsclient.go:170","msg":"Connection failed (dial tcp 10.0.31.235:4320: i/o timeout), will retry.","component":"opamp-server-client","stacktrace":"\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:170\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runOneCycle\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:202\ngithub.com/open-telemetry/opamp-go/client.(*wsClient).runUntilStopped\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/wsclient.go:265\ngithub.com/open-telemetry/opamp-go/client/internal.(*ClientCommon).StartConnectAndRun.func1\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/[email protected]/client/internal/clientcommon.go:197"}
Mayur
10:57 AM
I'm not familiar with GCP, so I don't know how. Maybe you can check with your admin
Kalman
10:57 AM
by storage you mean gcp storage classes?
Mayur
10:57 AM
Yes
Kalman
10:57 AM
should be ok, i have other apps installed that use storage and they work fine
Kalman
10:58 AM
i have a cockroachdb cluster with premium-rwo that runs fine

Kalman
10:59 AM
probably the root cause of the issue is that clickhouse is not running?
Kalman
10:59 AM
❯ kubectl get statefulset
NAME                                READY   AGE
chi-signoz-clickhouse-cluster-0-0   0/1     23m
signoz-alertmanager                 0/1     25m
signoz-query-service                0/1     25m
signoz-zookeeper                    1/1     25m
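A quick way to pick out the workloads that are not ready from output like the listing above is to compare the two halves of the READY column. The following is only a sketch: the function name not_ready is hypothetical, and it assumes the default kubectl table layout with the name in the first column and READY (ready/desired) in the second.

```python
def not_ready(kubectl_output: str) -> list[str]:
    """Return names whose READY column (e.g. '0/1') shows fewer ready than desired.

    Assumes the default `kubectl get` table layout: NAME first, READY second.
    """
    pending = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a row with a ready/desired pair
        ready, desired = fields[1].split("/", 1)
        if int(ready) < int(desired):
            pending.append(fields[0])
    return pending


# Sample input copied from the thread.
SAMPLE = """\
NAME                                READY   AGE
chi-signoz-clickhouse-cluster-0-0   0/1     23m
signoz-alertmanager                 0/1     25m
signoz-query-service                0/1     25m
signoz-zookeeper                    1/1     25m
"""

if __name__ == "__main__":
    for name in not_ready(SAMPLE):
        print(name)
```

On the sample, this lists the ClickHouse, alertmanager, and query-service StatefulSets, leaving only zookeeper as ready.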
Mayur
10:59 AM
oh yes, it should be running
Mayur
11:00 AM
Why isn't clickhouse running?
Kalman
11:00 AM
good question
Kalman
11:01 AM
2023.10.31 10:58:32.965776 [ 194 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {bec2cc52-3957-4964-a30a-4e8ee0cc582b::202310_1_95_19}: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 501.56 MiB (attempt to allocate chunk of 4582439 bytes), maximum: 460.80 MiB. OvercommitTracker decision: Memory overcommit isn't used. Waiting time or overcommit denominator are set to zero. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):
Kalman
11:01 AM
could be this?
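A rough check of the numbers in the ClickHouse log line above may help here. The sketch below assumes that ClickHouse derives its total memory cap from the RAM it detects multiplied by the max_server_memory_usage_to_ram_ratio setting, and that this setting is at its default of 0.9; neither is confirmed in the thread, and the helper names are hypothetical.

```python
# Back-of-the-envelope check of the MEMORY_LIMIT_EXCEEDED message.
# Assumption (not confirmed in the thread): the cap equals detected RAM
# times max_server_memory_usage_to_ram_ratio, left at its default of 0.9.
RAM_RATIO = 0.9


def implied_ram_mib(maximum_mib: float, ratio: float = RAM_RATIO) -> float:
    """RAM (MiB) ClickHouse would have to detect to arrive at the given cap."""
    return maximum_mib / ratio


def overshoot_mib(would_use_mib: float, maximum_mib: float) -> float:
    """How far the attempted usage exceeds the cap, in MiB."""
    return would_use_mib - maximum_mib


if __name__ == "__main__":
    # Values copied from the log line: "would use 501.56 MiB", "maximum: 460.80 MiB".
    print(round(implied_ram_mib(460.80), 2))        # 512.0
    print(round(overshoot_mib(501.56, 460.80), 2))  # 40.76
```

Under that assumption, a cap of 460.80 MiB would correspond to the ClickHouse container seeing about 512 MiB of memory, which is small for a merge workload and would fit the suspicion that the pod's memory allocation is too low.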
Mayur
11:02 AM
What's the memory of your cluster nodes?
Kalman
11:03 AM
it’s autopilot
Kalman
11:04 AM
❯ kubectl top nodes
NAME                                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gk3-gke-europe-west6-nap-bkvssbza-d20a0b49-vwj4   67m          1%     1853Mi          14%
gk3-gke-europe-west6-nap-qvub459m-12f48f1a-rp6g   148m         3%     2161Mi          16%
gk3-gke-europe-west6-nap-qvub459m-95c42732-nkgq   181m         4%     5119Mi          38%
gk3-gke-europe-west6-pool-3-2b2246ae-6n2b         202m         5%     3581Mi          27%
gk3-gke-europe-west6-pool-3-7b2a27f1-gk2p         215m         5%     2079Mi          15%
gk3-gke-europe-west6-pool-3-7b2a27f1-n6v4         201m         5%     2567Mi          19%
Kalman
11:04 AM
probably the resource requests for clickhouse in the helm values are not enough
Kalman
11:06 AM
but there is no limit set in the yaml, so it should be fine. i don’t know.
Mayur
11:06 AM
https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L157

Maybe try removing the limits or scale up your nodes
Mayur
11:06 AM
Limits are set, I have shared the link
Kalman
11:07 AM
limits are commented out, no?
Mayur
11:07 AM
Oh yeah, sorry
Kalman
11:09 AM
either way, i'll try setting higher resource requests and see.

Nocnica
03:25 PM
Thanks for tagging in Mayur, let me know if I can send you some SigNoz stickers!
