SigNoz crashing in k8s due to ClickHouse OOM
TL;DR: Travis reported SigNoz crashing in Kubernetes due to ClickHouse running out of memory (OOM). The team suggested increasing ClickHouse's resources, among other troubleshooting steps, but the issue remains unresolved.
Mar 27, 2023 (8 months ago)
Travis 05:55 PM
Ankit 06:10 PM
Ankit 06:11 PM
Travis 07:16 PM
signoz-otel-collector-init wget: can't connect to remote host (172.20.64.8): Connection refused
signoz-otel-collector-init waiting for clickhouseDB
stream logs failed container "signoz-otel-collector" in pod "signoz-otel-collector-76dd66c56c-98nk5" is waiting to start: PodInitializing for signoz/signoz-otel-collector-76dd66c56c-98nk5 (signoz-otel-collector)
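The init container here just loops until ClickHouse's HTTP interface answers. A quick reachability check from inside the cluster (a sketch; the signoz-clickhouse service name and signoz namespace are assumed from chart defaults):

kubectl get svc -n signoz | grep clickhouse
kubectl run ping-test -n signoz --rm -it --image=busybox --restart=Never -- \
  wget -qO- http://signoz-clickhouse:8123/ping
# a healthy ClickHouse replies "Ok."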
Ankit 07:17 PM
Travis 10:04 PM
Travis 10:15 PM
signoz-otel-collector pod?
Travis 11:14 PM
Travis 11:42 PM
fwiw, here's some logs i found in the signoz-otel-collector pods.
signoz-otel-collector 2023-03-27T23:40:17.465Z error exporterhelper/queued_retry.go:310 Dropping data because sending_queue is full. Try increasing queue_size. {"kind": "exporter", "data_type": "lo
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/queued_retry.go:310
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/[email protected]/exporter/exporterhelper/logs.go:114
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/collector/[email protected]/logs.go:36
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:339
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:176
signoz-otel-collector /go/pkg/mod/go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:144
signoz-otel-collector 2023-03-27T23:40:17.465Z warn [email protected]/batch_processor.go:178 Sender failed {"kind": "processor", "name": "batch", "pipeline": "logs", "error": "sending_queue is
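The "sending_queue is full" error comes from the collector's exporterhelper: the buffer in front of the ClickHouse exporter overflowed because ClickHouse itself was refusing connections, so raising queue_size only buys time while the backend is down. As a sketch, assuming the logs exporter exposes the standard exporterhelper settings (the exporter name below may differ in your pipeline config):

exporters:
  clickhouselogsexporter:
    sending_queue:
      enabled: true
      queue_size: 5000   # raise from the default to absorb longer backend outages
    retry_on_failure:
      enabled: true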
Mar 28, 2023 (8 months ago)
Travis 01:02 AM
but i'm still seeing the same issue.
Travis 10:15 PM
Mar 29, 2023 (8 months ago)
Ankit 04:06 AM
Srikanth 04:37 AM
sent_log_records and failed_log_records?
Travis 03:23 PM
Srikanth 03:26 PM
Srikanth 03:38 PM
Can you plot the SUM_RATE of accepted_log_records and the SUM_RATE of sent_log_records in different panels and share the result screenshots?
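accepted_log_records and sent_log_records come from the collector's own telemetry (the otelcol_processor_*/otelcol_exporter_* Prometheus counters). Besides charting them in SigNoz, they can be spot-checked directly; a sketch assuming the collector's default self-metrics port 8888:

kubectl port-forward -n signoz deploy/signoz-otel-collector 8888:8888 &
curl -s http://localhost:8888/metrics | grep log_records
# compare otelcol_processor_accepted_log_records vs otelcol_exporter_sent_log_records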
Travis 04:02 PM
signoz-query-service-init waiting for clickhouseDB
Srikanth 04:09 PM
Travis 04:12 PM
chi-signoz-clickhouse-cluster-0-0-0.
Travis 04:14 PM
clickhouse 2023.03.29 16:12:03.236724 [ 7 ] {} <Information> Application: Setting max_server_memory_usage was set to 3.60 GiB (4.00 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)
clickhouse 2023.03.29 16:12:03.248365 [ 7 ] {} <Information> CertificateReloader: One of paths is empty. Cannot apply new configuration for certificates. Fill all paths and try again.
clickhouse 2023.03.29 16:12:03.278497 [ 7 ] {} <Information> Application: Uncompressed cache policy name
clickhouse 2023.03.29 16:12:03.278524 [ 7 ] {} <Information> Application: Uncompressed cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.279636 [ 7 ] {} <Information> Context: Initialized background executor for merges and mutations with num_threads=16, num_tasks=32
clickhouse 2023.03.29 16:12:03.279972 [ 7 ] {} <Information> Context: Initialized background executor for move operations with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280512 [ 7 ] {} <Information> Context: Initialized background executor for fetches with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.280890 [ 7 ] {} <Information> Context: Initialized background executor for common operations (e.g. clearing old parts) with num_threads=8, num_tasks=8
clickhouse 2023.03.29 16:12:03.281002 [ 7 ] {} <Information> Application: Mark cache size was lowered to 2.00 GiB because the system has low amount of memory
clickhouse 2023.03.29 16:12:03.281075 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.282445 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:12:03.310147 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:12:03.310171 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.012396625 sec
clickhouse 2023.03.29 16:12:03.310199 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:12:18.565650 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 16:13:21.737596 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 16:13:31.576439 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
signoz-clickhouse-init + chmod +x /var/lib/clickhouse/user_scripts/histogramQuantile
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (clickhouse)
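Note the first line: ClickHouse derived max_server_memory_usage from "4.00 GiB available", i.e. the container was capped at roughly 4Gi at this point. The effective cgroup limit can be cross-checked from inside the pod (a sketch; the path differs between cgroup v1 and v2):

kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -c clickhouse -- \
  sh -c 'cat /sys/fs/cgroup/memory.max 2>/dev/null || cat /sys/fs/cgroup/memory/memory.limit_in_bytes'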
Travis 04:34 PM
Srikanth 04:37 PM
Prashant 04:44 PM
https://github.com/SigNoz/charts/blob/main/charts/signoz/values.yaml#L161-L167
Travis 04:47 PM
https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1679944653303509?thread_ts=1679939741.949799&cid=C01HWQ1R0BC
Travis 04:52 PM
chi-signoz-clickhouse-cluster-0-0-0, i see that it eventually crashes.
clickhouse 2023.03.29 16:50:17.822883 [ 7 ] {} <Information> Application: Loading user defined objects from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.823295 [ 7 ] {} <Information> Application: Loading metadata from /var/lib/clickhouse/
clickhouse 2023.03.29 16:50:17.831628 [ 7 ] {} <Information> DatabaseAtomic (system): Metadata processed, database system has 6 tables and 0 dictionaries in total.
clickhouse 2023.03.29 16:50:17.831656 [ 7 ] {} <Information> TablesLoader: Parsed metadata of 6 tables in 1 databases in 0.003232883 sec
clickhouse 2023.03.29 16:50:17.831689 [ 7 ] {} <Information> TablesLoader: Loading 6 tables with 0 dependency level
clickhouse 2023.03.29 16:50:31.297963 [ 59 ] {} <Information> TablesLoader: 16.666666666666668%
Stream closed EOF for signoz/chi-signoz-clickhouse-cluster-0-0-0 (signoz-clickhouse-init)
clickhouse 2023.03.29 16:51:24.583409 [ 59 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 16:51:40.424811 [ 58 ] {} <Information> TablesLoader: 66.66666666666667%
clickhouse 2023.03.29 16:51:46.080805 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
Travis 04:53 PM
Travis 04:53 PM
Travis 05:11 PM
Application: Received termination signal (Terminated)
clickhouse 2023.03.29 17:09:07.664915 [ 58 ] {} <Information> TablesLoader: 16.666666666666668%
clickhouse 2023.03.29 17:09:45.751947 [ 58 ] {} <Information> TablesLoader: 33.333333333333336%
clickhouse 2023.03.29 17:10:11.747418 [ 58 ] {} <Information> TablesLoader: 50%
clickhouse 2023.03.29 17:10:15.038757 [ 8 ] {} <Information> Application: Received termination signal (Terminated)
clickhouse 2023.03.29 17:10:22.257197 [ 60 ] {} <Information> TablesLoader: 66.66666666666667%
Prashant 05:14 PM
Travis 05:16 PM
Travis 05:16 PM
Prashant 05:17 PM
Prashant 05:17 PM
kubectl describe on the CHI pod?
Prashant 05:18 PM
Prashant 05:18 PM
Travis 05:18 PM
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  4m35s                   default-scheduler  Successfully assigned signoz/chi-signoz-clickhouse-cluster-0-0-0 to ip-10-0-3-214.us-west-2.compute.internal
  Normal   Pulled     4m34s                   kubelet            Container image "" already present on machine
  Normal   Created    4m34s                   kubelet            Created container signoz-clickhouse-init
  Normal   Started    4m34s                   kubelet            Started container signoz-clickhouse-init
  Normal   Pulled     4m33s                   kubelet            Container image "" already present on machine
  Normal   Created    4m33s                   kubelet            Created container clickhouse
  Normal   Started    4m33s                   kubelet            Started container clickhouse
  Warning  Unhealthy  3m31s (x18 over 4m22s)  kubelet            Readiness probe failed: Get "": dial tcp 10.0.3.105:8123: connect: connection refused
  Warning  Unhealthy  3m31s                   kubelet            Liveness probe failed: Get "": dial tcp 10.0.3.105:8123: connect: connection refused
Travis 05:20 PM
State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Wed, 29 Mar 2023 10:13:25 -0700
  Finished:     Wed, 29 Mar 2023 10:13:26 -0700
Ready:          True
Prashant 05:26 PM
Travis 05:28 PM
chi-signoz-clickhouse-cluster-0-0-0 pod
Travis 05:30 PM
Travis 05:30 PM
Containers:
  clickhouse:
    Container ID:  1
    Image:
    Image ID:
    Ports:         8123/TCP, 9000/TCP, 9009/TCP, 9000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -c
      /usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 29 Mar 2023 10:25:26 -0700
      Finished:     Wed, 29 Mar 2023 10:27:25 -0700
    Ready:          False
    Restart Count:  6
    Requests:
      cpu:     4
      memory:  8Gi
    Liveness:   http-get http://:http/ping delay=60s timeout=1s period=3s #success=1 #failure=10
    Readiness:  http-get http://:http/ping delay=10s timeout=1s period=3s #success=1 #failure=3
Prashant 05:31 PM
Exit Code: 137
This confirms OOM.
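Exit code 137 is 128 + 9, i.e. the container died from SIGKILL, which is what the kernel OOM killer (or the kubelet enforcing a memory limit) delivers. Kubernetes usually records the reason directly; a quick check as a sketch:

kubectl get pod -n signoz chi-signoz-clickhouse-cluster-0-0-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="clickhouse")].lastState.terminated.reason}'
# "OOMKilled" for a kernel OOM kill; "Error" (as in the describe output above)
# can also appear when the kubelet killed the container after failed liveness probes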
Prashant 05:32 PM
You can increase the resource requests of clickhouse and test it out.
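For example, a minimal override (a sketch assuming a Helm release named signoz in the signoz namespace; the clickhouse.resources values path matches the values.yaml linked above):

# override-values.yaml
clickhouse:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 16Gi

# then apply it:
helm upgrade signoz signoz/signoz -n signoz -f override-values.yaml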
Travis 05:35 PM
Prashant 05:36 PM
Prashant 05:36 PM
Travis 05:36 PM
clickhouse.resources.requests.memory
Travis 05:37 PM
resources:
  requests:
    cpu: '4'
    memory: 16Gi
Prashant 05:37 PM
Travis 05:39 PM
Travis 05:40 PM
signoz-k8s-infra-otel-agent configmap, yeah?
receivers:
  filelog/k8s:
    exclude:
      - /var/log/pods/kube-system_*.log
      - /var/log/pods/*_hotrod*_*/*/*.log
      - /var/log/pods/*_locust*_*/*/*.log
    include:
      - /var/log/pods/*/*/*.log
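Trimming noisy sources at the agent also reduces pressure on the collector queue and on ClickHouse. The globs follow the kubelet's /var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log layout, so excluding another namespace (a hypothetical "loadtest" namespace here) is one more entry in the same list:

receivers:
  filelog/k8s:
    exclude:
      - /var/log/pods/kube-system_*.log
      - /var/log/pods/loadtest_*/*/*.log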
Travis 05:45 PM
Prashant 06:22 PM
Prashant 06:23 PM
resources:
  requests:
    cpu: '1'
    memory: 4Gi
  limits:
    cpu: '4'
    memory: 16Gi
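Worth noting: requests only affect scheduling, while the OOM-kill threshold comes from the limit (or the node's capacity when no limit is set), and the limit is typically what ClickHouse sees as "available" memory. After upgrading, the applied values can be verified with:

kubectl get pod -n signoz chi-signoz-clickhouse-cluster-0-0-0 \
  -o jsonpath='{.spec.containers[?(@.name=="clickhouse")].resources}'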
Prashant 06:26 PM
Travis 06:31 PM
for reference, since clickhouse is working, here's what the logs look like, if that "rows/sec" metric is meaningful to you.
clickhouse 2023.03.29 18:30:07.892732 [ 216 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.395357589 sec., 58683 rows/sec., 458.46 KiB/sec.
clickhouse 2023.03.29 18:30:10.799741 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 5.918852143 sec., 63407 rows/sec., 495.37 KiB/sec.
clickhouse 2023.03.29 18:30:13.988077 [ 11 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.059014593 sec., 61940 rows/sec., 483.91 KiB/sec.
clickhouse 2023.03.29 18:30:14.038654 [ 10 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 6.09963416 sec., 61528 rows/sec., 480.69 KiB/sec.
clickhouse 2023.03.29 18:30:18.055179 [ 229 ] <Information> executeQuery: Read 5 rows, 282.00 B in 26.759150599 sec., 0 rows/sec., 10.54 B/sec.
clickhouse 2023.03.29 18:30:18.079163 [ 235 ] <Information> executeQuery: Read 375301 rows, 2.86 MiB in 7.233857096 sec., 51881 rows/sec., 405.32 KiB/sec
Travis 06:39 PM
i need to set requests.memory higher it seems
Mar 30, 2023 (8 months ago)
Travis 04:03 AM
Travis 04:04 AM
Travis 04:05 AM
but clickhouse client command doesn't work.
$ kubectl exec -n signoz -it chi-signoz-clickhouse-cluster-0-0-0 -- sh
Defaulted container "clickhouse" out of: clickhouse, signoz-clickhouse-init (init)
/ $ clickhouse client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
Srikanth 06:20 AM
> but clickhouse client command doesn't work.
Try clickhouse-client; ideally, both should work. Make sure you are exec'ing into clickhouse-cluster, not the clickhouse-operator.
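A quick way to confirm the exec target (the cluster pod name matches the one seen earlier in this thread; the operator pod name is an assumption):

kubectl get pods -n signoz | grep clickhouse
# chi-signoz-clickhouse-cluster-0-0-0   <- the ClickHouse server pod, exec into this one
# signoz-clickhouse-operator-...        <- the operator, not the database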
Travis 03:20 PM
<<K9s-Shell>> Pod: signoz/chi-signoz-clickhouse-cluster-0-0-0 | Container: clickhouse
bash-5.1$ clickhouse-client
ClickHouse client version 22.8.8.3 (official build).
Connecting to localhost:9000 as user default.
Code: 210. DB::NetException: Connection refused (localhost:9000). (NETWORK_ERROR)
Travis 03:21 PM
Srikanth 03:35 PM
Srikanth 03:36 PM
Travis 03:51 PM
Travis 03:54 PM
/var/lib/clickhouse 👍
Travis 03:55 PM
/var/lib/clickhouse/data is only 160kb.
Travis 04:23 PM
i can't tell how big /var/lib/clickhouse/store is, because the pod OOMs before du has time to return any info to me and i lose my shell.
Travis 04:24 PM
can i just delete the /var/lib/clickhouse/store dir altogether?
Travis 04:29 PM
Srikanth 04:30 PM
/store contains the part files, but I don't know what else goes in there? Can you delete the whole PV data just to be safe and not leave it in any corrupt state?
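A sketch of wiping the volume entirely, assuming the operator's usual <volume-claim-template>-<pod-name> PVC naming (verify with the first command, and note this permanently deletes all stored telemetry):

kubectl get pvc -n signoz
kubectl delete pvc -n signoz data-volumeclaim-template-chi-signoz-clickhouse-cluster-0-0-0 --wait=false
# pvc-protection keeps the volume until the pod releases it, so delete the pod too;
# the operator then recreates it with a fresh volume
kubectl delete pod -n signoz chi-signoz-clickhouse-cluster-0-0-0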
Travis 04:51 PM
the whole /var/lib/clickhouse/ dir?
Travis 04:51 PM
Srikanth 04:52 PM
Travis 04:53 PM