Dashboard Load Issues and Possible Solutions

TLDR Al has experienced dashboard loading issues since updating to `0.18.1`. Srikanth believes the issue is not version-related and suggests examining the queries, ClickHouse memory resources, and the distribution of pods across nodes for improvements.

Al
Tue, 18 Apr 2023 22:21:53 UTC

Hi everyone, since updating to `0.18.1` I have noticed that dashboards are consistently failing to load with:

```
main.8c36b6666fd0bcae92f0.js:2 Error: API responded with 400 - encountered multiple errors:
error in query-A: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-B: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-C: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
    at main.8c36b6666fd0bcae92f0.js:2:1724057
    at u (main.8c36b6666fd0bcae92f0.js:2:1715729)
    at Generator.<anonymous> (main.8c36b6666fd0bcae92f0.js:2:1717066)
    at Generator.next (main.8c36b6666fd0bcae92f0.js:2:1716092)
    at b (main.8c36b6666fd0bcae92f0.js:2:1721719)
    at a (main.8c36b6666fd0bcae92f0.js:2:1721922)
<snip>
```

Are there any known issues with `0.18.1` that would explain this? I've included a screen capture of usage stats. My current retention settings are Metrics: 7 days, Traces: 1 day, Logs: 1 day, until I improve performance.
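A hedged diagnostic sketch (not from the thread, and assuming direct `clickhouse-client` access to the chi-signoz-clickhouse-cluster pod): since the error points at the query-service's ClickHouse connection pool being exhausted, one way to see whether queries are piling up server-side is to inspect `system.processes` while a dashboard is loading.

```sql
-- Hedged diagnostic: list queries currently executing on ClickHouse while a
-- dashboard loads, to see whether long-running panel queries are holding the
-- client's connections open (which would surface as "acquire conn timeout").
SELECT
    query_id,
    user,
    elapsed,                                    -- seconds this query has been running
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 120) AS query_head
FROM system.processes
ORDER BY elapsed DESC;
```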

Srikanth
Wed, 19 Apr 2023 01:07:42 UTC

No, this issue should not be related to `0.18.1`. We have the client set up with the default number of connections, around 10 or 15. If you have long-running queries that don’t complete in a reasonable time, other requests may time out. We could make this number of connections configurable, but that won’t solve the issue entirely since, eventually, ClickHouse will throw a `TOO_MANY_SIMULTANEOUS_QUERIES` error. Can you help us understand your queries and the time range and amount of data you are querying?
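A hedged follow-up sketch along the same lines (assuming query logging is enabled, which it is by default): `system.query_log` records finished queries with their duration and peak memory, which can help identify which dashboard panels keep connections busy long enough to starve the pool. The one-hour window and `LIMIT 20` are arbitrary example values.

```sql
-- Hedged sketch: the slowest recent queries, as candidates for the
-- long-running panels that tie up connections and can eventually lead to
-- TOO_MANY_SIMULTANEOUS_QUERIES on the server.
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS peak_memory,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 20;
```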

Al
Wed, 19 Apr 2023 19:22:13 UTC

Regarding the following query:

```sql
SELECT quantile(0.99)(durationNano) as p99, avg(durationNano) as avgDuration, count(*) as numCalls
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'blah'
  AND name In ['Elasticsearch DELETE', 'Elasticsearch HEAD', 'Elasticsearch POST', 'Elasticsearch POST
```

This `name In [...]` list has 950 additional entries, and the query fails with *Max query size exceeded*. Where is this query invoked from?
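A hedged note on the error itself: *Max query size exceeded* is governed by the ClickHouse setting `max_query_size`, which caps the length of query text the parser will accept (256 KiB by default), and a roughly 950-entry `IN` list can exceed that. The sketch below checks and raises it for an interactive `clickhouse-client` session; whether the SigNoz query-service exposes a way to pass this setting is an assumption, not something confirmed in this thread, and it does not answer where the query is generated.

```sql
-- Hedged sketch: inspect the parser limit that triggers "Max query size exceeded".
SELECT name, value
FROM system.settings
WHERE name = 'max_query_size';

-- Raise it for the current session only. 1 MiB is an illustrative value,
-- not a recommendation for this deployment.
SET max_query_size = 1048576;
```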

Al
Wed, 19 Apr 2023 21:30:33 UTC

Srikanth
1. If I keep the date range to 1 hour, or even 1 day, performance seems OK.
2. The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
3. The chi-signoz-clickhouse-cluster-0-0-0 PVC volume has 114G of data.
4. See attached table stats.

One dashboard that is performing poorly has 16 panels.
• 13 of the panels are metricsBuilder based, such as the attached screen capture.
• 2 panels are ClickHouse queries similar to:

```sql
SELECT
    fingerprint,
    max(value) AS value,
    toStartOfInterval(toDateTime(intDiv(timestamp_ms, 1000)), INTERVAL 60 SECOND) as ts,
    http_url,
    http_status_code
FROM
    signoz_metrics.distributed_samples_v2
GLOBAL INNER JOIN (
    SELECT
        JSONExtractString(distributed_time_series_v2.labels, 'http_url') as http_url,
        JSONExtractString(distributed_time_series_v2.labels, 'http_status_code') as http_status_code,
        fingerprint
    FROM
        signoz_metrics.distributed_time_series_v2
    WHERE
        metric_name = 'httpcheck_status'
) as filtered_time_series USING fingerprint
WHERE
    metric_name = 'httpcheck_status'
    AND toDateTime(intDiv(timestamp_ms, 1000)) BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY
    http_url,
    http_status_code,
    fingerprint,
    ts
ORDER BY
    http_url,
    http_status_code,
    fingerprint,
    ts
```

• 1 panel has the following:

```sql
SELECT
    toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
    peerService AS peer_service,
    serviceName,
    httpCode,
    toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v2
WHERE stringTagMap['k8s.namespace.name'] = {{.namespace}}
    AND (peer_service != '')
    AND (httpCode != '')
    AND (httpCode NOT LIKE '2%%')
    AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY (peerService, serviceName, httpCode, interval)
ORDER BY (httpCode, interval) ASC
```
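A hedged sketch that may help relate the 1-week date range to the ~114G on the PVC: `system.parts` reports how much data each table holds on disk, which shows what the panel queries above actually have to scan. The database names are the ones appearing in those queries; everything else is standard ClickHouse.

```sql
-- Hedged sketch: on-disk size and row counts per SigNoz table, to gauge how
-- much data a 1-week dashboard range has to read.
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
  AND database IN ('signoz_metrics', 'signoz_traces')
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```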

Srikanth
Thu, 20 Apr 2023 01:11:29 UTC

> The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.

What are the memory resources given to ClickHouse? Loading one week of data and ordering it requires a lot of memory.
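A hedged sketch related to the memory question: these per-query settings bound how much RAM a single GROUP BY / ORDER BY may use and whether it is allowed to spill to disk, which matters when ordering a week of data on a modest pod. Checking them is safe; any values one might set are deployment-specific and not recommended here.

```sql
-- Hedged sketch: memory-related query settings that govern how much RAM a
-- single GROUP BY / ORDER BY may use and whether it can spill to disk.
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage',
               'max_bytes_before_external_group_by',
               'max_bytes_before_external_sort');
```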

Al
Thu, 20 Apr 2023 13:47:23 UTC

Srikanth

```
metadata.name: chi-signoz-clickhouse-cluster-0-0
resources.requests.cpu: '1'
resources.requests.memory: 6000Mi
```

Here is a week's worth of memory and CPU usage for ClickHouse.
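A hedged sketch to compare the 6000Mi request against ClickHouse's own view of its memory: the `MemoryTracking` entry in `system.metrics` is a point-in-time reading from the server itself, not a substitute for the pod-level graphs attached above.

```sql
-- Hedged sketch: how much memory ClickHouse itself is currently tracking,
-- to compare against the pod's 6000Mi request.
SELECT metric, formatReadableSize(value) AS current, description
FROM system.metrics
WHERE metric = 'MemoryTracking';
```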

Srikanth
Fri, 21 Apr 2023 08:15:41 UTC

Could you load the dashboard once again and share the logs of the query-service when this issue occurs again?

Al
Sat, 22 Apr 2023 13:44:17 UTC

Al
Sat, 22 Apr 2023 13:44:40 UTC

Srikanth Logs attached. Thanks

Al
Sat, 22 Apr 2023 23:40:59 UTC

I'm experimenting with to see if this helps.

Al
Tue, 25 Apr 2023 17:24:58 UTC

Srikanth Any updates here?

Srikanth
Tue, 25 Apr 2023 17:26:11 UTC

Thank you for sharing the logs. I haven’t gotten around to looking at this properly yet; I will get back to you on this soon.

Al
Tue, 25 Apr 2023 17:32:05 UTC

Thank you. This affects many aspects of the SigNoz front end, including:
• Dashboard panels fail to load,
• Dashboard variables fail to load (it would be good if panels did not load until the variables have been selected and loaded),
• The trace 'tags filter' often fails to load.
Between this and the log filtering, it's hard to use SigNoz.

Al
Fri, 28 Apr 2023 22:29:26 UTC

```
NAME                                                CPU(cores)   MEMORY(bytes)
chi-signoz-clickhouse-cluster-0-0-0                 1177m        5825Mi
signoz-otel-collector-b87bf5d54-qpsx9               2878m        966Mi
signoz-otel-collector-metrics-7bdb76c7fd-fjs6g      842m         1320Mi
signoz-alertmanager-0                               2m           23Mi
signoz-clickhouse-operator-6dd75c99f8-wz4sf         2m           52Mi
signoz-frontend-595d64465b-qf777                    1m           11Mi
signoz-k8s-infra-otel-agent-dr4sl                   42m          126Mi
signoz-k8s-infra-otel-deployment-7d4857ff7c-h2q6n   2m           66Mi
signoz-query-service-0                              10m          145Mi
signoz-zookeeper-0                                  5m           390Mi
```

Hi Srikanth, I have all of the above pods running on a single node. I tried adding a second node but ran into trouble with connections being refused between pods running on different nodes. What can I safely divide onto separate nodes, and still have everything function, in order to improve the performance of the SigNoz UI?

Al
Wed, 03 May 2023 20:54:40 UTC

I have reviewed I have also reviewed I did enable the probabilistic sampling processor, but found that traces and logs for unique deployments were not available. Would the following configuration make sense?

*Node 1*
• chi-signoz-clickhouse-cluster
• signoz-alertmanager
• signoz-clickhouse-operator
• signoz-frontend
• signoz-query-service
• signoz-zookeeper

*Node 2*
• signoz-otel-collector
• signoz-otel-collector-metrics