TLDR Al has been experiencing dashboard loading issues since updating to `0.18.1`. Srikanth believes the issue is not version-related and suggests examining the queries, memory resources, and server distribution for improvements.
No, this issue should not be related to `0.18.1`. We have the client set up with the default number of connections, around 10 or 15. If you have long-running queries that don't complete in a reasonable time, other requests may time out. We could make this number of connections configurable, but that won't solve the issue entirely since, eventually, ClickHouse will throw a `TOO_MANY_SIMULTANEOUS_QUERIES` error. Can you help us understand your queries, the time range, and the amount of data you are querying?
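One way to check whether long-running queries are tying up the pooled connections is to look at what ClickHouse is executing while a dashboard is loading. A minimal diagnostic sketch using the standard `system.processes` table (the column selection and ordering are illustrative, not a query SigNoz ships):

```sql
-- Currently executing queries, longest-running first.
-- Entries that stay here for a long time are what hold the
-- query-service's ~10-15 pooled connections open.
SELECT
    query_id,
    elapsed,                                    -- seconds the query has been running
    formatReadableSize(memory_usage) AS memory, -- current memory used by the query
    substring(query, 1, 120) AS query_head
FROM system.processes
ORDER BY elapsed DESC;
```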
Regarding the following query, which fails with *Max query size exceeded* (the `name In [...]` list continues with 950 additional entries): where is this query invoked from?

```sql
SELECT quantile(0.99)(durationNano) as p99, avg(durationNano) as avgDuration, count(*) as numCalls
FROM signoz_traces.distributed_signoz_index_v2
WHERE serviceName = 'blah'
  AND name In ['Elasticsearch DELETE', 'Elasticsearch HEAD', 'Elasticsearch POST', 'Elasticsearch POST', ...]
```
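For context, *Max query size exceeded* is raised by ClickHouse's parser when the query text exceeds the `max_query_size` setting (256 KiB by default), which an IN list with ~950 entries can easily do. A hedged sketch of raising that limit for a session; the value is illustrative, and shortening the generated IN list is usually the better fix since this query is produced by the application, not typed by hand:

```sql
-- Illustrative only: raise the parser's query-text limit for this session
-- before running the large query. The default max_query_size is 262144 bytes.
SET max_query_size = 1048576;  -- 1 MiB

-- The long IN-list query above could then be parsed, e.g.:
-- SELECT quantile(0.99)(durationNano) AS p99, ...
-- FROM signoz_traces.distributed_signoz_index_v2
-- WHERE serviceName = 'blah' AND name IN [...];
```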
Srikanth
1. If I keep the date range to 1 hour, or even 1 day, performance seems OK.
2. The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.
3. The chi-signoz-clickhouse-cluster-0-0-0 PVC volume has 114G of data.
4. See attached table stats.

One dashboard that is performing poorly has 16 panels:
• 13 of the panels are metricsBuilder based, such as the attached screen capture.
• 2 panels are ClickHouse queries similar to:

```sql
SELECT
    fingerprint,
    max(value) AS value,
    toStartOfInterval(toDateTime(intDiv(timestamp_ms, 1000)), INTERVAL 60 SECOND) as ts,
    http_url,
    http_status_code
FROM
    signoz_metrics.distributed_samples_v2 GLOBAL
INNER JOIN (
    SELECT
        JSONExtractString(distributed_time_series_v2.labels, 'http_url') as http_url,
        JSONExtractString(distributed_time_series_v2.labels, 'http_status_code') as http_status_code,
        fingerprint
    FROM
        signoz_metrics.distributed_time_series_v2
    WHERE
        metric_name = 'httpcheck_status'
) as filtered_time_series USING fingerprint
WHERE
    metric_name = 'httpcheck_status'
    AND toDateTime(intDiv(timestamp_ms, 1000)) BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY
    http_url,
    http_status_code,
    fingerprint,
    ts
ORDER BY
    http_url,
    http_status_code,
    fingerprint,
    ts
```

• 1 panel has the following:

```sql
SELECT
    toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
    peerService AS peer_service,
    serviceName,
    httpCode,
    toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v2
WHERE stringTagMap['k8s.namespace.name'] = {{.namespace}}
    AND (peer_service != '')
    AND (httpCode != '')
    AND (httpCode NOT LIKE '2%%')
    AND timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
GROUP BY (peerService, serviceName, httpCode, interval)
ORDER BY (httpCode, interval) ASC
```
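For table stats like those in point 4, a quick way to see where the 114G on the PVC is going is to aggregate ClickHouse's standard `system.parts` table. A generic diagnostic sketch; `signoz_traces` and `signoz_metrics` appear in the queries above, while `signoz_logs` is assumed from the logs retention setting:

```sql
-- On-disk size and row counts per table, largest first (active parts only).
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
  AND database IN ('signoz_traces', 'signoz_metrics', 'signoz_logs')  -- signoz_logs assumed
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```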
> The problem occurs consistently if I extend the date range to 1 week, which surprises me given retention settings of Metrics: 7 days, Traces: 1 day, Logs: 1 day.

What are the memory resources given to ClickHouse? Loading a week of data and ordering it requires much more memory.
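To see how much memory the 1-week dashboard queries actually consume, ClickHouse records peak memory per query in the standard `system.query_log` table. A minimal sketch, assuming query logging is enabled (it is by default); the 1-day window and LIMIT are illustrative:

```sql
-- Heaviest recent queries by peak memory usage.
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS peak_memory,
    read_rows,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY memory_usage DESC
LIMIT 10;
```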
Srikanth

```
metadata.name: chi-signoz-clickhouse-cluster-0-0
resources.requests.cpu: '1'
resources.requests.memory: 6000Mi
```

Here is a week's worth of memory and CPU usage for ClickHouse.
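Given the 6000Mi request, it may also be worth comparing it against ClickHouse's per-query memory limits. A hedged sketch of checking the relevant standard settings (whether SigNoz overrides any of them is not shown in this thread):

```sql
-- Per-query memory cap and spill-to-disk thresholds for the current settings profile.
SELECT name, value, changed
FROM system.settings
WHERE name IN (
    'max_memory_usage',                    -- per-query RAM cap
    'max_bytes_before_external_group_by',  -- spill GROUP BY to disk above this
    'max_bytes_before_external_sort'       -- spill ORDER BY to disk above this
);
```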
Could you load the dashboard once again and share the logs of the query-service when this issue occurs again?
Srikanth Logs attached. Thanks
I'm experimenting with
Srikanth Any updates here?
Thank you for sharing the logs. I haven't gotten around to looking at this properly yet; I will get back to you on it soon.
Thank you. This affects many aspects of the SigNoz front end, including:
• Dashboard panels fail to load.
• Dashboard variables fail to load (it would be good if panels did not load until the variables have been selected and loaded).
• The trace 'tags filter' often fails to load.
Between this and the log filtering, it's hard to use SigNoz.
```
NAME                                                CPU(cores)   MEMORY(bytes)
chi-signoz-clickhouse-cluster-0-0-0                 1177m        5825Mi
signoz-otel-collector-b87bf5d54-qpsx9               2878m        966Mi
signoz-otel-collector-metrics-7bdb76c7fd-fjs6g      842m         1320Mi
signoz-alertmanager-0                               2m           23Mi
signoz-clickhouse-operator-6dd75c99f8-wz4sf         2m           52Mi
signoz-frontend-595d64465b-qf777                    1m           11Mi
signoz-k8s-infra-otel-agent-dr4sl                   42m          126Mi
signoz-k8s-infra-otel-deployment-7d4857ff7c-h2q6n   2m           66Mi
signoz-query-service-0                              10m          145Mi
signoz-zookeeper-0                                  5m           390Mi
```
Hi Srikanth, I have all of the above pods running on a single node. I tried adding a second node but ran into trouble with connections being refused between pods running on different nodes. What can I safely split onto separate nodes, and still have everything function, in order to improve the performance of the SigNoz UI?
I have reviewed
Al
Tue, 18 Apr 2023 22:21:53 UTC
Hi everyone, since updating to `0.18.1` I have noticed that dashboards are consistently failing to load with:

```
main.8c36b6666fd0bcae92f0.js:2 Error: API responded with 400 - encountered multiple errors: error in query-A: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-B: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
error in query-C: clickhouse: acquire conn timeout. you can increase the number of max open conn or the dial timeout
    at main.8c36b6666fd0bcae92f0.js:2:1724057
    at u (main.8c36b6666fd0bcae92f0.js:2:1715729)
    at Generator.<anonymous> (main.8c36b6666fd0bcae92f0.js:2:1717066)
    at Generator.next (main.8c36b6666fd0bcae92f0.js:2:1716092)
    at b (main.8c36b6666fd0bcae92f0.js:2:1721719)
    at a (main.8c36b6666fd0bcae92f0.js:2:1721922)
<snip>
```

Are there any known issues with `0.18.1` that would explain this? I've included a screen capture of usage stats. My current retention settings are Metrics: 7 days, Traces: 1 day, Logs: 1 day, until I improve performance.
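The `acquire conn timeout` message comes from the query-service's Go client connection pool rather than from ClickHouse itself, so one quick check while a dashboard is failing is how many queries and connections the ClickHouse server is actually handling at that moment. A small diagnostic sketch against the standard `system.metrics` table (the metric names are standard ClickHouse; the selection is only illustrative):

```sql
-- Snapshot of currently running queries and open connections on the server.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('Query', 'TCPConnection', 'HTTPConnection');
```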