CPU Imbalance in a Clustered ClickHouse Setup with SigNoz

TLDR Nick experienced CPU imbalance in a 3-node ClickHouse cluster backing SigNoz. Srikanth provided insights on the schema and ClickHouse operations, and the issue resolved itself after 2 weeks once the imbalanced data expired under the retention policy.

Nick
Wed, 29 Mar 2023 13:32:28 UTC

Since switching to clustered ClickHouse with 3 nodes, I'm finding that CPU load is imbalanced across the 3 clickhouse-server instances. The system is fairly small (240 total pods, with all SigNoz log traffic filtered) and receives 3,700 log lines per minute. I have one ClickHouse node permanently running at 400% CPU, another running between 100% and 300% CPU (depending on time of day), and another at a consistent 100% CPU. The graph here covers a 24-hour period.

CPU utilisation seems very high for a database that I would consider to be receiving little in the way of updates. I suspect a missing index somewhere. Is there a description anywhere of what the ideal schema should look like for a SigNoz install, covering all the tables and views?
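A starting point for this kind of investigation is ClickHouse's own system tables, which show what the busy node is actually executing. A minimal sketch, run on the hot node; exact columns vary slightly between ClickHouse versions:

```sql
-- Queries currently executing on this node, longest-running first.
SELECT query, elapsed, read_rows, memory_usage
FROM system.processes
ORDER BY elapsed DESC
LIMIT 10;

-- Finished queries from the last hour, ranked by duration.
SELECT query_duration_ms, read_rows, query
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
```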

Srikanth
Wed, 29 Mar 2023 14:38:07 UTC

Is the chart you shared from the ClickHouse instance powering SigNoz?

> CPU utilisation seems very high for a database that I would consider to be receiving little in the way of updates.

Which updates are you referring to? We don't push any `UPDATE` commands in SigNoz, and where do you see the per-database CPU utilization? You can use the Unix tool `sudo perf top` to find the operations that are taking the time. In the past we have noticed this happen when `MergeMutate`, the background task responsible for merging the many temporary parts into a single data block, is what is consuming the CPU.
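Whether background merges are the culprit can be checked directly from the `system.merges` and `system.mutations` tables; a minimal sketch using standard ClickHouse system tables:

```sql
-- Background merges currently running on this node.
SELECT database, table, elapsed, progress, num_parts, memory_usage
FROM system.merges
ORDER BY elapsed DESC;

-- Pending mutations (ALTER ... UPDATE / DELETE), which are also CPU heavy.
SELECT database, table, command, is_done
FROM system.mutations
WHERE is_done = 0;
```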

Nick
Wed, 29 Mar 2023 14:48:34 UTC

The chart I shared shows the CPU utilisation on three VMs. Each VM is running clickhouse-server for SigNoz, and that clickhouse-server is only used by SigNoz. When I say updates, I mean any modification to the data in the database, e.g. an `INSERT INTO signoz_logs` statement, which is effectively an update to the database.

Srikanth
Wed, 29 Mar 2023 14:54:09 UTC

No, insert and update are entirely different things. Update operations are CPU intensive, and we don't have any update operations against ClickHouse in SigNoz as of today. How many CPUs are given to each node, and what is the write rate? You may want to check `perf top` to see what is claiming the most CPU.

Nick
Wed, 29 Mar 2023 14:57:38 UTC

I know that `INSERT` and `UPDATE` SQL statements are entirely different things, but modification of a database in any way is an update to the database (the database has been updated with new data). This sounds like an argument over nomenclature. Each node has 4 CPUs. The write rate is 3,700 logs per minute (~61 per second). The IO utilisation on the server is less than 5%, with write-queue latency at 12 ms.

Nick
Wed, 29 Mar 2023 14:59:05 UTC

`perf top` isn't giving useful information at the moment due to a lack of debug symbols.

Srikanth
Wed, 29 Mar 2023 15:01:06 UTC

No, I was not arguing over nomenclature. I want to use unambiguous language because real update operations are known to be CPU intensive for ClickHouse. Mixing the two terms can be confusing, so it's better to avoid that.

Nick
Wed, 29 Mar 2023 15:03:10 UTC

I was asking if there is a copy of the correct, finalised schema written anywhere for me to compare my schema against, to ensure it matches. I can only find the schema definitions in the various migration directories.

Srikanth
Wed, 29 Mar 2023 15:05:11 UTC

No, the final schema is not written down anywhere because the schema keeps changing. When the collectors boot up, they ensure the schema is up to date with the expected schema.
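In the absence of a canonical schema file, one option is to dump the schema each node actually has and diff the output between nodes. A sketch, assuming the SigNoz database names mentioned in this thread:

```sql
-- Dump the CREATE statements for the SigNoz databases on this node,
-- then diff the output against the other nodes.
SELECT create_table_query
FROM system.tables
WHERE database IN ('signoz_logs', 'signoz_metrics', 'signoz_traces')
ORDER BY database, name
FORMAT TSVRaw;
```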

Nick
Wed, 29 Mar 2023 15:06:01 UTC

Given that there are multiple collectors running on a system (i.e. one per node), what stops the race condition of several collectors trying to modify the schema at the same time when it is out of date?

Srikanth
Wed, 29 Mar 2023 15:07:03 UTC

The migration program acquires a DB lock before running the migrations. I will have to double-check this, but that is what I read last time.

Nick
Wed, 29 Mar 2023 15:15:41 UTC

Looking at the process list, it's taking over 1.2 seconds to run `SELECT DISTINCT fingerprint FROM signoz_metrics.distributed_time_series_v2` and more than 0.5 seconds to run `INSERT INTO signoz_logs.distributed_log`. So it's generally just slow, but that's compounded by the full CPU utilisation of clickhouse-server on the VM.
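Slowness on both reads and inserts is often a symptom of too many small parts, which also keeps the merge threads busy. A sketch for checking part counts, using the standard `system.parts` table:

```sql
-- Active part counts per table: thousands of small parts usually mean
-- merges are falling behind and every read has to touch many files.
SELECT database, table, count() AS active_parts, sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;
```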

Srikanth
Wed, 29 Mar 2023 15:25:19 UTC

Do you just use the logs? Or do you also send metrics and traces?

Nick
Wed, 29 Mar 2023 15:27:46 UTC

I use logs, traces and metrics. I've reduced the metrics collection interval from 30s to 120s, with no impact on CPU. 302,000 traces were collected in the past 60 minutes.

Nick
Wed, 29 Mar 2023 15:31:46 UTC

I suppose it doesn't help that GraphQL services tend to produce thousands of spans due to the size of the query result, e.g. a query that returns 10 rows of 30 columns would create a minimum of 10 × 30 = 300 spans.

Nick
Wed, 29 Mar 2023 15:33:58 UTC

If the fingerprint has to be looked up for each span ingested, then it could be quite expensive.

Nick
Wed, 29 Mar 2023 15:38:38 UTC

But I'd still expect these data to be distributed evenly across the sharded database, rather than pegging one node at 400% while the others don't share the load.

Srikanth
Wed, 29 Mar 2023 15:40:37 UTC

No, there is no kind of look-up-based insertion for spans. The shard key for spans is the trace ID, which is a random 128-bit value and should distribute the load fine.
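The sharding key can be confirmed from the Distributed table's engine definition; a sketch, with the `signoz_traces` database and `distributed_` table prefix assumed from the SigNoz naming seen in this thread:

```sql
-- The Distributed engine definition shows the sharding expression as its
-- fourth argument: Distributed(cluster, database, table, sharding_key).
SELECT name, engine_full
FROM system.tables
WHERE database = 'signoz_traces'
  AND name LIKE 'distributed_%';
```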

Nick
Fri, 31 Mar 2023 09:36:09 UTC

I looked at the underlying sharded tables (logs, metrics_v2, etc.) across the nodes to confirm that the data is balanced, by comparing row counts. After 2 weeks of high CPU utilisation (100% CPU on node 2 for 7 days, followed by 100% CPU on node 1 for 7 days), the system settled down to background levels of utilisation (~10%). So perhaps this was an artefact of moving from a 1-node to a 3-node ClickHouse cluster and the time it takes for imbalanced data to expire from the cluster under the existing 7-day retention rule.
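For reference, per-node row counts of the local tables can be compared from a single node via `clusterAllReplicas`; a sketch, assuming the ClickHouse cluster is named `cluster` as in the default SigNoz setup:

```sql
-- Row counts of the local (sharded) tables on every node, queried from one node.
-- 'cluster' is an assumption; use the cluster name from your deployment.
SELECT hostName() AS host, database, table, sum(rows) AS total_rows
FROM clusterAllReplicas('cluster', system.parts)
WHERE active AND database LIKE 'signoz_%'
GROUP BY host, database, table
ORDER BY database, table, host;
```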