Services UI Goes Blank Due to Retention Issue with S3 Bucket

TLDR: oluchi reports that the services UI goes blank after a while. The conversation explores possible causes, such as S3 connection problems and insufficient disk space, but no resolution is reached.

oluchi
Thu, 16 Feb 2023 12:23:26 UTC

Hello SigNoz team, I noticed that after a while our services UI goes blank. We have set up retention with an S3 bucket. What could be wrong?

Ankit
Thu, 16 Feb 2023 12:39:13 UTC

how many replicas of query-service are defined? It should be 1

Ankit
Thu, 16 Feb 2023 12:39:43 UTC

Do the services appear and then disappear, or have they never been seen since adding S3?

Ankit
Thu, 16 Feb 2023 12:39:47 UTC

oluchi

oluchi
Thu, 16 Feb 2023 12:47:31 UTC

Hello Ankit, thanks for your response. 1. We have just one replica of the SigNoz query service. 2. They disappear, and they reappear after we uninstall and reinstall SigNoz (the S3 setup and annotation are added in the values.yaml file).

Ankit
Thu, 16 Feb 2023 12:56:22 UTC

okay...can you share clickhouse logs? I am guessing if s3 connection fails, then clickhouse doesn't show any data. Also can you check if size of data is increasing in s3?

oluchi
Thu, 16 Feb 2023 12:56:47 UTC

one second, let me do the checks :point_down: Ankit

oluchi
Thu, 16 Feb 2023 12:59:38 UTC

```
worker.go:445:dropReplicas():start:infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:drop replicas based on AP
I0205 00:16:26.531186 1 worker.go:462] worker.go:462:dropReplicas():end:infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:processed replicas: 0
I0205 00:16:26.531219 1 worker.go:419] includeStopped():infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:add CHI to monitoring
I0205 00:16:26.802933 1 worker.go:485] infra/signoz-clickhouse/9ca4c129-c258-425d-80b1-a956508a0752:IPs of the CHI [*****]
I0205 00:16:26.815881 1 worker.go:489] infra/signoz-clickhouse/342fa60b-416a-4027-ae25-6de4bca505b7:Update users IPS
I0205 00:16:27.042605 1 worker.go:505] markReconcileComplete():infra/signoz-clickhouse/e9e59dca-39f7-4444-91e9-5fb092c9daa1:reconcile completed
I0215 20:17:43.965089 1 controller.go:309] infra/signoz-clickhouse:endpointsInformer.UpdateFunc: IP ASSIGNED: []v1.EndpointSubset{ v1.EndpointSubset{ Addresses: []v1.EndpointAddress{ v1.EndpointAddress{ IP: "172.********", Hostname: "", NodeName: &"ip-*******l", TargetRef: nil, }, }, NotReadyAddresses: nil, Ports: []v1.EndpointPort{ v1.EndpointPort{ Name: "http", Port: 8123, Protocol: "TCP", AppProtocol: nil, }, v1.EndpointPort{ Name: "tcp", Port: 9000, Protocol: "TCP", AppProtocol: nil, }, }, }, }
I0215 20:17:44.020501 1 worker.go:299] infra/signoz-clickhouse/f48fbf51-ff72-45f1-abd8-96a17e4f8191:IPs of the CHI [*******]
I0215 20:17:44.026758 1 worker.go:303] infra/signoz-clickhouse/9afb9ed0-a38e-44a2-a57d-598971239d44:Update users IPS
I0215 20:17:44.035005 1 worker.go:1645] updateConfigMap():infra/signoz-clickhouse/9afb9ed0-a38e-44a2-a57d-598971239d44:Update ConfigMap infra/chi-signoz-clickhouse-common-usersd
```

Ankit
Thu, 16 Feb 2023 13:21:01 UTC

this does not have much useful information

Ankit
Thu, 16 Feb 2023 13:21:08 UTC

can you grep by `s3`?

Ankit
Thu, 16 Feb 2023 13:21:25 UTC

also can you check size of s3 if that is receiving data?

oluchi
Thu, 16 Feb 2023 13:22:15 UTC

Checking ... Ankit

oluchi
Thu, 16 Feb 2023 13:27:30 UTC

No useful info came up with `s3` except the following, Ankit:
```
{e899fee7-1eea-4e3f-b6dc-6e7bd6141071} <Error> TCPHandler: Code: 243. DB::Exception: Cannot reserve 1.00 MiB, not enough space. (NOT_ENOUGH_SPACE), Stack trace (when copying this message, always include the lines below):
```

Ankit
Thu, 16 Feb 2023 13:32:32 UTC

how much space is left in the disk?

Ankit
Thu, 16 Feb 2023 13:41:00 UTC

cc: Prashant what's the default config? Maybe we want to change the defaults of clickhouse for better operation at scale

Ankit
Thu, 16 Feb 2023 13:41:15 UTC

oluchi any idea how much data you were trying to ingest?

oluchi
Thu, 16 Feb 2023 13:41:25 UTC

One second, checking now Ankit

Ankit
Thu, 16 Feb 2023 13:41:57 UTC

and this message is also temporary..it gets fixed once heavy ingestion is over. Can you check the time of the error?

oluchi
Thu, 16 Feb 2023 13:42:52 UTC

the error occurred about an hour ago

oluchi
Thu, 16 Feb 2023 13:44:08 UTC

about 10 GB still left, Ankit

Ankit
Thu, 16 Feb 2023 13:49:34 UTC

might be related. I will let Srikanth dive deeper into the issue

oluchi
Thu, 16 Feb 2023 13:50:14 UTC

Alright Ankit, thank you for your time!

Srikanth
Thu, 16 Feb 2023 14:02:41 UTC

oluchi Can you share your S3 configuration? Our retention is currently based on the span timestamp, and only then does it move the data to cold storage. However, you need to move the data based on disk availability. Did you configure the `move_factor`? What is the approximate ingestion estimate?

Prashant
Thu, 16 Feb 2023 14:03:44 UTC

> ```{e899fee7-1eea-4e3f-b6dc-6e7bd6141071} <Error> TCPHandler: Code: 243. DB::Exception: Cannot reserve 1.00 MiB, not enough space. (NOT_ENOUGH_SPACE), Stack trace (when copying this message, always include the lines below):```
I have seen this error occur when there is not enough storage on the ClickHouse storage PVC, i.e. the `/var/lib/clickhouse` mount.

Prashant
Thu, 16 Feb 2023 14:04:04 UTC

But yeah, do share your S3 configuration, so that we can have a look at it.

oluchi
Thu, 16 Feb 2023 14:08:35 UTC

my default cold storage setup, Prashant:
```
clickhouse:
  cloud: aws
  installCustomStorageClass: false
  persistence:
    size: 30Gi
  # Cold storage configuration
  coldStorage:
    enabled: true
    defaultKeepFreeSpaceBytes: "10485760"
```
s3 config:
```
{
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutBucketVersioning",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<bucket name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
```

Srikanth
Thu, 16 Feb 2023 14:12:55 UTC

`defaultKeepFreeSpaceBytes` is used to reserve some free space on any disk but that doesn’t move the data. What was your `move_factor` ?

oluchi
Thu, 16 Feb 2023 14:13:55 UTC

Is `move_factor` a value in the values.yaml file?

Srikanth
Thu, 16 Feb 2023 14:18:29 UTC

I see this is unavailable in our charts, but I believe you could override this. I think that’s the reason you are not seeing services. Your disk space is getting filled, but the default retention (7 days) is set on the timestamp of the span, so data will not move for a week. And you haven’t set up any `move_factor` (i.e. the fraction of free disk space that should always exist; when free space falls below this threshold, ClickHouse moves data to cold storage), so nothing moves the data sooner.
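To make the `move_factor` threshold concrete, here is a minimal Python sketch of the arithmetic. It assumes the semantics described in the message above (background moves to the next volume begin once the free-space fraction of a volume drops below `move_factor`), a 30 GiB volume as in the `persistence.size` shared in this thread, and reads "about 10gb" as 10 GiB. The function names are hypothetical and only illustrate the calculation; they are not part of ClickHouse or the SigNoz charts.

```python
GIB = 1024 ** 3  # bytes per GiB


def move_threshold_bytes(volume_size_bytes: int, move_factor: float) -> int:
    """Free space (in bytes) below which moves to the next volume would begin."""
    if not 0.0 <= move_factor <= 1.0:
        raise ValueError("move_factor must be between 0 and 1")
    return int(volume_size_bytes * move_factor)


def should_move(free_bytes: int, volume_size_bytes: int, move_factor: float) -> bool:
    """True when free space has dropped below the move_factor threshold."""
    return free_bytes < move_threshold_bytes(volume_size_bytes, move_factor)


volume = 30 * GIB  # assumption: persistence.size of 30Gi
free = 10 * GIB    # assumption: "about 10gb still left" read as 10 GiB

# 0.1 is ClickHouse's documented default; 0.5 is an arbitrary higher value for comparison.
for factor in (0.1, 0.5):
    threshold_gib = move_threshold_bytes(volume, factor) / GIB
    print(
        f"move_factor={factor}: moves start below {threshold_gib:.1f} GiB free, "
        f"move now: {should_move(free, volume, factor)}"
    )
```

Under these assumptions, a `move_factor` of 0.1 puts the threshold at 3 GiB free, so a disk with 10 GiB free would not yet trigger a move, while a value of 0.5 (threshold 15 GiB) would.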

oluchi
Thu, 16 Feb 2023 14:21:11 UTC

Okay, thank you Srikanth, I will look up information on how to override the `move_factor`

Srikanth
Thu, 16 Feb 2023 14:24:09 UTC

Prashant how can oluchi add the `move_factor` for volumes in our charts ? I am not sure if this can be done with override.yaml.

Prashant
Thu, 16 Feb 2023 15:27:29 UTC

It is not possible right now with override.yaml, except maybe by using the `clickhouse.files` configuration.

Prashant
Thu, 16 Feb 2023 15:30:04 UTC

Srikanth isn't the `move_factor` set to `0.1` by default?

Prashant
Thu, 16 Feb 2023 15:32:48 UTC

shouldn't that be sufficient?

Srikanth
Thu, 16 Feb 2023 15:43:14 UTC

That’s why I was asking for the ingestion rate. If the rate is higher, the data gets dropped before the background task can move it. I wanted them to try something higher and test it.

oluchi
Thu, 16 Feb 2023 15:44:06 UTC

Hello Srikanth, how do I check the ingestion rate? Is it a kubectl command, or do I have to SSH into the ClickHouse pods?

Srikanth
Thu, 16 Feb 2023 15:53:01 UTC

Yeah, you could get the relevant info by querying ClickHouse. Let me share a query that outputs the span count per time interval.

Srikanth
Thu, 16 Feb 2023 15:58:37 UTC

Can you exec into ClickHouse and share the output of this?
```
SELECT
    toStartOfInterval(timestamp, toIntervalMinute(10)) AS time,
    count() AS count
FROM signoz_traces.signoz_index_v2
GROUP BY time
ORDER BY time ASC
```

oluchi
Thu, 16 Feb 2023 16:03:46 UTC

`not found`

oluchi
Thu, 16 Feb 2023 16:04:27 UTC

Srikanth
```
/ $ SELECT
sh: SELECT: not found
/ $ toStartOfInterval(timestamp, toIntervalMinute(10)) AS time,
sh: syntax error: unexpected word (expecting ")")
/ $ count() AS count
/ $ FROM signoz_traces.signoz_index_v2
sh: FROM: not found
/ $ GROUP BY time
sh: GROUP: not found
/ $ ORDER BY time ASC
sh: ORDER: not found
/ $
```

Prashant
Thu, 16 Feb 2023 16:07:14 UTC

oluchi you will have to execute it using `clickhouse client`
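The `sh: ... not found` errors above come from pasting SQL into the pod's shell rather than into the ClickHouse client. As a minimal sketch, the Python below builds a `kubectl exec` command that passes the query to `clickhouse client --query`. The namespace `infra` is taken from the operator logs earlier in this thread; the pod name is a hypothetical placeholder that must be replaced with the real ClickHouse pod, and the helper name is made up for illustration. The sketch only prints the command; it does not run it.

```python
import shlex

# Query shared by Srikanth: span count per 10-minute interval.
INGESTION_QUERY = (
    "SELECT toStartOfInterval(timestamp, toIntervalMinute(10)) AS time, "
    "count() AS count "
    "FROM signoz_traces.signoz_index_v2 "
    "GROUP BY time ORDER BY time ASC"
)


def clickhouse_exec_command(namespace: str, pod: str, query: str) -> list:
    """Build the argv for running one SQL query inside a ClickHouse pod."""
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "clickhouse", "client", "--query", query,
    ]


# Hypothetical pod name: replace with the actual ClickHouse pod in your cluster.
cmd = clickhouse_exec_command("infra", "<clickhouse-pod-name>", INGESTION_QUERY)

# Shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it directly, provided `kubectl` is configured for the cluster.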

oluchi
Thu, 16 Feb 2023 16:07:45 UTC

I thought as much Prashant, thanks

Alejandro
Thu, 11 May 2023 20:47:23 UTC

How can I drop data since a specific day?

Prashant
Sun, 14 May 2023 16:59:44 UTC

Srikanth can you please look into this?