Trouble with Zookeeper: Resolving Volume Node Affinity Conflict

TLDR: surya's zookeeper, alert manager, and query service pods could not be scheduled because of a volume node affinity conflict. Prashant suggested either setting `nodeAffinity` to match the zone of the PVC or, if data loss is acceptable, deleting the PVCs so they are recreated.

surya
Wed, 06 Sep 2023 07:45:11 UTC

Hi Team, query service, alert manager, and zookeeper are not working. I can't find the reason. Can anyone help me with this issue?

Prashant
Thu, 07 Sep 2023 07:41:27 UTC

That’s strange. ClickHouse in cluster mode requires the zookeeper pod to be ready. Can you share the `kubectl describe pod` output for the zookeeper pod? Also, make sure you have sufficient resources in the cluster/machine.

surya
Thu, 07 Sep 2023 07:45:58 UTC

`kubectl describe my-release-zookeeper-0 pod` returns `error: the server doesn't have a resource type "my-release-zookeeper-0"`

Prashant
Thu, 07 Sep 2023 07:56:48 UTC

complete command: ```kubectl -n platform describe pod/my-release-zookeeper-0```

surya
Thu, 07 Sep 2023 07:57:26 UTC

Warning FailedScheduling 4m40s (x228 over 4h12m) default-scheduler 0/4 nodes are available: 4 node(s) had volume node affinity conflict.

surya
Thu, 07 Sep 2023 07:57:59 UTC

But it worked properly until last week.

surya
Thu, 07 Sep 2023 07:58:15 UTC

Why am I suddenly getting this error?

Prashant
Thu, 07 Sep 2023 09:35:21 UTC

The `volume node affinity conflict` error happens when the volume behind the pod's PVC lives in a different zone/region from the nodes the pod could be scheduled on.
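
A minimal sketch of how the mismatch could be confirmed, by comparing the zone recorded on the PersistentVolume with the zones of the nodes. The `platform` namespace comes from this thread; the PVC name in the example call is an assumption (list the real ones with `kubectl -n platform get pvc`), and the helper function names are made up for illustration.

```shell
# Print the name of the PersistentVolume bound to a PVC.
# $1 = namespace, $2 = PVC name
pv_for_pvc() {
  kubectl -n "$1" get pvc "$2" -o jsonpath='{.spec.volumeName}'
}

# Print the PV's required node affinity. For zonal volumes this
# usually contains the zone the volume is pinned to.
# $1 = PV name
pv_node_affinity() {
  kubectl get pv "$1" -o jsonpath='{.spec.nodeAffinity.required}'
}

# List the nodes with their zone label as an extra column.
node_zones() {
  kubectl get nodes -L topology.kubernetes.io/zone
}

# Example, run against your own cluster (PVC name is hypothetical):
#   pv=$(pv_for_pvc platform data-my-release-zookeeper-0)
#   pv_node_affinity "$pv"
#   node_zones
```

If no schedulable node carries the zone shown on the PV, the scheduler reports the conflict for every node, which would match the `0/4 nodes are available` message above.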

Prashant
Thu, 07 Sep 2023 09:39:54 UTC

This issue is seen in Kubernetes clusters with nodes in multiple zones/regions. I have encountered it before as well, and I resolved it by setting `nodeAffinity` for those components to match the zone of the PVC.
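
A rough sketch of what such a `nodeAffinity` setting could look like, applied here as a patch to the StatefulSet. The StatefulSet name `my-release-zookeeper` follows from the pod name in this thread; the zone value in the example call is a placeholder, and the function names are hypothetical. This is not necessarily how it was done in the case described above. A later `helm upgrade` may overwrite a manual patch, so the same affinity would likely need to go into the chart values as well.

```shell
# Print a patch that restricts pods to nodes in one zone.
# $1 = zone name (placeholder, use the zone shown on your PV)
zone_affinity_patch() {
  cat <<EOF
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["$1"]
EOF
}

# Apply the patch to a StatefulSet.
# $1 = namespace, $2 = StatefulSet name, $3 = zone
apply_zone_affinity() {
  kubectl -n "$1" patch statefulset "$2" -p "$(zone_affinity_patch "$3")"
}

# Example (zone is hypothetical):
#   apply_zone_affinity platform my-release-zookeeper us-east-1a
```

This only helps if a schedulable node actually exists in that zone. A pod that is already stuck in Pending may also need to be deleted so that it is recreated from the updated template.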

Prashant
Thu, 07 Sep 2023 09:41:22 UTC

If you do not care about data loss, you can delete the PVC and restart the statefulset pod(s). That should create new PVCs in the same zone as the pods.
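
One possible way to do this, sketched under assumptions: scale the StatefulSet down, delete the PVC, then scale back up so a fresh PVC is created. The PVC name and replica count in the example call are assumptions, and `recreate_pvc` is a hypothetical helper. Whether the underlying volume is also deleted depends on the PV's reclaim policy, so treat this as destructive.

```shell
# Recreate the PVC of a StatefulSet. ALL DATA ON THE VOLUME MAY BE LOST.
# $1 = namespace, $2 = StatefulSet name, $3 = PVC name, $4 = replica count
recreate_pvc() {
  # Stop the pods first so nothing is using the PVC.
  kubectl -n "$1" scale statefulset "$2" --replicas=0 &&
  # Delete the PVC that is pinned to the wrong zone.
  kubectl -n "$1" delete pvc "$3" &&
  # Scale back up; the StatefulSet creates a new PVC from its template.
  kubectl -n "$1" scale statefulset "$2" --replicas="$4"
}

# Example (PVC name and replica count are hypothetical):
#   recreate_pvc platform my-release-zookeeper data-my-release-zookeeper-0 1
```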

surya
Thu, 07 Sep 2023 09:46:03 UTC

No, I need the data.

surya
Fri, 08 Sep 2023 10:33:42 UTC

Prashant, since only the zookeeper, alert manager, and query service PVCs were facing this issue, I won't face any metric data loss after deleting the affected PVCs. Thanks.

Prashant
Fri, 08 Sep 2023 12:49:10 UTC

> Since zookeeper, alert manager and query service PVCs were facing this issue, I won't face any metric data loss after deleting the affected PVCs.

Actually, the SQLite data would be removed, since it is attached to the `query-service` statefulset.

Prashant
Fri, 08 Sep 2023 12:49:43 UTC

That would mean data related to user credentials, alerts, and dashboards would be affected.

surya
Fri, 08 Sep 2023 12:50:51 UTC

Yes, I had to sign up again.

surya
Fri, 08 Sep 2023 12:51:11 UTC

And I recreated the dashboard using the config JSON.

Prashant
Fri, 08 Sep 2023 12:51:32 UTC

yes, that is correct

Prashant
Fri, 08 Sep 2023 12:51:57 UTC

By the way, which cloud vendor are you using to manage your K8s cluster?

surya
Fri, 08 Sep 2023 12:52:06 UTC

At what frequency do you collect metrics and persist them to the DB?

surya
Fri, 08 Sep 2023 12:52:10 UTC

AWS

Prashant
Fri, 08 Sep 2023 14:15:19 UTC

> At what frequency do you collect metrics and persist them to the DB?

That depends on your collection interval. By default, it should be 30s for Hostmetrics and K8s Metrics.