
We Built a Tool to Diagnose ScyllaDB Kubernetes Issues

Introducing Scylla Operator Analyze, a tool to help platform engineers and administrators deploy ScyllaDB clusters running on Kubernetes.

Imagine it’s a Friday afternoon. Your company is migrating all its data to ScyllaDB and you’re in the middle of setting up the cluster on Kubernetes. Then, something goes wrong. Your time today is limited, but the sheer volume of ScyllaDB configuration feels endless. To help you detect problems in ScyllaDB deployments, we built Scylla Operator Analyze, a command-line tool designed to automatically analyze Kubernetes-based ScyllaDB clusters, identify potential misconfigurations, and offer actionable diagnostics.

In modern infrastructure management, Kubernetes has revolutionized how we orchestrate containers and manage distributed systems. However, debugging complex Kubernetes deployments remains a significant challenge, especially in production-grade, high-performance environments like those powered by ScyllaDB. In this blog post, we’ll explain what Scylla Operator Analyze is, how it works, and how it may help platform engineers and administrators deploy ScyllaDB clusters running on Kubernetes.

The repo we’ve been working on is available here. It’s a fork of Scylla Operator, but the project hasn’t been merged upstream (it’s highly experimental).

What is Scylla Operator Analyze?

Scylla Operator Analyze is a Go-based command-line utility that extends Scylla Operator with a diagnostic command. Its goal is straightforward: automatically inspect a given Kubernetes deployment and report problems it identifies in the deployment configuration. We designed the tool to help ScyllaDB’s technical support staff quickly diagnose known issues reported by our clients, providing solutions for simple problems and helpful insights in more complex cases. It’s also freely available as a subcommand of the Scylla Operator binary. The next few sections share how we implemented the tool.
If you want to go straight to example usage, skip to the Making a diagnosis section.

Capturing the cluster state

Kubernetes deployments consist of many components with various functions. Collectively, they are called resources. The Kubernetes API presents them to the client as objects containing fields with information about their configuration and current state.

Two modes of operation

Scylla Operator Analyze supports two ways of collecting this data:

Live Cluster Connection
The tool can connect directly to a Kubernetes cluster using the client-go API. Once connected, it retrieves data from Kubernetes resources and compiles it into an internal representation.

Archive-Based Analysis (Must-Gather)
Alternatively, the tool can analyze archived cluster states created using a utility called must-gather. These archives contain YAML descriptions of resources, allowing offline analysis.

Diagnosis by analyzing symptoms

Symptoms are high-level objects representing certain issues that could occur while deploying a ScyllaDB cluster. A symptom contains the diagnosis of the problem and a suggestion on how to fix it, as well as a method for checking whether the problem occurs in a given deployment (we cover this in the section about selectors).

To represent more complex problems, symptoms can be composed into tree-like structures. For example, a problem that could manifest itself in a few different ways could be represented by several symptoms, each checking one of the spots the problem could affect. Those symptoms would be connected to one root symptom describing the cause of the problem. This way, if any of the sub-symptoms reports that its condition is met, the tool can display the root cause instead of one specific manifestation of the problem.

Example of a symptom and the workflow used to detect it. In this example, let’s assume that the driver is unable to provide storage, but NodeConfig does not report a nonexistent device.
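One way to picture such a symptom tree is a short Go sketch. The types and names below (Symptom, Occurs, nodeConfigVolumeSymptom) are hypothetical illustrations, not the actual Scylla Operator Analyze API:

```go
package main

import "fmt"

// Symptom is a hypothetical sketch of a tree-shaped symptom: a node either
// checks a condition itself (a leaf) or aggregates sub-symptoms.
type Symptom struct {
	Diagnosis  string
	Suggestion string
	Check      func() (bool, error) // leaf condition; nil for aggregate nodes
	Children   []*Symptom
}

// Occurs walks the tree and short-circuits on the first sub-symptom whose
// condition is met.
func (s *Symptom) Occurs() (bool, error) {
	if s.Check != nil {
		return s.Check()
	}
	for _, child := range s.Children {
		ok, err := child.Occurs()
		if err != nil {
			return false, err
		}
		if ok {
			return true, nil // one manifestation is enough
		}
	}
	return false, nil
}

// nodeConfigVolumeSymptom models the example from the text: the driver is
// unable to provide storage, but NodeConfig does not report a nonexistent
// device. The checks are hard-coded here; the real tool inspects resources.
func nodeConfigVolumeSymptom() *Symptom {
	return &Symptom{
		Diagnosis:  "NodeConfig configured with nonexistent volume",
		Suggestion: "fix the device path in the NodeConfig",
		Children: []*Symptom{
			{Check: func() (bool, error) { return false, nil }}, // nonexistent device reported? no
			{Check: func() (bool, error) { return true, nil }},  // driver unable to provide storage? yes
		},
	}
}

func main() {
	root := nodeConfigVolumeSymptom()
	if ok, _ := root.Occurs(); ok {
		// The root cause is reported, not the specific manifestation.
		fmt.Println(root.Diagnosis)
	}
}
```

Because the sub-symptoms are tried in order and evaluation stops at the first match, the root diagnosis is printed even though only one manifestation of the problem is present.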
When checking whether the symptom occurs, the tool performs the following steps:

1. Check if the NodeConfig reports a nonexistent device – no.
2. Check if the driver is unable to provide storage – yes. At this point we know the symptom occurs, so we don’t need to check any more sub-symptoms.
3. Since one of the sub-symptoms occurs, the main symptom (NodeConfig configured with nonexistent volume) is reported to the user.

Deployment condition description

Resources

As described earlier, Kubernetes deployments can be considered collections of many interconnected resources. All resources are described using so-called fields. Fields contain information identifying resources, deployment configuration, and descriptions of past and current states. Together, these data give the controllers all the information they need to supervise the deployment. That also makes them very useful for debugging issues, and they are the main source of information for our tool.

Every resource has a special kind field, which describes what the resource is and indicates what other fields are available. Some fundamental Kubernetes resource kinds include Pods, Services, etc. These can be extended with custom ones, such as the ScyllaCluster resource kind defined by Scylla Operator. This provides the most basic kind of grouping of resources in Kubernetes. Other fields are grouped into sections: Metadata, which provides identifying information; Spec, which contains the configuration; and Status, which describes the current state.
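As a rough illustration, this grouping can be mirrored with a few trimmed Go structs. These types are hypothetical and heavily cut down (the real ones live in k8s.io/api and carry many more fields); the example decodes a resource description from JSON, which the Kubernetes API serves alongside YAML:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical, trimmed structs mirroring the kind/Metadata/Spec/Status grouping.
type Metadata struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels"`
}

type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

type PodSpec struct {
	Volumes []struct {
		Name string `json:"name"`
	} `json:"volumes"`
}

type PodStatus struct {
	Phase      string      `json:"phase"`
	Conditions []Condition `json:"conditions"`
}

type Pod struct {
	Kind     string    `json:"kind"`
	Metadata Metadata  `json:"metadata"`
	Spec     PodSpec   `json:"spec"`
	Status   PodStatus `json:"status"`
}

// A cut-down resource description, as JSON.
const doc = `{
  "kind": "Pod",
  "metadata": {"name": "scylla-us-east-1-us-east-1a-0", "namespace": "scylla"},
  "spec": {"volumes": [{"name": "data"}]},
  "status": {
    "phase": "Pending",
    "conditions": [{"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}]
  }
}`

func decodePod(raw string) (Pod, error) {
	var p Pod
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	p, err := decodePod(doc)
	if err != nil {
		panic(err)
	}
	// The kind field tells us which other fields to expect in the object.
	fmt.Printf("%s %s/%s: %s\n", p.Kind, p.Metadata.Namespace, p.Metadata.Name,
		p.Status.Conditions[0].Reason)
}
```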
Such a description in YAML format may look something like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-12-03T17:47:06Z"
  labels:
    scylla/cluster: scylla
    scylla/datacenter: us-east-1
    scylla/scylla-version: 6.2.0
  name: scylla-us-east-1-us-east-1a-0
  namespace: scylla
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-scylla-us-east-1-us-east-1a-0
status:
  conditions:
  - lastTransitionTime: "2024-12-03T17:47:06Z"
    message: '0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims.
      preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
```

Selectors

An accurate description of symptoms (presented in the previous section) requires a method for describing conditions in the deployment using information contained in the resources’ fields. Moreover, because of the distributed nature of both Kubernetes deployments and ScyllaDB, these descriptions must also specify how the resources are related to one another.

Our tool comes with a package providing selectors. They offer a simple yet powerful way to describe deployment conditions using Kubernetes objects, in a way that’s flexible and allows for automatic processing by the provided selection engine. A selector can be thought of as a query: it specifies the kinds of resources to select and the criteria they should satisfy.

Selectors are constructed using four main methods of the selector structure builder. First, the developer specifies the resources to be selected with the Select method, giving their kind and a predicate which should be true for the selected resources. The predicate is provided as a standard Go closure to allow for complex conditions if needed. Next, the developer may call the Relate method to define a relationship between two kinds of resources.
This is again defined using a Go closure as a predicate, which must hold for the two objects to be considered in the same result set. This can establish a context within which an issue should be inspected (for example: connecting a Pod to relevant Storage resources). Finally, constraints for individual resources in the result set can be specified with the Where method, similarly to how it is done in the Select method. This method is mainly meant to be used with the SelectWithNil method, which is the same as Select except that it allows returning a special nil value instead of a resource instance. This nil value signifies that no resource of the given kind matches all the other resources in the resulting set. Thanks to this, selectors can also be used to detect a scenario where a resource is missing, just by examining the context of related resources.

An example selector — shortened for brevity — may look something like this:

```go
selector.
	New().
	Select("scylla-pod", selector.Type[*v1.Pod](), func(p *v1.Pod) (bool, error) { /* ... */ }).
	SelectWithNil("storage-class", selector.Type[*storagev1.StorageClass](), nil).
	Select("pod-pvc", selector.Type[*v1.PersistentVolumeClaim](), nil).
	Relate("scylla-pod", "pod-pvc", func(p *v1.Pod, pvc *v1.PersistentVolumeClaim) (bool, error) {
		for _, volume := range p.Spec.Volumes {
			vPvc := volume.PersistentVolumeClaim
			if vPvc != nil && vPvc.ClaimName == pvc.Name {
				return true, nil
			}
		}
		return false, nil
	}).
	Relate("pod-pvc", "storage-class", /* ... */).
	Where("storage-class", func(sc *storagev1.StorageClass) (bool, error) {
		return sc == nil, nil
	})
```

In symptom definitions, selectors for the corresponding conditions are used and are usually constructed alongside them. Such a selector provides a description of a faulty condition: if there is a matching set of resources, it can be inferred that the symptom occurs.
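To make the semantics concrete, here is a self-contained toy version of what evaluating such a selector amounts to. The types and the missingStorageClasses helper are invented for illustration (the real engine is generic over resource kinds); a nil lookup result plays the role of SelectWithNil, signalling that a related resource is absent:

```go
package main

import "fmt"

// Toy stand-ins for Kubernetes resources.
type Pod struct{ Name, ClaimName string }
type PVC struct{ Name, StorageClass string }
type StorageClass struct{ Name string }

// missingStorageClasses relates each Pod to its PVC, then looks for the
// PVC's StorageClass; if none exists (the nil case), the faulty condition
// holds and a diagnosis message is produced.
func missingStorageClasses(pods []Pod, pvcs []PVC, classes []StorageClass) []string {
	var diagnoses []string
	for _, p := range pods {
		for _, c := range pvcs {
			if p.ClaimName != c.Name { // Relate("scylla-pod", "pod-pvc", ...)
				continue
			}
			var sc *StorageClass // nil mimics SelectWithNil("storage-class", ...)
			for i := range classes {
				if classes[i].Name == c.StorageClass {
					sc = &classes[i]
				}
			}
			if sc == nil { // Where("storage-class", sc == nil): symptom occurs
				diagnoses = append(diagnoses,
					fmt.Sprintf("%s: StorageClass %q is missing", p.Name, c.StorageClass))
			}
		}
	}
	return diagnoses
}

func main() {
	pods := []Pod{{Name: "scylla-0", ClaimName: "data-scylla-0"}}
	pvcs := []PVC{{Name: "data-scylla-0", StorageClass: "scylladb-local-xfs"}}
	// No StorageClasses deployed, so the pod's class cannot be resolved.
	fmt.Println(missingStorageClasses(pods, pvcs, nil))
}
```

The nested loops stand in for the cross product the real engine iterates over; the predicates prune it down to the resource sets that describe a faulty condition.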
Finally, the selector can be used, given all the deployment’s resources, to construct an iterator-like object that provides a list of all the sets of resources matching the selector. Symptoms can then use those results to detect issues and generate diagnoses containing useful debugging information.

Making a diagnosis

When a symptom relating to a problematic condition is detected, a diagnosis for the user is generated. Diagnoses are automatically generated report objects summarizing the problem and providing additional information. A diagnosis consists of an issue description, identifiers of resources related to the fault, and hints for the user (when available). Hints may contain, for example, a description of steps to remedy the issue or a reference to a bug tracker. In the final stage of analysis, the diagnoses are presented to the user, and the output may look something like this:

```
Diagnoses:
  scylladb-local-xfs StorageClass used by a ScyllaCluster is missing
  Suggestions: deploy scylladb-local-xfs StorageClass (or change StorageClass)
  Resources GVK:
    /v1.PersistentVolumeClaim, scylla/data-scylla-us-east-1-us-east-1a-0 (4…)
    scylla.scylladb.com/v1.ScyllaCluster, scylla/scylla (b6343b79-4887-497b…)
    /v1.Pod, scylla/scylla-us-east-1-us-east-1a-0 (0e716c3f-6432-4eeb-b5ff-…)
```

Learn more

As we suggested, Kubernetes deployments of ScyllaDB involve many interacting components, each of which has its own quirks. Here are a few strategies to help you diagnose the problems you encounter:

- Run Scylla Doctor
- Check our troubleshooting guide
- Look for open issues on our GitHub
- Check our forum
- Ask us on Slack
- Learn more about ScyllaDB at ScyllaDB University

Good luck, fellow troubleshooter!