Troubleshooting
Possible Performance Issues
No Message Processing
If you are using the RETRY_ALL, RETRY_FAILED, RETRY_TIMED_OUT, or RETRY_FAILED_AND_TIMED_OUT strategy for a rule-engine queue, a failed node can block all message processing in that queue.
Here is what you can do to identify the cause:
- Analyze the Rule Engine Statistics Dashboard. Check whether any messages failed or timed out. Exception details, including the failing rule node’s name, appear at the bottom of the dashboard.
- After identifying the failing rule node, enable DEBUG to see which messages trigger the failure and examine the detailed error.
Tip: Separate unstable and test use cases from production by creating a dedicated queue. Failures then affect only that queue, not the whole system. Configure this automatically per device using the Device Profile feature.
Tip: Handle Failure events for all rule nodes that connect to external services (REST API, Kafka, MQTT, etc.) to prevent rule-engine processing from stopping when an external system fails. You can store the failed message in the database, send a notification, or log it.
Growing Latency for Messages
You may experience growing message processing latency in the rule-engine. Here are the steps to diagnose the cause:
- Check if there are timeouts in the Rule Engine Statistics Dashboard. Timeouts in rule nodes slow down queue processing and can increase latency.
- Check CPU usage for the following services:
  - ThingsBoard services (tb-nodes, tb-rule-engine and tb-core nodes, transport nodes). High CPU load on a service means that you need to scale up that part of the system.
  - PostgreSQL and pgpool (if you are in high-availability mode). High load on Postgres can slow down all Postgres-related rule nodes (saving attributes, reading attributes, etc.) and the system in general.
  - Cassandra (if you are using Cassandra as storage for timeseries data). High load on Cassandra can slow down all Cassandra-related rule nodes (saving timeseries, etc.).
  - Queue. Regardless of the queue type, make sure that it always has enough resources.
- Check consumer-group lag (if you are using Kafka as the queue).
- Enable the Message Pack Processing Log. It shows the name of the slowest rule node.
- Separate use cases with dedicated queues. If a group of devices requires isolated processing, configure a separate rule-engine queue for that group. You can also route messages to different queues using logic in the Root rule chain. This ensures slow processing of one use case does not affect others.
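To make the Message Pack Processing Log step easier to act on, the following is a minimal Python sketch that ranks rule nodes by a chosen statistic. It assumes the line layout shown in the Message Pack Processing Log section of this guide; the function name `rank_rule_nodes` and the regular expression are illustrative and not part of ThingsBoard, and the log format may differ between versions.

```python
import re

# Pattern for the per-node lines of the Message Pack Processing Log.
# Assumed layout (taken from the sample output in this guide):
#   [<queue>][<node id>] <metric>: <value>. [RuleChain: <chain>|RuleNode: <node>(<node id>)]
NODE_LINE = re.compile(
    r"\[(?P<queue>[^\]]+)\]\[(?P<node_id>[0-9a-f-]{36})\] "
    r"(?P<metric>max execution time|avg execution time|execution count): "
    r"(?P<value>\d+(?:\.\d+)?)\. "
    r"\[RuleChain: (?P<chain>[^|]+)\|RuleNode: (?P<node>.+)\((?P=node_id)\)\]"
)


def rank_rule_nodes(log_lines, metric="max execution time"):
    """Return (value, rule chain, rule node, queue) tuples, largest value first."""
    largest = {}
    for line in log_lines:
        match = NODE_LINE.search(line)
        if match is None or match.group("metric") != metric:
            continue  # header lines and other statistics are ignored
        key = (match.group("chain"), match.group("node"), match.group("queue"))
        largest[key] = max(largest.get(key, 0.0), float(match.group("value")))
    return sorted(((value, *key) for key, value in largest.items()), reverse=True)


# Two entries copied from the sample log output in this guide.
SAMPLE_LINES = [
    "2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG "
    "o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] "
    "max execution time: 1102. [RuleChain: Thermostat|RuleNode: "
    "Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]",
    "2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG "
    "o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] "
    "max execution time: 1. [RuleChain: Thermostat|RuleNode: "
    "Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]",
]

if __name__ == "__main__":
    for value, chain, node, queue in rank_rule_nodes(SAMPLE_LINES):
        print(f"{value:>8.1f}  {queue}  {chain} / {node}")
```

Passing `metric="avg execution time"` or `metric="execution count"` ranks by the other two statistics the log reports.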
Troubleshooting Instruments and Tips
Rule Engine Statistics Dashboard
Check for Failures, Timeouts, and Exceptions during rule-chain processing. For more details, see the Rule Engine Statistics section.
Consumer Group Message Lag for Kafka Queue
Use this metric to identify message processing issues. Since the queue handles all system messaging, you can monitor not only rule-engine queues but also transport, core, and others. For details on troubleshooting rule-engine processing with consumer-group lag, see the Rule Engine Monitoring page.
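As one possible way to read this metric outside a dashboard, the Python sketch below sums the LAG column per topic from text produced by Kafka's `kafka-consumer-groups.sh --describe --group <group>` command. It assumes the output has a header row containing TOPIC and LAG columns, which may vary between Kafka versions; the group and topic names in the sample are made up for illustration and are not taken from a real ThingsBoard deployment.

```python
def lag_by_topic(describe_output):
    """Sum the LAG column per topic from `kafka-consumer-groups.sh --describe` text."""
    header = None
    totals = {}
    for line in describe_output.splitlines():
        fields = line.split()
        if "TOPIC" in fields and "LAG" in fields:
            header = fields  # remember the column names from the header row
            continue
        if header is None or len(fields) != len(header):
            continue  # skip blank lines, warnings, and rows that cannot be aligned
        row = dict(zip(header, fields))
        if row["LAG"].isdigit():  # "-" means the lag is unknown for that partition
            totals[row["TOPIC"]] = totals.get(row["TOPIC"], 0) + int(row["LAG"])
    return totals


# Hypothetical sample output; group, topic, and consumer names are illustrative only.
SAMPLE_OUTPUT = """
GROUP            TOPIC               PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST      CLIENT-ID
re-Main-consumer tb_rule_engine.main 0         1500           1620           120 consumer-1  /10.0.0.5 client-1
re-Main-consumer tb_rule_engine.main 1         900            905            5   consumer-2  /10.0.0.6 client-2
re-Main-consumer tb_rule_engine.hp   0         40             40             0   -           -         -
"""

if __name__ == "__main__":
    for topic, lag in sorted(lag_by_topic(SAMPLE_OUTPUT).items()):
        print(f"{topic}: {lag}")
```

A lag that keeps growing between runs suggests that consumers of that topic cannot keep up with producers.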
CPU/Memory Usage
If a service lacks resources, check CPU and memory usage by logging into the server/container/pod and running the top command.
For continuous monitoring, configure Prometheus and Grafana.
If a service consistently reaches 100% CPU, scale it horizontally by adding cluster nodes or vertically by increasing CPU allocation.
Message Pack Processing Log
Enable logging of the slowest and most frequently called rule nodes by adding the following logger to your logging configuration:
    <logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />
The following entries will then appear in your logs:
    2021-03-24 17:01:21,023 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by max execution time:
    2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1102. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
    2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
    2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by avg execution time:
    2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 604.0. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
    2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 1.0. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
    2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by execution count:
    2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] execution count: 2. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
    2021-03-24 17:01:21,028 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] execution count: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
Clearing Redis/Valkey Cache
Cached data can become corrupted. Clearing the cache is always safe because ThingsBoard repopulates it at runtime. To clear it, log into the server/container/pod, open the command-line tool (redis-cli for Redis or valkey-cli for Valkey), and run FLUSHALL. In Sentinel mode, access the master container and run the same command.
If you cannot identify the cause of a problem, clear the cache to rule it out.
Read Logs
Regardless of the deployment type, ThingsBoard logs are stored on the same server/container as the ThingsBoard Server/Node in the following directory:
    /var/log/thingsboard
Different deployment types provide different ways to view logs.
On a standalone installation, view the latest logs at runtime:
    tail -f /var/log/thingsboard/thingsboard.log
Use grep to filter output by a specific string. For example, to check for backend errors:
    grep ERROR /var/log/thingsboard/thingsboard.log
With Docker Compose, view the latest logs at runtime:
    docker compose logs -f tb-core1 tb-core2 tb-rule-engine1 tb-rule-engine2
To view only rule-engine logs:
    docker compose logs -f tb-rule-engine1 tb-rule-engine2
Use grep to filter output by a specific string. For example, to check for backend errors:
    docker compose logs tb-core1 tb-core2 tb-rule-engine1 tb-rule-engine2 | grep ERROR
Tip: Redirect logs to a file for offline analysis:
    docker compose logs -f tb-rule-engine1 tb-rule-engine2 > rule-engine.log
To access logs directly inside the container:
    docker ps
    docker exec -it NAME_OF_THE_CONTAINER bash
With Kubernetes, view all pods of the cluster:
    kubectl get pods
View the latest logs for the desired pod:
    kubectl logs -f POD_NAME
To view ThingsBoard node logs:
    kubectl logs -f tb-node-0
Use grep to filter output by a specific string. For example, to check for backend errors:
    kubectl logs -f tb-node-0 | grep ERROR
To redirect logs from all nodes to local files for analysis:
    kubectl logs -f tb-node-0 > tb-node-0.log
    kubectl logs -f tb-node-1 > tb-node-1.log
To access logs directly inside the container:
    kubectl exec -it tb-node-0 -- bash
    cat /var/log/thingsboard/tb-node-0/thingsboard.log
Enable Certain Logs
ThingsBoard lets you enable or disable logging for specific components depending on what you need for troubleshooting.
Modify the logback.xml file, located on the same server/container as ThingsBoard, in the following directory:
    /usr/share/thingsboard/conf
Here’s an example of the logback.xml configuration:
    <!DOCTYPE configuration>
    <configuration scan="true" scanPeriod="10 seconds">
        <appender name="fileLogAppender" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <file>/var/log/thingsboard/thingsboard.log</file>
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <fileNamePattern>/var/log/thingsboard/thingsboard.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
                <maxFileSize>100MB</maxFileSize>
                <maxHistory>30</maxHistory>
                <totalSizeCap>3GB</totalSizeCap>
            </rollingPolicy>
            <encoder>
                <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
            </encoder>
        </appender>
        <logger name="org.thingsboard.server" level="INFO" />
        <logger name="org.thingsboard.js.api" level="TRACE" />
        <logger name="com.microsoft.azure.servicebus.primitives.CoreMessageReceiver" level="OFF" />
        <root level="INFO">
            <appender-ref ref="fileLogAppender"/>
        </root>
    </configuration>
The most useful config elements for troubleshooting are the loggers, which enable or disable logging per class or package. In the example above, the default level is INFO (general info, warnings, and errors), while org.thingsboard.js.api is set to TRACE, the most detailed logging level. Logging can also be completely disabled for a component, as shown for com.microsoft.azure.servicebus.primitives.CoreMessageReceiver using the OFF level.
To change logging for a component, add or update the <logger> entry and wait up to 10 seconds for the change to take effect.
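To illustrate how per-class and per-package loggers interact, here is a simplified Python sketch of logback's level inheritance: a logger without its own level takes the level of its nearest configured ancestor in the dot-separated name hierarchy, and falls back to the root level otherwise. The function `effective_level` is hypothetical and only models that lookup; it is not how ThingsBoard or logback is implemented.

```python
def effective_level(logger_name, configured, root_level="INFO"):
    """Return the level of the nearest configured ancestor, or the root level."""
    name = logger_name
    while name:
        if name in configured:
            return configured[name]
        # Drop the last dot-separated segment: "a.b.c" -> "a.b", and "a" -> "".
        name = name.rpartition(".")[0]
    return root_level


# Logger entries from the example logback.xml above.
CONFIGURED = {
    "org.thingsboard.server": "INFO",
    "org.thingsboard.js.api": "TRACE",
    "com.microsoft.azure.servicebus.primitives.CoreMessageReceiver": "OFF",
}

if __name__ == "__main__":
    target = "org.thingsboard.server.service.queue.TbMsgPackProcessingContext"
    print(effective_level(target, CONFIGURED))  # inherited from org.thingsboard.server

    # Adding a more specific <logger> entry overrides the inherited level.
    with_debug = dict(CONFIGURED, **{target: "DEBUG"})
    print(effective_level(target, with_debug))
```

This is why adding a single `<logger>` entry for one class, such as TbMsgPackProcessingContext, raises verbosity for that class only and leaves the rest of org.thingsboard.server at INFO.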
Different deployment types require different steps to apply the updated configuration.
On a standalone installation, update /usr/share/thingsboard/conf/logback.xml to change the logging configuration.
With Docker Compose, the /config folder inside the container is mapped to your local system (./tb-node/conf folder). Update ./tb-node/conf/logback.xml to change the logging configuration.
Kubernetes uses a ConfigMap to provide tb-nodes with logback configuration. To update logback.xml:
    edit common/tb-node-configmap.yml
    kubectl apply -f common/tb-node-configmap.yml
After 10 seconds the changes will be applied to the logging configuration.
Metrics
Enable Prometheus metrics by setting METRICS_ENABLED to true and METRICS_ENDPOINTS_EXPOSE to prometheus in the configuration file.
When running ThingsBoard as microservices with separate MQTT and CoAP transport services, also set WEB_APPLICATION_ENABLE to true, WEB_APPLICATION_TYPE to servlet, and HTTP_BIND_PORT to 8081 for those services.
Metrics are available at https://<yourhostname>/actuator/prometheus (no authentication required).
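For a quick check without a Prometheus server, the endpoint's response can be saved to a file and inspected directly. The Python sketch below parses the Prometheus text exposition format and computes a failure ratio from totalMsgs and failedMsgs. The metric name `ruleEngine_Main`, the `statsName` label, and the sample values are assumptions made for illustration; the exact names your ThingsBoard version exposes may differ, so check the real endpoint output first.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)")
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(raw_labels or "")), float(value)))
    return samples


def failure_ratio(samples, metric):
    """Return failedMsgs / totalMsgs for one metric, or 0.0 if there were no messages."""
    stats = {
        labels.get("statsName"): value
        for name, labels, value in samples
        if name == metric
    }
    total = stats.get("totalMsgs", 0.0)
    return stats.get("failedMsgs", 0.0) / total if total else 0.0


# Hypothetical endpoint output; names and values are illustrative only.
SAMPLE_TEXT = """
# TYPE ruleEngine_Main gauge
ruleEngine_Main{statsName="totalMsgs"} 1000.0
ruleEngine_Main{statsName="failedMsgs"} 25.0
ruleEngine_Main{statsName="successfulMsgs"} 975.0
"""

if __name__ == "__main__":
    ratio = failure_ratio(parse_samples(SAMPLE_TEXT), "ruleEngine_Main")
    print(f"failure ratio: {ratio:.3f}")
```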
Prometheus Metrics
The following internal state metrics are exposed via Spring Actuator to Prometheus.
tb-node Metrics
- attributes_queue_{index_of_queue} (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for writing attributes to the database. Several queues (threads) handle attribute persistence for maximum throughput.
- ruleEngine_{name_of_queue} (statsNames: totalMsgs, failedMsgs, successfulMsgs, tmpFailed, failedIterations, successfulIterations, timeoutMsgs, tmpTimeout): stats for Rule Engine message processing, per queue (e.g., Main, HighPriority, SequentialByOriginator). Stat descriptions:
  - tmpFailed: number of messages that failed and were reprocessed later
  - tmpTimeout: number of messages that timed out and were reprocessed later
  - timeoutMsgs: number of messages that timed out and were then discarded
  - failedIterations: number of message-pack processing iterations in which at least one message was not processed successfully
- ruleEngine_{name_of_queue}_seconds (for each present tenantId): stats about the time message processing took for different queues.
- core (statsNames: totalMsgs, toDevRpc, coreNfs, sessionEvents, subInfo, subToAttr, subToRpc, deviceState, getAttr, claimDevice, subMsgs): stats for internal system message processing:
  - toDevRpc: number of processed RPC responses from Transport services
  - sessionEvents: number of session events from Transport services
  - subInfo: number of subscription infos from Transport services
  - subToAttr: number of subscriptions to attribute updates from Transport services
  - subToRpc: number of subscriptions to RPC from Transport services
  - getAttr: number of ‘get attributes’ requests from Transport services
  - claimDevice: number of Device claims from Transport services
  - deviceState: number of processed changes to Device State
  - subMsgs: number of processed subscriptions
  - coreNfs: number of processed specific ‘system’ messages
- jsInvoke (statsNames: requests, responses, failures): stats for total, successful, and failed requests to JS executors
- attributes_cache (results: hit, miss): stats about how many attribute requests went to the cache
Transport Metrics
- transport (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for requests received by Transport from TB nodes
- ruleEngine_producer (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the Rule Engine
- core_producer (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the TB node Device actor
- transport_producer (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for requests from Transport to TB
Some metrics depend on the type of database you are using to persist timeseries data.
PostgreSQL-Specific Metrics
- ts_latest_queue_{index_of_queue} (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for writing latest telemetry to the database. Several queues (threads) maximize write throughput.
- ts_queue_{index_of_queue} (statsNames: totalMsgs, failedMsgs, successfulMsgs): stats for writing telemetry to the database. Several queues (threads) maximize write throughput.
Cassandra-Specific Metrics
- rateExecutor_currBuffer: number of messages that are currently being persisted inside Cassandra
- rateExecutor_tenant (for each present tenantId): number of requests that got rate-limited
- rateExecutor (statsNames: totalAdded, totalRejected, totalLaunched, totalReleased, totalFailed, totalExpired, totalRateLimited). Stat descriptions:
  - totalAdded: number of messages that were submitted for persisting
  - totalRejected: number of messages that were rejected while trying to submit for persisting
  - totalLaunched: number of messages sent to Cassandra
  - totalReleased: number of successfully persisted messages
  - totalFailed: number of messages that were not persisted
  - totalExpired: number of expired messages that were not sent to Cassandra
  - totalRateLimited: number of messages that were not processed because of the Tenant’s rate limits
Grafana Dashboards
You can import preconfigured Grafana dashboards from this repository.
Grafana dashboards are also available when deploying the ThingsBoard Docker Compose cluster. See the Docker Compose cluster setup guide for details.
Set MONITORING_ENABLED to true before deployment. Once running, Prometheus is available at http://localhost:9090 and Grafana at http://localhost:3000 (default credentials: admin / foobar).
OAuth2
Sometimes, after configuring OAuth, the button for logging in with an OAuth provider does not appear. This happens when Domain name and Redirect URI Template contain incorrect values: they must match the URL you use to access your ThingsBoard web page.
| Base URL | Domain name | Redirect URI Template |
|---|---|---|
| http://mycompany.com:8080 | mycompany.com:8080 | http://mycompany.com:8080/login/oauth2/code |
| https://mycompany.com | mycompany.com | https://mycompany.com/login/oauth2/code |
For OAuth2 configuration, see OAuth 2.0 Support.
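The pattern in the table above can be expressed as a small Python sketch that derives both values from the base URL, which may help when checking a configuration for typos. The function name `oauth2_settings` and the returned dictionary keys are hypothetical; the sketch only reproduces the two rows of the table and does not query ThingsBoard.

```python
from urllib.parse import urlsplit


def oauth2_settings(base_url):
    """Derive Domain name and Redirect URI Template from the ThingsBoard base URL."""
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an http(s) base URL: {base_url!r}")
    return {
        # Host plus port, exactly as typed in the browser address bar.
        "domain_name": parts.netloc,
        "redirect_uri_template": f"{parts.scheme}://{parts.netloc}/login/oauth2/code",
    }


if __name__ == "__main__":
    for url in ("http://mycompany.com:8080", "https://mycompany.com"):
        print(url, oauth2_settings(url))
```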
Getting Help
- GitHub Project: check out the project and consider contributing.
- Stack Overflow: ask questions tagged with thingsboard; the ThingsBoard team monitors this tag.
- Contact us: if your problem isn’t answered by any of the guides above, contact the ThingsBoard team directly.