Troubleshooting

If you are using the RETRY_ALL, RETRY_FAILED, RETRY_TIMED_OUT, or RETRY_FAILED_AND_TIMED_OUT strategy for a rule-engine queue, a single failing rule node can block all message processing in that queue.

Here is what you can do to identify the cause:

  • Analyze the Rule Engine Statistics Dashboard. Check whether any messages failed or timed out. Exception details, including the failing rule node’s name, appear at the bottom of the dashboard.

  • After identifying the failing rule node, enable DEBUG to see which messages trigger the failure and examine the detailed error.

Tip: Separate unstable and test use cases from production by creating a dedicated queue. Failures then affect only that queue, not the whole system. Configure this automatically per device using the Device Profile feature.

Tip: Handle Failure events for all rule nodes that connect to external services (REST API, Kafka, MQTT, etc.) to prevent rule-engine processing from stopping when an external system fails. You can store the failed message in the database, send a notification, or log it.

You may experience growing message processing latency in the rule-engine. Here are the steps to diagnose the cause:

  • Check if there are timeouts in the Rule Engine Statistics Dashboard. Timeouts in rule-nodes slow down the processing of the queue and can lead to latency.

  • Check CPU usage for the following services:

    • ThingsBoard services (tb-nodes, tb-rule-engine and tb-core nodes, transport nodes). High CPU load on some services means that you need to scale up that part of the system.
    • PostgreSQL and pgpool (if you are in high-availability mode). High load on Postgres can slow down all Postgres-related rule-nodes (saving attributes, reading attributes, etc.) and the system in general.
    • Cassandra (if you are using Cassandra as storage for timeseries data). High load on Cassandra can slow down all Cassandra-related rule-nodes (saving timeseries, etc.).
    • Queue. Regardless of the queue type, make sure that it always has enough resources.
  • Check consumer-group lag (if you are using Kafka as queue).

  • Enable Message Pack Processing Log. It will allow you to see the name of the slowest rule-node.

  • Separate use cases with dedicated queues. If a group of devices requires isolated processing, configure a separate rule-engine queue for that group. You can also route messages to different queues using logic in the Root rule chain. This ensures slow processing of one use case does not affect others.

Check for Failures, Timeouts, and Exceptions during rule-chain processing. For more details, see the Rule Engine Statistics section.

Consumer Group Message Lag for Kafka Queue

Use this metric to identify message processing issues. Since the queue handles all system messaging, you can monitor not only rule-engine queues but also transport, core, and others. For details on troubleshooting rule-engine processing with consumer-group lag, see the Rule Engine Monitoring page.
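
As a rough illustration of reading consumer-group lag, the sketch below sums the LAG column from the tabular output of Kafka's kafka-consumer-groups.sh --describe command. It assumes that output starts with a header row containing TOPIC and LAG columns. The helper name lag_by_topic is hypothetical, and the group and topic names in your deployment will differ.

```python
from collections import defaultdict


def lag_by_topic(describe_output):
    """Sum the LAG column per topic from `kafka-consumer-groups.sh --describe` text."""
    totals = defaultdict(int)
    header = None
    for line in describe_output.splitlines():
        cols = line.split()
        if not cols:
            continue
        # The header row names the columns; use it instead of fixed positions.
        if "TOPIC" in cols and "LAG" in cols:
            header = cols
            continue
        if header is None:
            continue
        row = dict(zip(header, cols))
        # Kafka prints "-" when no offset has been committed; skip those rows.
        if row.get("LAG", "").isdigit():
            totals[row["TOPIC"]] += int(row["LAG"])
    return dict(totals)
```

A topic whose total keeps growing between two runs is being consumed more slowly than it is being filled, which points at the service that reads it.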

If you suspect that a service lacks resources, log into the server/container/pod and run the top command to check its CPU and memory usage.

For continuous monitoring, configure Prometheus and Grafana.

If a service consistently reaches 100% CPU, scale it horizontally by adding cluster nodes or vertically by increasing CPU allocation.

Enable logging of the slowest and most frequently called rule nodes by adding the following logger to your logging configuration:

<logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />

The following entries will then appear in your logs:

2021-03-24 17:01:21,023 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by max execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1102. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by avg execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 604.0. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 1.0. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by execution count:
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] execution count: 2. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,028 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] execution count: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
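
To pick out the slowest rule nodes from a large log file, the "max execution time" lines can be parsed with a short script. The sketch below assumes the exact line layout shown in the excerpt above. The helper name slowest_nodes is hypothetical, and the excerpt does not state the unit of the reported value, so the script returns it as-is.

```python
import re

# Layout of the "max execution time" lines in the excerpt above:
# [queue][node id] max execution time: N. [RuleChain: X|RuleNode: Y(node id)]
MAX_TIME_LINE = re.compile(
    r"\[(?P<queue>[^\]]+)\]\[(?P<node_id>[0-9a-f-]+)\] "
    r"max execution time: (?P<value>\d+)\. "
    r"\[RuleChain: (?P<chain>[^|]+)\|RuleNode: (?P<node>.+)\((?P=node_id)\)\]"
)


def slowest_nodes(lines):
    """Return (value, queue, rule chain, rule node) tuples, largest value first."""
    worst = {}
    for line in lines:
        match = MAX_TIME_LINE.search(line)
        if match is None:
            continue  # header lines and unrelated log entries
        key = (match["queue"], match["chain"], match["node"])
        # Keep the largest value seen for each node across all reports.
        worst[key] = max(worst.get(key, 0), int(match["value"]))
    return sorted(((value, *key) for key, value in worst.items()), reverse=True)
```

The function accepts any iterable of lines, so an open handle on /var/log/thingsboard/thingsboard.log can be passed directly.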

Cached data can become corrupted. Clearing the cache is always safe — ThingsBoard repopulates it at runtime. To clear it, log into the server/container/pod, open the command-line tool (redis-cli for Redis or valkey-cli for Valkey), and run FLUSHALL. In Sentinel mode, access the master container and run the same command.

If you cannot identify the cause of a problem, clear the cache to rule it out.

Regardless of the deployment type, ThingsBoard logs are stored on the same server/container as the ThingsBoard Server/Node in the following directory:

/var/log/thingsboard

Different deployment types provide different ways to view logs:

View the latest log entries in real time:

tail -f /var/log/thingsboard/thingsboard.log

Use grep to filter output by a specific string. For example, to check for backend errors:

grep ERROR /var/log/thingsboard/thingsboard.log
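
When the log contains many errors, grouping them by logger can show which component fails most often. The sketch below assumes log lines follow the encoder pattern used in the example logback.xml on this page (%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n). The helper name count_by_logger is hypothetical.

```python
import re
from collections import Counter

# Matches lines written with: %d{ISO8601} [%thread] %-5level %logger{36} - %msg%n
LOG_LINE = re.compile(
    r"^(?P<time>\S+ \S+) \[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+) - (?P<msg>.*)$"
)


def count_by_logger(lines, level="ERROR"):
    """Return (logger, count) pairs for the given level, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Stack-trace continuation lines do not match the pattern and are skipped.
        if match and match["level"] == level:
            counts[match["logger"]] += 1
    return counts.most_common()
```

Pass an open handle on /var/log/thingsboard/thingsboard.log as lines, or change level to "WARN" to group warnings instead.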

ThingsBoard lets you enable or disable logging for specific components depending on what you need for troubleshooting.

Modify the logback.xml file, located on the same server/container as ThingsBoard, in the following directory:

/usr/share/thingsboard/conf

Here’s an example of the logback.xml configuration:

<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">
    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard/thingsboard.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard/thingsboard.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    <logger name="org.thingsboard.server" level="INFO" />
    <logger name="org.thingsboard.js.api" level="TRACE" />
    <logger name="com.microsoft.azure.servicebus.primitives.CoreMessageReceiver" level="OFF" />
    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>
</configuration>

The most useful config elements for troubleshooting are the loggers, which enable or disable logging per class or package. In the example above, the default level is INFO (general info, warnings, and errors), while org.thingsboard.js.api is set to the most detailed logging level. Logging can also be completely disabled for a component — as shown for com.microsoft.azure.servicebus.primitives.CoreMessageReceiver using the OFF level.

To change logging for a component, add or update its <logger> entry. The configuration is rescanned every 10 seconds (the scanPeriod value in the example above), so allow up to 10 seconds for the change to take effect.

Different deployment types require different steps to apply the updated configuration:

Update /usr/share/thingsboard/conf/logback.xml to change the logging configuration.

Enable Prometheus metrics by setting METRICS_ENABLED to true and METRICS_ENDPOINTS_EXPOSE to prometheus in the configuration file.

When running ThingsBoard as microservices with separate MQTT and CoAP transport services, also set WEB_APPLICATION_ENABLE to true, WEB_APPLICATION_TYPE to servlet, and HTTP_BIND_PORT to 8081 for those services.

Metrics are available at https://<yourhostname>/actuator/prometheus (no authentication required).

The following internal state metrics are exposed via Spring Actuator to Prometheus.

  • attributes_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing attributes to the database. Several queues (threads) handle attribute persistence for maximum throughput.
  • ruleEngine_{name_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs, tmpFailed, failedIterations, successfulIterations, timeoutMsgs, tmpTimeout): stats for Rule Engine message processing, per queue (e.g., Main, HighPriority, SequentialByOriginator). Stat descriptions:
    • tmpFailed: number of messages that failed and got reprocessed later
    • tmpTimeout: number of messages that timed out and got reprocessed later
    • timeoutMsgs: number of messages that timed out and were discarded afterwards
    • failedIterations: number of message-pack processing iterations in which at least one message was not processed successfully
  • ruleEngine_{name_of_queue}_seconds (for each present tenantId): stats about the time message processing took for different queues.
  • core (statsNames — totalMsgs, toDevRpc, coreNfs, sessionEvents, subInfo, subToAttr, subToRpc, deviceState, getAttr, claimDevice, subMsgs): stats for internal system message processing:
    • toDevRpc: number of processed RPC responses from Transport services
    • sessionEvents: number of session events from Transport services
    • subInfo: number of subscription infos from Transport services
    • subToAttr: number of subscriptions to attribute updates from Transport services
    • subToRpc: number of subscriptions to RPC from Transport services
    • getAttr: number of ‘get attributes’ requests from Transport services
    • claimDevice: number of Device claims from Transport services
    • deviceState: number of processed changes to Device State
    • subMsgs: number of processed subscriptions
    • coreNfs: number of processed specific ‘system’ messages
  • jsInvoke (statsNames — requests, responses, failures): stats for total, successful, and failed requests to JS executors
  • attributes_cache (results — hit, miss): stats about how many attribute requests went to the cache
  • transport (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for requests received by Transport from TB nodes
  • ruleEngine_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the Rule Engine
  • core_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the TB node Device actor
  • transport_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for requests from Transport to TB

Some metrics depend on the type of database you are using to persist timeseries data.

  • ts_latest_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing latest telemetry to the database. Several queues (threads) maximize write throughput.
  • ts_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing telemetry to the database. Several queues (threads) maximize write throughput.
  • rateExecutor_currBuffer: number of messages that are currently being persisted inside Cassandra.
  • rateExecutor_tenant (for each present tenantId): number of requests that got rate-limited
  • rateExecutor (statsNames — totalAdded, totalRejected, totalLaunched, totalReleased, totalFailed, totalExpired, totalRateLimited). Stats descriptions:
    • totalAdded: number of messages that were submitted for persisting
    • totalRejected: number of messages that were rejected while trying to submit for persisting
    • totalLaunched: number of messages sent to Cassandra
    • totalReleased: number of successfully persisted messages
    • totalFailed: number of messages that were not persisted
    • totalExpired: number of expired messages that were not sent to Cassandra
    • totalRateLimited: number of messages that were not processed because of the Tenant’s rate-limits
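
To check these counters without a dashboard, the text returned by /actuator/prometheus can be parsed directly. The sketch below reads the Prometheus text exposition format and lists samples whose value is above zero for a given stat. It assumes that each stat is exposed as a label named statsName and that samples carry no trailing timestamp. Both are assumptions to verify against your own endpoint output, and the helper names are hypothetical.

```python
import re

# One sample per line: metric_name{label="value",...} value
SAMPLE_LINE = re.compile(
    r"^(?P<name>[A-Za-z_:][\w:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples from Prometheus exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # e.g. samples with a trailing timestamp
        labels = dict(LABEL.findall(match["labels"] or ""))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def nonzero(samples, stats_name):
    """List (metric name, value) pairs whose assumed statsName label matches and value > 0."""
    return [
        (name, value)
        for name, labels, value in samples
        if labels.get("statsName") == stats_name and value > 0
    ]
```

For example, nonzero(parse_metrics(body), "failedMsgs") lists every metric that has recorded failed messages, where body is the response text fetched from the metrics endpoint.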

You can import preconfigured Grafana dashboards from this repository.

Grafana dashboards are also available when deploying the ThingsBoard Docker Compose cluster. See the Docker Compose cluster setup guide for details. Set MONITORING_ENABLED to true before deployment. Once running, Prometheus is available at http://localhost:9090 and Grafana at http://localhost:3000 (default credentials: admin / foobar).

Sometimes after configuring OAuth you cannot see the button for logging in with an OAuth provider. This happens when Domain name and Redirect URI Template contain faulty values — they need to match the URL you use to access your ThingsBoard web page.

Base URL                   Domain name         Redirect URI Template
http://mycompany.com:8080  mycompany.com:8080  http://mycompany.com:8080/login/oauth2/code
https://mycompany.com      mycompany.com       https://mycompany.com/login/oauth2/code
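
The mapping in the table can be expressed as a small helper that derives both values from the base URL. This is only a sketch of the pattern shown in the two example rows. The function name oauth_fields is hypothetical, and the /login/oauth2/code path is taken from the table.

```python
from urllib.parse import urlsplit


def oauth_fields(base_url):
    """Return (domain name, redirect URI template) for a ThingsBoard base URL."""
    parts = urlsplit(base_url)
    # The domain name is host plus port, if the base URL includes a port.
    domain = parts.netloc
    redirect = f"{parts.scheme}://{parts.netloc}/login/oauth2/code"
    return domain, redirect
```

Comparing the result with the values entered in the OAuth settings is a quick way to spot a missing port or a wrong scheme.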

For OAuth2 configuration, see OAuth 2.0 Support.

  • GitHub Project — check out the project and consider contributing.
  • Stack Overflow — ask questions tagged with thingsboard; the ThingsBoard team monitors this tag.
  • Contact us — if your problem isn’t answered by any of the guides above, contact the ThingsBoard team directly.