In 2017 I performed design research on device health and activity for compliance policies and reporting in the Watson IoT Platform.
What is Device Activity
Businesses with IoT deployments have the challenge of ensuring that the entire IoT landscape operates within acceptable and expected boundaries. The consequences of IoT devices operating outside of defined criteria, or policies, could have a major impact on the security and operations of the overall IoT deployment.
The Watson IoT Platform allows the configuration of specific policies in relation to connection security. Device Activity is an indicator of device health. A device activity policy defines the expected behavior of devices and monitors for anomalies. This research identifies types of common device behavior and assesses the feasibility of various metrics for such behaviors.
Types of Device Activity
Device Activity events are
- A device connects to the IoT platform
- A device disconnects from the IoT platform
- A device submits a message to the IoT platform
- A device responds to a command from the IoT platform
Device connection events are
- Devices connecting/disconnecting directly over MQTT
- Devices connecting/disconnecting directly over HTTP REST calls*
- Devices connecting/disconnecting through a Gateway*
* Only connections over MQTT are managed by the platform with a connection state. HTTP REST API calls do not keep a connection state. A Gateway keeps a connection state for its connection, but devices connected to the platform through the Gateway do not manage their individual connection states.
Connection Use-Cases
For devices connecting to the platform, we identify four typical kinds of behaviors
- Online connection mode
- Low power connection mode
- Heartbeat connection mode
- Gateways connection mode
For each kind above we assess the feasibility of three device activity metrics
- Time since last connected
- Time since the last message
- Service level
Online connection mode
The most common connection mode is the continuous online mode. This connection mode applies to devices that directly instrument equipment to monitor and report on its state. Devices continuously take sensor readings and send sensor data to the cloud.
Examples of such devices are sensors in factory-floor, medical, and environmental applications. Such devices connect to the cloud and stay continuously connected to transfer new data. When disconnected, for example by a network disruption, the device seeks to re-establish the cloud connection when the network is restored. In summary,
- Device types – Sensors for continuous state monitoring.
- Device behavior – Online, periodic state events, instant reconnect.
Three metrics can be used to determine device activity and health.
- Time since last connected
- Time since the last message
- Service level
The time since last connected metric does not capture this device behavior well. A well-working device would ideally never disconnect, so its time since last connected keeps growing and a large value does not indicate a fault.
The time since the last message is a more meaningful metric. If the network connectivity is lost, messages will not be received by the cloud, hence indicating a device fault. If the device does not store and forward messages, there is a risk that messages will be lost in case the sample interval coincides with a network failure.
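As a minimal sketch of the time since the last message metric, the check below compares the silence of a device against a threshold. The function names, the timestamps, and the one-hour threshold are illustrative assumptions, not part of the Watson IoT Platform API.

```python
from datetime import datetime, timedelta, timezone


def time_since_last_message(last_message_at, now):
    """Return how long the device has been silent."""
    return now - last_message_at


def is_stale(last_message_at, now, max_silence):
    """True if the device has been silent longer than the policy allows."""
    return time_since_last_message(last_message_at, now) > max_silence


# Hypothetical timestamps, for illustration only.
now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
healthy_last = datetime(2017, 6, 1, 11, 58, tzinfo=timezone.utc)
silent_last = datetime(2017, 6, 1, 9, 0, tzinfo=timezone.utc)
max_silence = timedelta(hours=1)  # assumed policy threshold

print(is_stale(healthy_last, now, max_silence))  # False
print(is_stale(silent_last, now, max_silence))   # True
```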
A service level metric, defined as the ratio of connected time to total elapsed time, measures the reliability of the device over time and hence its health. The service level metric will catch recurring network failures.
Comparing messaging behavior and metrics for healthy and unhealthy devices that use an online connection mode.
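The service level metric described above can be sketched as the fraction of an observation window during which the device was connected. The interval representation (connect and disconnect times in seconds, with `None` for a connection that is still open) is an assumption made for this example; the intervals are assumed not to overlap.

```python
def service_level(intervals, window_start, window_end):
    """Fraction of the window during which the device was connected.

    intervals: list of (connect_time, disconnect_time) pairs in seconds.
    A disconnect_time of None means the device is still connected.
    Intervals are assumed to be non-overlapping.
    """
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    connected = 0.0
    for start, end in intervals:
        if end is None:
            end = window_end
        # Clip each connection interval to the observation window.
        lo = max(start, window_start)
        hi = min(end, window_end)
        if hi > lo:
            connected += hi - lo
    return connected / total


# One-hour window: connected for the first 30 minutes, then a
# 15-minute outage, then connected again until the window ends.
level = service_level([(0, 1800), (2700, None)], 0, 3600)
print(level)  # 0.75
```

A device with recurring outages shows a falling service level even if every individual outage is short enough to pass a time-since-last-message check.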
Low power connection mode
In applications using battery-powered devices, a more conservative power consumption design has to be applied. Such devices are challenged to stay online due to limitations in battery capacity, power consumption, and radio signal strength. Examples are battery-powered devices for Sigfox / LoRa networks. Such devices only connect, send a short state event message, and disconnect. In summary
- Device types – Sensors for low-energy, low-frequency state monitoring.
- Device behavior – Connect, send state event, disconnect.
Two metrics can be used to determine device activity and health.
- Time since last connected
- Time since the last message
The time since last connected metric works well for detecting missing messages from devices, because each message requires a new connection. The time since last message metric is still more meaningful, as it confirms that a message actually arrived. A service level metric is not applicable, as the device is expected to stay disconnected most of the time.
Comparing messaging behavior and metrics for healthy and unhealthy devices that use an off-line low power connection mode.
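For a low power device that reports on a fixed schedule, the time since the last message can be expressed as a number of missed reports. The sketch below assumes a known reporting interval and a tolerance of one missed report; both values and the function names are hypothetical.

```python
def missed_reports(seconds_since_last_message, report_interval_s):
    """Number of scheduled reports that should have arrived but did not."""
    return int(seconds_since_last_message // report_interval_s)


def low_power_status(seconds_since_last_message, report_interval_s,
                     allowed_missed=1):
    """Classify a device by how many scheduled reports it has missed."""
    missed = missed_reports(seconds_since_last_message, report_interval_s)
    return "healthy" if missed <= allowed_missed else "unhealthy"


HOUR = 3600
interval = 6 * HOUR  # assumed: the device reports every six hours

print(low_power_status(5 * HOUR, interval))   # healthy (no report missed)
print(low_power_status(7 * HOUR, interval))   # healthy (one report missed)
print(low_power_status(13 * HOUR, interval))  # unhealthy (two reports missed)
```

Tolerating one missed report avoids flagging a device whose single transmission was lost on a weak radio link.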
Heartbeat connection mode
Devices used for monitoring might be designed to send alert messages only when some exceptional condition is met. Examples are leak sensors, panic buttons, door alarms, and sensors in appliances. To improve reliability, such devices often send heartbeat messages indicating that the device or appliance is healthy. Some devices are designed to be online; others are constrained to conservative power or network use and only connect on demand to send a heartbeat or an alert. In summary
- Device types – Leak sensor, Panic button, Door alarms, Appliances
- Device behavior – Continuous Online / Offline (low power) monitoring and event-based alerts. Edge analytics device.
Two metrics can be used to determine device activity and health.
- Time since last connected
- Time since the last message
The time since last connected metric works well for offline device behavior (as above). Time since the last message/heartbeat is more meaningful. A service level metric is not applicable as the device behavior may be to stay disconnected.
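A heartbeat policy can be sketched as a monitor that treats both heartbeats and alerts as signs of life and reports devices that have been silent for too long. The class, the message kinds, and the device names below are assumptions for illustration, not a platform interface.

```python
class HeartbeatMonitor:
    """Tracks the latest heartbeat or alert per device (times in seconds)."""

    def __init__(self, max_silence_s):
        self.max_silence_s = max_silence_s
        self.last_seen = {}

    def record(self, device_id, kind, timestamp_s):
        """Register a message; heartbeats and alerts both count as activity."""
        if kind not in ("heartbeat", "alert"):
            raise ValueError(f"unknown message kind: {kind}")
        previous = self.last_seen.get(device_id)
        if previous is None or timestamp_s > previous:
            self.last_seen[device_id] = timestamp_s

    def silent_devices(self, now_s):
        """Return the ids of devices silent longer than the threshold."""
        return sorted(
            device_id
            for device_id, seen in self.last_seen.items()
            if now_s - seen > self.max_silence_s
        )


monitor = HeartbeatMonitor(max_silence_s=3600)
monitor.record("leak-1", "heartbeat", 0)
monitor.record("door-7", "heartbeat", 0)
monitor.record("door-7", "alert", 3000)

print(monitor.silent_devices(now_s=4000))  # ['leak-1']
```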
Gateways
Gateways are network devices that connect a variety of devices to the cloud. Gateways are used in many connection scenarios, for example connecting devices over a range of protocols, implementing a bridge from the edge to the cloud, or even a bridge across IoT platforms. Gateways may also add messaging capabilities to improve reliability, such as storing messages in a buffer while the gateway is disconnected and forwarding the buffered messages once the gateway reconnects. A gateway improves device reliability, but it also skews metrics on device activity and health.
A Gateway is a device that connects to the platform. It hence uses one of the three connection modes discussed above. The most common case is the online connection mode. The gateway will hence be given a device activity policy that is different from the policies of the devices that the gateway acts on behalf of.
Also, as discussed above, a Gateway keeps a connection state for its connection, but devices connected to the platform through the Gateway do not manage their individual connection states. This makes a device’s health metrics based on time since last connected problematic.
A device activity metric based on time since the last message is more valuable, as it applies to individual devices as well as to the gateway itself. Such policies should be set individually for the gateway and the devices that connect through the gateway. Some devices may have a short time interval between messages, others may have longer time intervals, as indicated in the figure below.
Comparing messaging behavior and metrics for healthy and unhealthy devices that connect to the cloud using a gateway.
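Setting policies individually for a gateway and the devices behind it can be sketched as a per-device threshold table evaluated against the last message time of each device. The device ids, thresholds, and status labels are hypothetical.

```python
def evaluate_policies(max_silence_s, last_message_s, now_s):
    """Evaluate a time-since-last-message policy for each device.

    max_silence_s: device id -> allowed silence in seconds.
    last_message_s: device id -> time of the last message in seconds.
    """
    status = {}
    for device_id, limit in max_silence_s.items():
        last = last_message_s.get(device_id)
        if last is None:
            status[device_id] = "no data"
        elif now_s - last > limit:
            status[device_id] = "unhealthy"
        else:
            status[device_id] = "healthy"
    return status


HOUR = 3600
# Assumed thresholds: the gateway and each device get their own limit.
policies = {"gateway-1": 1 * HOUR, "thermo-a": 2 * HOUR, "meter-b": 24 * HOUR}
last_seen = {"gateway-1": 9 * HOUR, "thermo-a": 6 * HOUR, "meter-b": 0}

result = evaluate_policies(policies, last_seen, now_s=10 * HOUR)
print(result)
# {'gateway-1': 'healthy', 'thermo-a': 'unhealthy', 'meter-b': 'healthy'}
```

Here the gateway is healthy while one device behind it is not, which a connection-state metric on the gateway alone would not reveal.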
User research
In our design research, we find the following consensus among our design partners
- Importance of a Device Activity policy: Very Important
- Most important device activity to measure: Time since last message
- Minimum time window set for this policy: Set by type or instance. Minimally 1h time interval.
- Importance to support devices connecting through a Gateway: Important
- Most common device behavior: Always connected, periodic messages | disconnected in energy save mode | other protocols
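The finding on policy windows can be sketched as a lookup in which an instance-level setting overrides the device type default. Reading "minimally 1h" as a lower bound of one hour on the window is an assumption of this example, as are all names and values.

```python
MIN_WINDOW_S = 3600  # assumed lower bound: one hour


def resolve_window(device_id, device_type, type_windows, instance_windows):
    """Return the policy window in seconds for a device.

    An instance-level window takes precedence over the type-level window.
    """
    if device_id in instance_windows:
        window = instance_windows[device_id]
    elif device_type in type_windows:
        window = type_windows[device_type]
    else:
        raise KeyError(f"no policy window for {device_id} ({device_type})")
    if window < MIN_WINDOW_S:
        raise ValueError("policy window must be at least one hour")
    return window


type_windows = {"thermostat": 2 * 3600}
instance_windows = {"thermo-lobby": 6 * 3600}

print(resolve_window("thermo-lobby", "thermostat", type_windows, instance_windows))  # 21600
print(resolve_window("thermo-attic", "thermostat", type_windows, instance_windows))  # 7200
```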
Related Designs
Risk and Security Management Design.
Read more about the design of the Watson IoT Platform Risk and Security Management policies.
Device Activity Policy Design.
Read more about the design of the Device Activity Policy design for the Watson IoT Platform Risk and Security Management policies.