Design Research on IoT Device Activity

In 2017 I performed design research on device health and activity for compliance policies and reporting in the Watson IoT Platform.

What is Device Activity

Businesses with IoT deployments have the challenge of ensuring that the entire IoT landscape operates within acceptable and expected boundaries. The consequences of IoT devices operating outside of defined criteria, or policies, could have a major impact on the security and operations of the overall IoT deployment. 

The Watson IoT Platform allows the configuration of specific policies in relation to connection security. Device Activity is an indicator of device health. A device activity policy defines the expected behaviour of devices and monitors for anomalies. This research identifies common types of device behaviour and assesses how well various metrics apply to each of them.

Types of Device Activity

Device Activity events are

  • A device connects to the IoT platform
  • A device disconnects from the IoT platform
  • A device submits a message to the IoT platform
  • A device responds to a command from the IoT platform

Device connection events are

  • Devices connecting / disconnecting directly over MQTT
  • Devices connecting / disconnecting directly over HTTP REST calls*
  • Devices connecting / disconnecting through a Gateway*

* Only connections over MQTT are managed by the platform with a connection state. HTTP REST API calls do not keep a connection state. A Gateway keeps a connection state for its own connection, but devices connected to the platform through the Gateway do not have individual connection states.
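The event types and the connection-state caveat above can be sketched as a small data model. This is an illustration only: the names `ActivityEvent`, `EventKind`, `Transport` and `connection_state` are hypothetical and are not part of the Watson IoT Platform API. The sketch assumes that a known connection state exists only for direct MQTT connections and reports the state as unknown otherwise.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EventKind(Enum):
    CONNECT = "connect"
    DISCONNECT = "disconnect"
    MESSAGE = "message"
    COMMAND_RESPONSE = "command_response"


class Transport(Enum):
    MQTT = "mqtt"        # direct MQTT connection, has a connection state
    HTTP = "http"        # HTTP REST calls, no connection state
    GATEWAY = "gateway"  # device reached through a gateway, no own state


@dataclass(frozen=True)
class ActivityEvent:
    device_id: str
    kind: EventKind
    transport: Transport
    timestamp: float  # seconds since some epoch (assumption)


def connection_state(events: List[ActivityEvent]) -> Optional[bool]:
    """Return True (connected), False (disconnected) or None (unknown).

    Only connect/disconnect events on direct MQTT connections are
    considered; all other events leave the state unchanged.
    """
    state = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.transport is not Transport.MQTT:
            continue
        if event.kind is EventKind.CONNECT:
            state = True
        elif event.kind is EventKind.DISCONNECT:
            state = False
    return state
```

A device that only ever submits messages over HTTP or through a gateway yields `None` here, which is why connection-based metrics are hard to apply to such devices.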

Connection Use-Cases

For devices connecting to the platform we identify four typical kinds of behaviour

  • On-line connection mode
  • Low power connection mode
  • Heartbeat connection mode
  • Gateway connection mode

For each kind above we assess the feasibility of three device activity metrics

  • Time since last connected 
  • Time since last message 
  • Service level

On-line connection mode

The most common connection mode is the continuous on-line mode. This connection mode applies to devices that directly instrument equipment to monitor and report on its state. Such devices continuously take sensor readings and send the sensor data to the cloud.

Examples of such devices are sensors in factory-floor, medical and environmental applications. Such devices connect to the cloud and stay continuously connected to transfer new data. When disconnected, for example by a network disruption, the device seeks to re-establish the cloud connection as soon as the network is restored. In summary

  • Device types – Sensors for continuous state monitoring. 
  • Device behaviour – On-line, periodic state events, instant reconnect.

Three metrics can be used to determine device activity and health.

  • Time since last connected 
  • Time since last message 
  • Service level

The time since last connected metric does not capture this device behaviour well. In the ideal case, a well-functioning device never disconnects, so its time since last connected grows without bound.

The time since last message is a more meaningful metric. If network connectivity is lost, messages are not received by the cloud, which indicates a device fault. If the device does not store and forward messages, messages may be lost whenever a sample interval coincides with a network failure.

A service level metric, defined as the ratio of connected time to total elapsed time, can be used to measure the reliability of the device over time and hence its health. The service level metric will catch recurring network failures.
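The three metrics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's implementation: the function names are hypothetical, timestamps are plain numbers in seconds, and connected periods are given as non-overlapping `(connect_time, disconnect_time)` pairs, with `None` marking a connection that is still open.

```python
def time_since_last(timestamps, now):
    """Seconds elapsed since the most recent timestamp, or None if empty.

    Serves for both 'time since last connected' (pass connect times)
    and 'time since last message' (pass message times).
    """
    return now - max(timestamps) if timestamps else None


def service_level(connected_intervals, window_start, window_end):
    """Fraction of the window [window_start, window_end] spent connected."""
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window must have positive length")
    connected = 0.0
    for start, end in connected_intervals:
        # An open connection counts as connected until the window ends.
        end = window_end if end is None else end
        # Clip each interval to the observation window.
        overlap = min(end, window_end) - max(start, window_start)
        connected += max(0.0, overlap)
    return connected / window
```

For a device that was connected from 0 to 40 and again from 60 onward, observed over a window from 0 to 100, the service level is 0.8. A device that drops its connection repeatedly shows a falling service level even when each individual outage is short.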

Figure: Comparing messaging behaviour and metrics for healthy and unhealthy devices that use an on-line connection mode.

Low power connection mode

In applications using battery-powered devices, a more conservative power consumption design has to be applied. Devices will be challenged to stay on-line due to limitations in battery capacity, power consumption and radio signal strength. Examples are battery-powered devices for Sigfox / LoRa networks. Such devices connect only briefly, send a short state event message, and disconnect. In summary

  • Device types – Sensors for low-energy, low-frequency state monitoring. 
  • Device behavior – Connect, send state event, disconnect.

Two metrics can be used to determine device activity and health.

  • Time since last connected
  • Time since last message

The time since last connected metric works well for detecting missing messages, since these devices connect only to send a message. The time since last message metric is nevertheless more meaningful, as it measures the reporting behaviour directly. A service level metric is not applicable, as the normal behaviour of the device is to stay disconnected.

Figure: Comparing messaging behaviour and metrics for healthy and unhealthy devices that use an off-line low power connection mode.

Heartbeat connection mode

Devices used for monitoring might be designed to send alert messages only when some exceptional condition is met. Examples are leak sensors, panic buttons, door alarms and sensors in appliances. To improve reliability, such devices often send heartbeat messages indicating that the device / appliance is healthy. Some devices are designed to be on-line; others are constrained to conservative power or network use and only connect on demand to send a heartbeat or an alert. In summary

  • Device types – Leak sensor, Panic button, Door alarms, Appliances
  • Device behaviour – Continuous on-line / off-line (low power) monitoring and event-based alerts; edge analytics devices.

Two metrics can be used to determine device activity and health.

  • Time since last connected
  • Time since last message

The time since last connected metric works well for off-line device behaviour (as above). The time since last message / heartbeat is more meaningful. A service level metric is not applicable, as the device behaviour may be to stay disconnected.
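A heartbeat-based check can be sketched as counting how many expected heartbeats have gone missing. The following is an illustration under assumptions that do not come from the research itself: the function names are hypothetical, any message (heartbeat or alert) is treated as a sign of life, and a device is considered healthy while it has missed no more than `tolerated_misses` heartbeats.

```python
import math


def missed_heartbeats(last_seen, now, heartbeat_interval):
    """Number of full heartbeat intervals elapsed since the last message."""
    if heartbeat_interval <= 0:
        raise ValueError("heartbeat_interval must be positive")
    elapsed = max(0.0, now - last_seen)
    return math.floor(elapsed / heartbeat_interval)


def is_healthy(last_heartbeat, last_alert, now, heartbeat_interval,
               tolerated_misses=1):
    """Assumption: both heartbeats and alerts count as a sign of life."""
    seen = [t for t in (last_heartbeat, last_alert) if t is not None]
    if not seen:
        return False  # the device has never been heard from
    missed = missed_heartbeats(max(seen), now, heartbeat_interval)
    return missed <= tolerated_misses
```

Tolerating one missed heartbeat avoids flagging a device whose message is merely delayed; whether that tolerance is appropriate depends on the application.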

Figure: Comparing messaging behaviour and metrics for healthy and unhealthy devices that use an on-line / off-line heartbeat connection mode.

Gateways

Gateways are network devices that connect a variety of devices to the cloud. Gateways are used in many connection scenarios, for example connecting devices over a range of protocols, implementing a bridge from the edge to the cloud, or even bridging across IoT platforms. Gateways may also add messaging capabilities to improve reliability, such as storing messages in a buffer while the gateway is disconnected and forwarding the buffered messages once the gateway reconnects. A gateway thus improves device reliability, but it also skews the metrics on device activity and health.

A Gateway is a device that connects to the platform. It hence uses one of the three connection modes discussed above. The most common case is the on-line connection mode. The gateway will hence be given a device activity policy that differs from the policies of the devices it acts on behalf of.

Also, as discussed above, a Gateway keeps a connection state for its own connection, but devices connected to the platform through the Gateway do not have individual connection states. This makes a device health metric based on time since last connected problematic.

A device activity metric based on time since last message is more valuable, as it can be applied to individual devices as well as to the gateway itself. Such policies should be set individually for the gateway and for the devices that connect through the gateway. Some devices may have a short time interval between messages, others may have longer time intervals, as indicated in the figure below.

Figure: Comparing messaging behaviour and metrics for healthy and unhealthy devices that connect to the cloud using a gateway.
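Setting the time since last message policy individually for the gateway and for each device behind it could look like the following sketch. The function name `stale_devices` and both dictionaries are hypothetical; the sketch assumes the platform can attribute each message to the originating device even when it arrives through the gateway.

```python
def stale_devices(last_message_at, max_silence, now):
    """Return the IDs whose silence exceeds their own threshold.

    last_message_at: device ID -> timestamp of the last message received
    max_silence:     device ID -> allowed seconds without a message
    A device with a policy but no recorded message counts as stale.
    """
    never = float("-inf")
    return sorted(
        device_id
        for device_id, limit in max_silence.items()
        if now - last_message_at.get(device_id, never) > limit
    )
```

With a limit of 5 seconds for a fast sensor and 3600 seconds for a slow one, the same 90 seconds of silence flags the first but not the second, while the gateway is judged against its own limit.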

User research 

In our design research we find the following consensus among our design partners

  • Importance of a Device Activity policy: Very Important
  • Most important device activity to measure: Time since last message
  • Minimum time window for this policy: Set by device type or instance, with a minimum time interval of 1 hour
  • Importance of supporting devices connecting through a Gateway: Important
  • Most common device behaviour: Always connected, periodic messages | disconnected in energy save mode | other protocols
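The finding that the time window should be set by type or by instance, with a minimum of one hour, can be sketched as a simple lookup. Several details here are assumptions rather than research results: the names are hypothetical, an instance-level setting is taken to override the type-level one, and windows shorter than one hour are raised to the minimum rather than rejected.

```python
MIN_WINDOW_SECONDS = 3600  # the 1 h minimum reported by design partners


def resolve_window(device_id, device_type, instance_windows, type_windows):
    """Return the policy window in seconds, or None if no policy applies.

    Assumption: a window set for the device instance takes precedence
    over the window set for its device type.
    """
    window = instance_windows.get(device_id, type_windows.get(device_type))
    if window is None:
        return None
    # Assumption: too-short windows are clamped to the minimum.
    return max(window, MIN_WINDOW_SECONDS)
```

A device with no instance-level window falls back to the window of its type, and a device with neither is left unmonitored by this policy.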

Related Designs

Risk and Security Management Design. 
Read more about the design of the Watson IoT Platform Risk and Security Management policies.
Device Activity Policy Design. 
Read more about the design of the Device Activity Policy for the Watson IoT Platform Risk and Security Management policies.