In 2017 I performed design research on device health and activity for compliance policies and reporting in the Watson IoT Platform.
What is Device Activity
Businesses with IoT deployments face the challenge of ensuring that the entire IoT landscape operates within acceptable and expected boundaries. Devices that operate outside of defined criteria, or policies, could have a major impact on the security and operations of the overall IoT deployment.
The Watson IoT Platform allows the configuration of specific policies in relation to connection security. Device Activity is an indicator of device health. A device activity policy defines the expected behavior of devices and monitors for anomalies. This research identifies types of common device behavior and assesses the feasibility of various metrics for such behaviors.
Types of Device Activity
Device Activity events are:
- A device connects to the IoT platform
- A device disconnects from the IoT platform
- A device submits a message to the IoT platform
- A device responds to a command from the IoT platform
Device connection events are:
- Devices connecting/disconnecting directly over MQTT
- Devices connecting/disconnecting directly over HTTP REST calls*
- Devices connecting/disconnecting through a Gateway*
* Only connections over MQTT are managed by the platform with a connection state. HTTP REST API calls do not keep a connection state. A Gateway keeps a connection state for its connection, but devices connected to the platform through the Gateway do not manage their individual connection states.
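As a rough illustration of the note above, the three ways of reaching the platform can be modeled together with whether the platform keeps a per-device connection state. This is a minimal sketch: the names `Transport` and `has_connection_state` are hypothetical and are not part of any Watson IoT Platform API.

```python
from enum import Enum


class Transport(Enum):
    """How a device reaches the platform (illustrative names only)."""
    MQTT_DIRECT = "mqtt"
    HTTP_REST = "http"
    VIA_GATEWAY = "gateway"


def has_connection_state(transport: Transport) -> bool:
    """Return True if a per-device connection state is available.

    Follows the note above: only direct MQTT connections are managed with
    a connection state. HTTP REST calls are stateless, and devices behind
    a Gateway share the Gateway's connection rather than having their own.
    """
    return transport is Transport.MQTT_DIRECT


for transport in Transport:
    print(transport.name, has_connection_state(transport))
```

Metrics that rely on connect and disconnect events would only be meaningful for transports where this check returns true.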
Connection Use-Cases
For devices connecting to the platform, we identify four typical kinds of behavior:
- Online connection mode
- Low power connection mode
- Heartbeat connection mode
- Gateways connection mode
For each kind above, we assess the feasibility of three device activity metrics:
- Time since last connected
- Time since the last message
- Service level
Online connection mode
The most common connection mode is the continuous online mode. This connection mode applies to devices that directly instrument equipment to monitor and report on its state. Devices continuously take sensor readings and send sensor data to the cloud.
Examples of such devices are sensors on the factory floor and in medical and environmental applications. Such devices connect to the cloud and stay continuously connected to transfer new data. When disconnected, for example by a network disruption, the device seeks to re-establish the cloud connection as soon as the network is restored. In summary:
- Device types – Sensors for continuous state monitoring.
- Device behavior – Online, periodic state events, instant reconnect.
Three metrics can be used to determine device activity and health.
- Time since last connected
- Time since the last message
- Service level
The time since last connected metric does not capture this device behavior well. In an ideal situation, a well-working device never disconnects, so its time since last connected grows without bound. A large value therefore indicates a stable device, not a faulty one.
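The point about time since last connected can be made concrete with a small sketch. The function name and the example timestamps are assumptions chosen for illustration; they are not taken from the platform.

```python
from datetime import datetime, timedelta, timezone


def time_since_last_connected(last_connect: datetime, now: datetime) -> timedelta:
    """Elapsed time since the device last established a connection."""
    return now - last_connect


# Arbitrary reference time for the example.
now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)

# Healthy online device: connected 90 days ago and never dropped.
healthy = time_since_last_connected(now - timedelta(days=90), now)

# Unstable device: keeps dropping, last reconnected 5 minutes ago.
unstable = time_since_last_connected(now - timedelta(minutes=5), now)

print(healthy, unstable)
```

For an always-online device the healthy case yields the larger value, so a "too long since last connect" rule would flag the wrong device.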
The time since the last message is a more meaningful metric. If network connectivity is lost, messages will not be received by the cloud, so a growing time since the last message indicates a fault in the device or its connection. If the device does not store and forward messages, there is a risk that messages will be lost when a sample is taken during a network failure.
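A check on time since the last message might look like the sketch below. The expected reporting interval and the tolerance factor are assumptions that a policy would need to supply; `message_overdue` is a hypothetical helper, not a platform function.

```python
from datetime import datetime, timedelta, timezone


def message_overdue(
    last_message: datetime,
    now: datetime,
    expected_interval: timedelta,
    tolerance: float = 2.0,
) -> bool:
    """Return True if no message arrived within the tolerated interval.

    `tolerance` is an assumed safety factor: with 2.0, one missed sample
    is accepted before the device is flagged.
    """
    return (now - last_message) > expected_interval * tolerance


now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
interval = timedelta(seconds=60)  # assumed: the device reports once a minute

# Last message 30 seconds ago: within the expected interval.
print(message_overdue(now - timedelta(seconds=30), now, interval))

# Last message 10 minutes ago: well past the tolerated gap.
print(message_overdue(now - timedelta(minutes=10), now, interval))
```

Choosing a tolerance above 1.0 keeps a single lost message from raising an alert, which matters for devices that do not store and forward.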
A service level metric, defined as the ratio of connected time to total elapsed time, can be used to measure the reliability of the device over time and hence its health. The service level metric will catch recurring network failures.
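The service level described above can be sketched as the fraction of an observation window during which the device was connected. The representation of connected periods as (start, end) pairs is an assumption made for this example, and the intervals are assumed not to overlap.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def service_level(
    connected: Iterable[Tuple[datetime, datetime]],
    window_start: datetime,
    window_end: datetime,
) -> float:
    """Connected time divided by total time within the window (0.0 to 1.0).

    `connected` holds non-overlapping (connect, disconnect) pairs.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    up = 0.0
    for start, end in connected:
        # Clip each connected period to the observation window.
        s = max(start, window_start)
        e = min(end, window_end)
        if e > s:
            up += (e - s).total_seconds()
    return up / total


day_start = datetime(2017, 6, 1, tzinfo=timezone.utc)
day_end = day_start + timedelta(hours=24)

# Connected for hours 0 to 10 and 12 to 24: one two-hour outage.
periods = [
    (day_start, day_start + timedelta(hours=10)),
    (day_start + timedelta(hours=12), day_end),
]
level = service_level(periods, day_start, day_end)
print(round(level, 3))
```

Because the ratio accumulates over the whole window, many short outages lower it just as one long outage does, which is how recurring failures become visible.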
Figure: Messaging behavior and metrics for healthy and unhealthy devices that use an online connection mode.