17 April, 2019

Undercover agents or how we keep an eye on data parsing errors

The smart process helping the flespi team to effectively fix protocol inconsistencies and make users happy.

You might have found yourself in the following situation:
... once you connect a new device to a flespi channel, your channel log in Toolbox is flooded with parsing errors
... nothing seems to be working
... sometimes even device messages are not registered
... but in a little while everything starts working as if by itself — errors disappear, device messages are registered correctly and go their way through the pipeline.

Looks like magic. But every act of magic is only partly magical. The major part of the magic is done for you by the flespi team with the help of flespi technologies.

The flespi team always strives to build the best platform, and that applies to telematics data parsing as well. That’s why we created a special protocol description technology (PVM) that helps us quickly integrate new protocols into the flespi telematics hub.

But fully integrating 300+ GPS trackers at a time is as impossible as swallowing an elephant in one gulp (unless you are a giant python, of course). Moreover, this would be a short-sighted and unsustainable approach to protocol development. It is much wiser to support the devices and the parts of protocols that our users demand most. With such an approach, parsing errors, which are usually considered a bad thing, turn out to be a good thing: they reveal which unsupported modules and subprotocols users actually need. That's why we keep track of the data parsing errors that occur in flespi channels.

The art of error handling

To stay on top of parsing errors, we created a special service called tgerr (telematics gateway errors):

[Image: flespi parsing errors management]

tgerr works in conjunction with the tg services that execute PVM code and parse data from telematics devices. Every time a parsing error occurs, tg stores the data packet that caused the error and publishes a “tg/pvmerror” event to the flespi internal data bus, mbus. tgerr listens to “tg/pvmerror” events from mbus and notifies flespi PVM developers about new parsing errors in the Telegram group:

[Image: parsing error telegram notification]
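The flow above can be sketched in a few lines of Python. This is only an illustration, not the actual flespi code: the `Bus` class is a tiny in-memory stand-in for mbus, and the event fields, the `packet_id` key and the `notifications` list (standing in for the Telegram group) are all assumptions made for the example. Only the "tg/pvmerror" topic name comes from the description above.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Bus:
    """Tiny in-memory stand-in for a message bus (not the real mbus)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every handler subscribed to this topic.
        for handler in self._subscribers[topic]:
            handler(event)


stored_packets: Dict[int, bytes] = {}  # packets that failed to parse (hypothetical storage)
notifications: List[str] = []          # stands in for messages sent to the developers' chat


def report_parsing_error(bus: Bus, packet_id: int, channel: str,
                         error: str, raw: bytes) -> None:
    """Parsing side: keep the offending packet, then emit an event."""
    stored_packets[packet_id] = raw
    bus.publish("tg/pvmerror",
                {"packet_id": packet_id, "channel": channel, "error": error})


def on_pvm_error(event: dict) -> None:
    """Error-service side: turn each event into a developer notification."""
    notifications.append(
        f"[{event['channel']}] {event['error']} (packet {event['packet_id']})")


bus = Bus()
bus.subscribe("tg/pvmerror", on_pvm_error)
report_parsing_error(bus, 1, "queclink", "unsupported report type", b"\x00\x01")
```

The point of the split is that the parser never talks to the notification channel directly: it only stores the packet and emits an event, so the notification logic can change without touching the parsing services.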

The mechanism looks pretty straightforward so far, but what turns it into a true work of art is smart filtering that makes the notification service informative, yet not redundant or spammy: it notifies only about the events that are useful for subscribers and skips the useless noise.

Duplicate errors

Say you’ve connected a Queclink device and configured it to send a report type that is not supported in flespi yet. Each data packet of the unsupported type causes a parsing error in your queclink channel. Just imagine us getting a notification for every single one of them!

Once tgerr receives a new error, it notifies developers about it and saves the error in the cache of “active” errors for 24 hours, so the same error does not trigger another notification during the next day. What makes tgerr a true masterpiece of the notifications art is the use of HASD technology to make the errors cache persistent. tgerr synchronizes the cache with flespi's retained-messages-based storage inside the flespi MQTT broker, so the cache is always up to date and no data is lost in case of reconnects.
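A minimal sketch of such a 24-hour "active errors" cache could look like this. It is an assumption-laden illustration rather than the tgerr implementation: the `ActiveErrors` class, the string key format and the choice not to extend the 24 hours on repeated errors are all hypothetical, and the persistence in retained MQTT messages is left out entirely.

```python
import time
from typing import Callable, Dict

DAY_SECONDS = 24 * 60 * 60


class ActiveErrors:
    """In-memory cache that suppresses repeat notifications for a fixed period."""

    def __init__(self, ttl: float = DAY_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so the cache can be tested without waiting
        self._expires_at: Dict[str, float] = {}

    def should_notify(self, key: str) -> bool:
        """Return True only if this error is not currently active."""
        now = self._clock()
        expires = self._expires_at.get(key)
        if expires is not None and now < expires:
            return False  # already reported within the last `ttl` seconds
        self._expires_at[key] = now + self._ttl
        return True


# Hypothetical key: channel name plus error text.
cache = ActiveErrors()
first = cache.should_notify("queclink|unsupported report type")
repeat = cache.should_notify("queclink|unsupported report type")
```

Here `first` is `True` and `repeat` is `False`: the second identical error arrives while the first one is still active, so no new notification is needed.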

Errors beyond our control

There are some errors that we are aware of but can do nothing about, e.g. an improper implementation of the wialon_ips protocol that causes an error:

pvm_exec_func_nmea_longitude: invalid NMEA minute: '03760.0000'
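To see why this value is rejected: NMEA encodes longitude as dddmm.mmmm, so '03760.0000' reads as 37 degrees and 60.0000 minutes, while minutes must stay below 60. The check can be sketched as follows. The function below is a hypothetical Python illustration, not the PVM function named in the error message.

```python
def parse_nmea_longitude(field: str, hemisphere: str = "E") -> float:
    """Convert an NMEA dddmm.mmmm longitude field to signed decimal degrees."""
    value = field.strip()
    integer_part = value.split(".", 1)[0]
    if len(integer_part) < 3 or not integer_part.isdigit():
        raise ValueError(f"invalid NMEA longitude: {field!r}")

    # The last two integer digits (plus the fraction) are minutes,
    # everything before them is whole degrees.
    degrees = int(integer_part[:-2])
    minutes = float(value[len(integer_part) - 2:])

    if minutes >= 60.0:
        raise ValueError(f"invalid NMEA minute: {field!r}")
    if degrees > 180 or (degrees == 180 and minutes > 0.0):
        raise ValueError(f"invalid NMEA degree: {field!r}")

    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere.upper() == "W" else decimal


ok = parse_nmea_longitude("03745.0000")  # 37 degrees 45 minutes east
try:
    parse_nmea_longitude("03760.0000")   # 60 minutes is out of range
    rejected = False
except ValueError:
    rejected = True
```

A device that sends 60 whole minutes instead of rolling over to the next degree produces exactly this kind of packet, and no parser-side change can make the value valid.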

Notifications about such errors are useless, so, to avoid distracting developers, tgerr keeps a list of ignore rules that is also stored in HASD. A handy bonus of storing all this stuff in HASD is the possibility to view and edit these rules and errors in the smart and friendly MQTT Board tool:

[Image: parsing errors in mqtt board]
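The idea of ignore rules can be sketched like this. The actual rule format used by tgerr is not described here, so the `IgnoreRule` structure below (a regular expression plus an optional protocol name) is purely an assumption for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IgnoreRule:
    """Hypothetical rule: a regex on the error text, optionally tied to one protocol."""
    pattern: str
    protocol: Optional[str] = None  # None means the rule applies to any protocol

    def matches(self, protocol: str, message: str) -> bool:
        if self.protocol is not None and self.protocol != protocol:
            return False
        return re.search(self.pattern, message) is not None


def is_ignored(rules: List[IgnoreRule], protocol: str, message: str) -> bool:
    """True if at least one rule says this error should not be reported."""
    return any(rule.matches(protocol, message) for rule in rules)


RULES = [IgnoreRule(pattern=r"invalid NMEA minute", protocol="wialon_ips")]

message = "pvm_exec_func_nmea_longitude: invalid NMEA minute: '03760.0000'"
skip = is_ignored(RULES, "wialon_ips", message)
```

With this rule in place, `skip` is `True` for the wialon_ips error shown earlier, while the same message coming from another protocol would still reach developers.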

What happens next or the role of humans

Once we receive a notification about an error, the responsible developer analyzes the corresponding data packet. In some cases, we fix or enhance the protocol implementation. In other cases, we contact the device manufacturer and report bugs in the protocol implementation in the device firmware. And sometimes we contact you via Helpbox, for example, to let you know that you’ve chosen the wrong device type when creating your device instance in flespi.

***

This is how we did all the magic of smart error handling using the superpower of the flespi in-house ecosystem.

So, don’t worry about data parsing errors in your flespi channels — they are taken care of. And don’t give up using flespi if the newly connected device floods you with parsing errors instead of device messages. 

Of course, not everything can be fixed by the flespi team, but we do our best to minimize data loss and keep data reception reliable.