In July, we had the second major incident this year affecting the flespi platform's network availability. Our monthly uptime immediately dropped from 100% to 99.8094%, and our Enterprise and Ultimate plan users once again received automatic 30% and 70% discounts on their monthly flespi fee under our SLA uptime guarantee.
Compared to the previous incident in April, this one was of a different nature but looked very similar on the surface. On July 21, early in the morning – at 04:27 AM EEST – one of the racks at our network connectivity provider was disconnected from the power supply. By a very unfortunate coincidence, that exact rack housed a critical networking segment for us. The resulting blackout immediately cut off access to one of our datacenters from the Internet and also disrupted connectivity between both parallel datacenters.
Approximately an hour later, the local engineering team in the datacenter resolved the incident by restoring power to the rack housing the core networking equipment. After 1 hour and 25 minutes, the flespi platform was reachable from the Internet again, and connectivity between our datacenters was restored. By that time, we expected a huge after-load once devices started sending their accumulated messages to channels; however, the lessons learned from the previous incident and the changes already implemented allowed us to resume smooth operation within minutes of the restoration. To rule out any unnoticed impact, we also resynchronized all data caches the same day, just in case.
Looking into the internal logs later, we noticed that our second datacenter remained operable and correctly served read-only REST API calls; however, PUT, POST, and PATCH operations on the primary entities were unavailable and returned 500 response codes. Devices configured to use the channel domain name also operated smoothly; their telematics data and remote control remained reachable via MQTT and the REST API served from this datacenter as well.
Still, half of the DNS queries resolved to IP addresses served by the unreachable datacenter, so depending on the client library you use and the type of API calls you make, flespi may or may not have worked for you during this incident. We had the option to quickly switch DNS to the single operating datacenter but decided to wait for the datacenter engineers' resolution first.
We still haven't received the full report on the causes of the rack power outage. And again, by an unlucky coincidence, we were already in the process of replacing the router hardware uplinks with redundant 4x10G bonded links which, once provisioned, will make incidents of this nature a non-issue for us. So, from the networking configuration perspective, the upgrade is already in progress; we were simply unlucky to hit the incident before we secured it.
On the positive side, we now know that our platform is much better prepared for the next datacenter blackout, and we know what to do to improve it further. The most valuable components of our platform are its circuit breakers, retries, and automated fallbacks, the majority of which were built in response to this and other outages. No simulation can replace real experience gathered through pain and failures.
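To illustrate the general pattern, here is a minimal sketch of a retry loop with an automated fallback, the kind of resilience primitive mentioned above. It is purely illustrative and not flespi's actual implementation; every name in it is made up.

```python
# Illustrative only: a generic retry-with-fallback helper, not flespi's code.
import time

def call_with_fallback(primary, fallback, retries=3, delay=0.5):
    """Try the primary endpoint a few times, then fall back to the secondary."""
    for attempt in range(retries):
        try:
            return primary()
        except ConnectionError:
            time.sleep(delay * (2 ** attempt))  # exponential backoff between retries
    # primary considered unreachable -- switch to the fallback
    return fallback()
```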
Back to platform engineering. In July, despite the active vacation season, we continued to steadily improve tacho and video functionality, adopting various devices and protocols running both freshly released and legacy firmware. You can find detailed descriptions of what has been done in the corresponding protocol changelog. For example, the ruptela changelog lists a large number of updates and features committed for the tacho functionality.
One of the most interesting features, and one that can be valuable for your integration, is the possibility to attach custom meta information to commands sent to devices. This significantly improves the traceability of remote device management operations.
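As a rough sketch, queuing a command with extra meta information via the REST API could look like the snippet below. The `meta` property name here is an assumption based on the feature description, so check the gw API documentation for the exact field; the device id and command body are placeholders.

```python
# A minimal sketch: queue a device command carrying custom meta information.
# The "meta" field name is an assumption; verify it against the gw API docs.
import requests

FLESPI_TOKEN = "your-flespi-token"
DEVICE_ID = 12345  # hypothetical device id

command = {
    "name": "custom",                     # hypothetical command name
    "properties": {"payload": "reboot"},  # protocol-specific command body
    "meta": {"ticket": "OPS-4211", "issued_by": "dispatcher-7"},  # custom traceability info
}
resp = requests.post(
    f"https://flespi.io/gw/devices/{DEVICE_ID}/commands-queue",
    headers={"Authorization": f"FlespiToken {FLESPI_TOKEN}"},
    json=[command],  # the gw API accepts an array of commands
)
resp.raise_for_status()
```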
For assets, we added the concept of an active (ongoing) interval, changes to which are published to the MQTT retained state topic 'flespi/state/gw/assets/{id}/active'. With this enhancement, it is now possible to use asset intervals for future planning and to stay current on which device is assigned to the asset (e.g., a device assigned to a trailer or driver) and how it changes in real time.
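For instance, a minimal sketch of reading that retained state with paho-mqtt might look like this; the token and asset id are placeholders, and the broker address and token-as-username authentication follow the standard flespi MQTT broker conventions:

```python
# A minimal sketch: read the retained active-interval state of an asset
# from the flespi MQTT broker. Token and asset id are placeholders.
import paho.mqtt.subscribe as subscribe

FLESPI_TOKEN = "your-flespi-token"
ASSET_ID = 12345  # hypothetical asset id

# simple() returns the first message on the topic; for a retained state
# topic that is the current value, delivered immediately on subscribe
msg = subscribe.simple(
    f"flespi/state/gw/assets/{ASSET_ID}/active",
    hostname="mqtt.flespi.io",
    port=1883,
    auth={"username": FLESPI_TOKEN, "password": ""},
)
print(msg.topic, msg.payload.decode())
```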
We already had a whitelisted IP addresses list in the channel configuration, and now we have added the possibility to configure blacklisted IP addresses, i.e., the IP addresses or networks from which the channel will drop TCP connections and UDP packets.
And we introduced a new maximum-UDP-packets-per-minute limit for channels to protect them from UDP floods.
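As a rough sketch, applying both settings to a channel via the REST API could look like this; the configuration field names ("blacklist", "udp_packets_limit") are assumptions for illustration, so consult the channel schema in the flespi panel for the real ones:

```python
# A minimal sketch: update a channel's IP blacklist and UDP rate limit.
# Field names "blacklist" and "udp_packets_limit" are assumptions.
import requests

FLESPI_TOKEN = "your-flespi-token"
CHANNEL_ID = 12345  # hypothetical channel id

resp = requests.put(
    f"https://flespi.io/gw/channels/{CHANNEL_ID}",
    headers={"Authorization": f"FlespiToken {FLESPI_TOKEN}"},
    json={
        "blacklist": ["203.0.113.0/24"],  # drop TCP connections and UDP packets from this network
        "udp_packets_limit": 600,         # hypothetical per-minute UDP packet cap
    },
)
resp.raise_for_status()
print(resp.json())
```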
Jan did a great overview of the upcoming Telematics & Connected Mobility conference agenda. We ruthlessly cut all kinds of marketing fluff and left only the best selection of speakers, well-known experts in the telematics industry. It is probably the only conference where you can meet the people who define how telematics looks today and how it should look tomorrow. Wialon, Navixy, Traccar, WebFleet, and many more leading fleet management platforms and IoT solution providers will gather for two days under one roof. Could you have imagined this at all? Just one month is left until the event. If you haven't jumped in yet, please don't wait: register today and reserve your seat. And let me remind you that 50% discount codes are still available for commercial flespi users; you can request them from our team via the flespi chat.
Just to warm up before the conference (where many OEM representatives will be available for visitors to talk to directly), I shared my own thoughts on OEM telematics from the perspective of a Telematics Service Provider that grew on aftermarket device installations, and on how OEM telematics, if developed correctly, can shape your telematics service business in the coming years.
Wishing you an uninterruptible power supply for all your datacenters, racks, and the servers in them. And of course, have a great August and enjoy the remaining summer (if you are in the Northern Hemisphere)!