flespi noc (eu)
flespi eu region NOC
The last downtime was caused by a controlled primary/secondary routing switch operation performed by the uplink provider during maintenance. Something failed in this routine process; our provider will investigate and fix it.
Due to the nature of this failure, flespi network availability in both datacenters was interrupted. We apologize for any inconvenience.
#eu: downtime started, error: Failed to perform https://flespi.io GET request. Usually this indicates either flespi eu datacenter network uplink connection problem or when the platform is in the maintenance mode.
#eu: downtime ended, period: 95 second(s)
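
The automated status lines in this feed follow a fixed pattern, so downtime periods can be collected mechanically. Below is a minimal sketch, assuming the "downtime ended" lines always look exactly like the ones posted here; the function name `downtime_seconds` and the sample `feed` list are illustrative and not part of any flespi tooling.

```python
import re

# Matches the "downtime ended" lines exactly as they appear in this feed.
ENDED = re.compile(
    r"^#(?P<region>\w+): downtime ended, period: (?P<seconds>\d+) second\(s\)$"
)

def downtime_seconds(lines):
    """Return the period, in seconds, of every 'downtime ended' line."""
    periods = []
    for line in lines:
        match = ENDED.match(line.strip())
        if match:
            periods.append(int(match.group("seconds")))
    return periods

# Sample lines copied from the feed (the first one is shortened).
feed = [
    "#eu: downtime started, error: Failed to perform https://flespi.io GET request.",
    "#eu: downtime ended, period: 95 second(s)",
    "#eu: downtime ended, period: 302 second(s)",
]
periods = downtime_seconds(feed)
print(periods, sum(periods))
```

Lines that do not match, such as the "downtime started" messages and the free-text posts, are simply skipped.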
The latest downtime occurred during a network uplink channel switch, when the VXLAN equipment operated unstably. The equipment vendor is investigating the issue and will provide a resolution.
Sorry for any inconvenience!
#eu: downtime started, error: Failed to perform https://flespi.io/gw/xxx GET request. Usually this indicates either flespi telematics hub REST API overload or when the hub is in the maintenance mode.
#eu: downtime ended, period: 302 second(s)
The recent 5-minute downtime was due to a power issue at the core switch rack of the uplink provider. This incident immediately disconnected the second datacenter from both the Internet and the first datacenter. In the first datacenter, the primary impact was on the MQTT Broker system, where increased latencies were observed when processing subscriptions. The REST API system, by contrast, remained functional, particularly for keep-alive connections. DNS resolution queries for the flespi.io and mqtt.flespi.io services succeeded in 50% of cases.

MQTT is a critical component of the platform, and its disruption caused failures in the routine tests run from multiple locations. This triggered a downtime notification, despite the first datacenter being fully operational. The uplink provider is currently investigating the power issue. Concurrently, efforts are being made to enhance the resilience and performance of the multi-datacenter configuration to prevent similar issues in the future.

Apologies for any inconvenience caused!
#eu: downtime started, error: Failed to perform REST API call that performs meta-data modifications. This usually indicates that all meta-data database operations are unavailable.
#eu: downtime ended, period: 62 second(s)
The latest downtime was triggered by a failure of the server that hosts a node of our MQTT Broker service. The server suddenly disconnected from the network, and during the 1-minute timeout before the problem was automatically detected, most operations involving MQTT were performed with an additional 3 seconds of latency. This latency was long enough to be detected by the majority of our NOC uptime checking nodes, which were unable to complete all 20 test operations within 20 seconds.
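
The timing described above can be illustrated with a little arithmetic. This is only a sketch, assuming the test operations run one after another and share a single 20-second deadline; the baseline latency `BASELINE_S` and the number of MQTT-related operations `MQTT_OPS` are hypothetical values chosen for illustration, not figures from the actual NOC checks.

```python
def fits_deadline(durations, deadline_s=20.0):
    """True if operations run one after another finish within the deadline."""
    return sum(durations) <= deadline_s

BASELINE_S = 0.5   # hypothetical per-operation latency in normal conditions
EXTRA_S = 3.0      # additional latency reported for MQTT-related operations
TOTAL_OPS = 20     # test operations per check run
MQTT_OPS = 5       # hypothetical number of operations that involve MQTT

normal = [BASELINE_S] * TOTAL_OPS
degraded = ([BASELINE_S + EXTRA_S] * MQTT_OPS
            + [BASELINE_S] * (TOTAL_OPS - MQTT_OPS))

print(sum(normal), fits_deadline(normal))      # 10.0 seconds in total
print(sum(degraded), fits_deadline(degraded))  # 25.0 seconds in total
```

Under these assumed numbers, only a handful of slowed operations is enough to push a run past the deadline, even though every single operation still succeeds.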

We will continue investigating the incident on Monday and apply additional protective measures to ensure that similar problems do not affect system operation in the future.

Have a great weekend!
Dear users!

On Saturday, January 27, from 07:30 till 09:00 CET, our uplink provider is performing a major network upgrade and will switch its core network provider.
We don't expect any impact or anomalies. However, you may observe connection resets, because the provider is going to force routes to the new routers and providers.

We will keep an eye on this, monitor any changes from the flespi side, and hope that everything goes smoothly.

Have a great week,
Your flespi team!
#eu: downtime started, error: Failed to perform https://flespi.io/gw/xxx GET request. Usually this indicates either flespi telematics hub REST API overload or when the hub is in the maintenance mode.
#eu: downtime ended, period: 24 second(s)
#eu: downtime started, error: Failed to perform https://flespi.io/gw/xxx GET request. Usually this indicates either flespi telematics hub REST API overload or when the hub is in the maintenance mode.
#eu: downtime ended, period: 164 second(s)
Dear users!

We apologize for the recent downtime that partially impacted the REST subsystem, causing REST API requests to execute more slowly than usual. Our engineers have identified and resolved the issue. We appreciate your understanding and apologize for any inconvenience this may have caused.

Your flespi team!
Dear users!

This week we are upgrading our infrastructure in both data centers to the next stable OS version. Thanks to the internal architecture of the flespi platform and its distributed data centers, there should not be any visible signs of this process, except for services that rely on a constant TCP connection: the MQTT Broker and device connections to channels.

Once the upgrade script reaches a server hosting such services, you may notice bursts of device connected/disconnected events in channel logs. The same applies to your MQTT client connections to our MQTT Broker. Upgrading the OS on routers and switching routing systems may also affect established TCP connections.

However, the expected impact is minimal, and both your devices and MQTT clients should be able to reconnect to the flespi platform immediately.
We plan to finish the upgrade process by the end of this week.
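
For client authors, the reconnect behavior mentioned above can be sketched as a retry loop. This is a minimal illustration, not flespi code: `connect` is a hypothetical callable supplied by your own client that raises `OSError` (for example `ConnectionResetError`) on failure, and the delay values are assumptions. Most MQTT client libraries ship their own reconnect logic, which should be preferred where available.

```python
import random
import time

def reconnect(connect, max_attempts=8, base_delay=0.5, max_delay=30.0,
              sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    Returns the number of attempts used; re-raises the last error if every
    attempt fails. The first attempt is made immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with random jitter so
            # that many clients dropped at once do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use, call `reconnect(my_connect)` and leave the defaults.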

Have a wonderful day,
Your flespi team!
#eu: downtime started, error: Failed to receive messages posted by the simulated device with GET /gw/channels/XXX/messages REST API call within 5 seconds. It usually means that flespi storage system is either shutdowned for maintenance or currently operating under high load and some database operations may be delayed.
#eu: downtime ended, period: 384 second(s)
The recent downtime was related to the storage buffer subsystem, which was unable to handle peak load correctly during a high-load operation.

The storage buffer subsystem is used internally by flespi channels when posting or retrieving their messages. As a result, posting messages to some channels and reading them via the API were noticeably delayed for around 6 minutes (we trigger a downtime state when such a delay exceeds 5 seconds).
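
The 5-second rule can be pictured as a simple threshold over measured delays. The sketch below is an assumption about how such a rule could be expressed, not the actual NOC implementation; the function name `downtime_intervals` and the sample data are hypothetical.

```python
THRESHOLD_S = 5.0  # delay above which a downtime state is triggered

def downtime_intervals(samples, threshold=THRESHOLD_S):
    """Return (start, end) spans during which the delay exceeded the threshold.

    samples: (timestamp, delay_seconds) pairs sorted by timestamp. A span ends
    at the first sample that is back under the threshold, or at the last slow
    sample if the data ends while the delay is still too high.
    """
    intervals = []
    start = None
    last_slow = None
    for ts, delay in samples:
        if delay > threshold:
            if start is None:
                start = ts
            last_slow = ts
        elif start is not None:
            intervals.append((start, ts))
            start = None
    if start is not None:
        intervals.append((start, last_slow))
    return intervals

# Hypothetical measurements taken once a minute (timestamp in seconds).
samples = [(0, 0.4), (60, 7.2), (120, 9.0), (180, 0.6), (240, 0.5)]
print(downtime_intervals(samples))
```

With this sample data, the only reported span runs from the first slow measurement at 60 to the recovery at 180.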

You access this buffer with the GET /gw/channels/{ch-id}/messages REST API call, which was the only API call affected by the downtime.

Access to messages posted to devices, the MQTT Broker, and all other components of the flespi platform remained fully operational and unaffected.

We are already analyzing the situation and will introduce a fix shortly.

This downtime was not related to the OS upgrade on our servers, which is running smoothly and which we plan to finish today.

Sorry for any inconvenience!