Degraded network, causing calling outage

Incident Report for SIPcity

Postmortem

This network upgrade completes the final part of a DevOps project shifting all our services into highly available (HA) containers, including distributed file systems for storage.

Specifically, the 9 October migration replaced our legacy Asterisk network with OpenSIPs for signalling, load balancing, and session border control. The Australian migration was our second upgrade within the last two months; our US subsidiaries 2talk and Vaitel completed the same migration in early September.

While Asterisk remains part of the solution mix, SIP Peering, Registration, Presence (BLF), certificate management, security, and a new push notification service are now managed by OpenSIPs, with media separated onto specialist RTP proxies.

With the retirement of the legacy Asterisk platform, all services now reside on HA containers, which also removes the complexity of managing separate networks, one of which was end-of-life. Unfortunately, a malformed certificate impacted the migration, causing instability within the OpenSIPs stack, specifically the OpenSSL module. The fault in the certificate blf.sipcity.com.au, which accounts for less than 1% of registered users, didn’t present until Monday morning load. While we narrowed the cause to the TCP stack, even with support from the OpenSIPs owner it took another eight hours to isolate it to this individual certificate.
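Isolating one faulty certificate among many can be approached as a bisection search: load half of the candidates, check whether the stack stays stable, and repeat on whichever half contains the fault. The sketch below is illustrative only and is not a record of the procedure followed during this incident. The function name `isolate_faulty` and the `is_stable` callback are hypothetical, and the sketch assumes exactly one faulty certificate and a reliable stability check.

```python
from typing import Callable, List, Sequence


def isolate_faulty(
    items: Sequence[str],
    is_stable: Callable[[Sequence[str]], bool],
) -> str:
    """Return the single item whose presence makes is_stable() return False.

    Assumes exactly one faulty item. is_stable is a hypothetical callback
    that loads the given subset (for example, a set of certificates) into a
    test instance and reports whether it remained stable.
    """
    candidates: List[str] = list(items)
    if is_stable(candidates):
        raise ValueError("no faulty item in the given set")

    # Halve the candidate set until one item remains.
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # If the first half is stable, the fault must be in the second half.
        candidates = second if is_stable(first) else first
    return candidates[0]
```

With n certificates this needs roughly log2(n) stability checks rather than n, which matters when each check means reloading a production-like stack under load.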

As to why we didn’t abort, I balanced several competing tensions at the time. First, this was our second OpenSIPs go-live within a month, the culmination of many thousands of planning hours. With four full-time engineers and direct consultancy from OpenSIPs, we understood our new platform within the parameters of our own environment. Secondly, the severity of the failure made little sense to our team or even to the Project Developer, who has numerous Tier 1 carriers also using the same long-term stable version of OpenSIPs.

Following the certificate resolution, we encountered issues arising from differences between our US and Australian operations. A group of PBX machines wasn’t adhering to SIP RFC 3261, particularly in 401 authentication and E.164 number presentation. Finally, a difference in call flow priority between the US and Australia caused some initial forwarding issues.
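For context on the two compliance points above: a SIP 401 challenge is answered with an HTTP Digest response, and E.164 presentation means a leading "+", the country code, and at most 15 digits. The sketch below shows both under stated assumptions. The function names are hypothetical, the digest form is the basic MD5 variant without `qop`, and the default country code 61 and trunk prefix 0 are assumptions suited to Australian numbers rather than details taken from the platform configuration.

```python
import hashlib
import re


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str) -> str:
    """Digest response for a 401 challenge (MD5, no qop)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")
    ha2 = _md5_hex(f"{method}:{uri}")
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


def to_e164(number: str, country_code: str = "61",
            trunk_prefix: str = "0") -> str:
    """Normalise a dialled number to E.164, e.g. '+61...'.

    Accepts numbers already prefixed with '+' or national numbers that
    start with the trunk prefix; anything else is rejected as ambiguous.
    """
    stripped = re.sub(r"[\s\-().]", "", number)
    if stripped.startswith("+"):
        digits = stripped[1:]
    elif stripped.startswith(trunk_prefix):
        digits = country_code + stripped[len(trunk_prefix):]
    else:
        raise ValueError(f"cannot determine country code for {number!r}")
    # E.164 allows at most 15 digits and no leading zero after the '+'.
    if not digits.isdigit() or len(digits) > 15 or digits[0] == "0":
        raise ValueError(f"not a valid E.164 number: {number!r}")
    return "+" + digits
```

A device that sends a national-format number where the far end expects E.164, or that does not retry a REGISTER or INVITE with credentials after a 401, will fail in ways that resemble a platform fault even though the platform is behaving correctly.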

As I write this RFO, I can report that the global issues are resolved, although we are seeing some legacy issues which our team continues to diagnose.

Ultimately, my decision to push through with the upgrade balanced the investment made within our applications against the reality that aborting would not have resolved the actual SSL certificate issue. As the system has stabilised, the release of our wider communications solutions is only now starting to diminish the memory and shock of this seriously compromised go-live.

Yours sincerely,

Mike Johnstone
CEO | SIPcity

Posted Oct 29, 2021 - 14:12 AEDT

Resolved

This incident has been resolved.

Posted Oct 10, 2021 - 12:49 AEDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 09, 2021 - 12:58 AEDT

Identified

We have registrations and calls back up.
We are still having issues with features such as queues, but these will be resolved shortly.

Posted Oct 09, 2021 - 11:11 AEDT

Update

We are continuing to implement a solution to resolve the issue.
At this time we have disabled access to Arena while we resolve the issue.

Posted Oct 09, 2021 - 09:26 AEDT

Investigating

As part of this morning's migration, we identified an issue within the TCP stack, which we are working to resolve.

Posted Oct 09, 2021 - 08:22 AEDT

This incident affected: SIPcity Calling Platform (Network) and SIPcity Cloud PBX.