Incident Details
Resolved [05/09/2023 13:46]
As previously reported, the ultimate cause of the outage was a crash of the active switch in a virtual switch chassis at our Telehouse North PoP following the replacement of the failed standby switch. This is a procedure we have carried out many times in the past; it has always been a hitless operation and is documented as such. Post-mortem analysis involving the vendor's TAC concluded that the supervisor on the active switch must have entered a partially failed state when it switched over from standby to active at the time of the original switch failure the previous week. Had this been visible to us in any way, we would have scheduled the replacement work in an out-of-hours maintenance window. In light of this incident we will, of course, carry out replacements of this nature out of hours should we see any further switch failures in these systems.
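For illustration only, here is a minimal sketch of the kind of pre-change health check that could surface a degraded supervisor before a hot swap. It assumes the Netmiko Python library and a Cisco IOS-based device; the hostname, credentials and the exact "show redundancy states" output are assumptions and vary by platform:

    # Hypothetical pre-maintenance check (sketch only): confirm the peer
    # supervisor reports STANDBY HOT before any hardware is replaced.
    from netmiko import ConnectHandler

    device = {
        "device_type": "cisco_ios",
        "host": "thn-core-1.example.net",  # hypothetical hostname
        "username": "noc",                 # hypothetical credentials
        "password": "********",
    }

    with ConnectHandler(**device) as conn:
        output = conn.send_command("show redundancy states")
        # A healthy dual-supervisor system normally shows the peer in
        # "STANDBY HOT"; anything else suggests a degraded state and the
        # work should be deferred to an out-of-hours window.
        if "STANDBY HOT" in output:
            print("Redundancy state looks healthy; safe to proceed.")
        else:
            print("WARNING: peer supervisor is not STANDBY HOT; defer work.")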
This particular switch chassis had an uptime of just over six and a half years prior to last week's outage. Despite this solid stability, we are now planning to move away from these virtual switch systems as part of our planned network upgrades. This will see our network transition to a more modern and efficient spine-leaf architecture, in which the failure of a single device has limited to no impact on service. These upgrades represent a significant investment and will be rolled out to all PoPs within the next 1-2 years.
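As a rough illustration of the resilience argument (the figures below are assumptions for the example, not our design targets): in a spine-leaf fabric, traffic is load-balanced across all spine switches, so losing a single spine removes only a fraction of fabric capacity rather than isolating the PoP.

    # Illustrative arithmetic only: remaining fabric capacity after a
    # single spine failure in an ECMP spine-leaf design with n spines.
    def capacity_after_spine_failure(n_spines: int) -> float:
        """Fraction of fabric capacity left when one spine fails."""
        return (n_spines - 1) / n_spines

    for n in (2, 4, 8):  # hypothetical spine counts
        remaining = capacity_after_spine_failure(n)
        print(f"{n} spines: {remaining:.0%} of capacity remains")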
All maintenance work at our THN PoP is now complete and the platform has returned to its previous stability. Please accept our apologies again for the downtime experienced.
Update [30/08/2023 16:08]
Apologies for the disruption experienced this afternoon. What should have been a straightforward replacement of failed hardware has not gone to plan. A series of unexpected issues hampered our NOC, and this caused knock-on, service-affecting problems. We are now taking these findings to Cisco TAC for review before any further work takes place.
We expect all services to remain stable.
Further updates on any planned works will be shared in due course.
Update [30/08/2023 14:31]
We are aware of continued disruption at Telehouse North affecting some leased line and broadband connections. We are abandoning further work today in order to restore stability. Apologies for this continued disruption.
Update [30/08/2023 13:20]
Services at Telehouse North have now returned. We expect service to remain stable.
While the new replacement switch was being brought into service as a standby, the primary device went into a panic state and rebooted. The reboot took longer than it should have because the switch automatically upgraded its software at the same time.
Apologies, this was unexpected.
Update [30/08/2023 12:51]
There is an issue impacting Telehouse North.
Broadband connections which are offline can reconnect with a reboot.
Impacted leased lines will remain down until the issue is resolved.
We apologise for the unexpected disruption this will cause. We are doing everything we can to bring service back as quickly as possible.
Update [30/08/2023 00:15]
Further replacement parts have now been received and engineers will continue working at Telehouse North to install these (Wednesday 30th August). There should be no disruption to service whilst this is taking place.
Update [28/08/2023 22:19]
Replacement parts have now been received and engineers will be working at Telehouse North to install these tomorrow (Tuesday 29th August). There should be no disruption to service whilst this is taking place.
Update [24/08/2023 18:26]
The final NNI is now back in operation and all impacted circuits should be operational. Apologies for the disruption this will have caused this afternoon.
We are working with Cisco to get replacement hardware ASAP. Further updates to follow.
Update [24/08/2023 17:28]
We are aware of one BT Wholesale NNI which remains down. NOC are working to restore service ASAP. If you have a BT Wholesale connection which remains down, you are likely impacted by this and there is no need to raise a fault at this stage. Apologies again for the disruption caused.
Update [24/08/2023 17:06]
All affected NNIs and associated circuits should now be restored.
There should be no need to action any changes on-site; connections should simply restore. If you continue to see disruption, please raise individual faults against the circuits in question.
Apologies for the disruption caused.
Identified [24/08/2023 16:45]
The cause of the issue is hardware failure. We have on-site hands moving the affected NNIs to another switch and we hope to get all circuits operational ASAP.
Investigating [24/08/2023 15:45]
This is confirmed as impacting all carriers, not just BT Wholesale. Colt, Sky and TalkTalk are also impacted. The cause appears to be linked to our switches. NOC are investigating and we hope to have an update to share shortly. Apologies for the disruption this will cause.
Investigating [24/08/2023 15:26]
We are investigating an issue impacting our BT Wholesale connectivity into Telehouse North. This will be impacting leased lines and may have had a temporary impact on broadband. As soon as we know more we will update this feed.