On Network Fault Finding

<Rant On>

Well, it’s been a rather frustrating 15 hours… most of this could have been avoided with better interaction from the incumbent. While this is a ‘rant’ post, it’s really meant to show that communication needs to improve…

Around 5pm yesterday we experienced a massive network failure. It started off as an intermittent issue: we lost local breakout capacity and had repeated flapping on our BGP sessions. It then escalated to a complete failure of our IPConnect service from Telkom.

As many of you know, Telkom is the sole supplier of capacity on the local loop – if a South African ISP wishes to supply ISP services, it needs to buy such capacity via a messy managed service called IPConnect. Such services are often handed off via metro-Ethernet nodes, as was the case on our network.

[As an aside, I am SO looking forward to Telkom eventually releasing their BitStream product, believed to be called IPStream. This will allow far more efficient use of IP allocations and bring in some new service offerings.]

After fault finding and testing every single system on our network, it came down to an actual Telkom failure. The frustration in this case was that the metro-Ethernet (ME) link appeared functional. In fact, our SAIX local traffic (also handed off on the ME node) was working properly. All our IPConnect (IPC) connectivity, however, was dead. We already had a ticket open with Telkom, and after hours of running around we eventually got onto conference calls to resolve the issue. At midnight it was eventually revealed that they could not directly access their metro node to fault find further!

They had no team available to do a callout, so we dispatched our own engineer to site. Solving the problem was, in the end, rather simple… reboot the ME node!

So, as a post-mortem, where did the problem lie? Granted, it seems to have been hardware related, BUT in my mind the real problem was that we were left playing ticket tag for hours instead of receiving a prompt phone call from the relevant engineers. There should be a greater sense of urgency – ISPs are not like smaller customers – there are normally skilled engineers on hand who won’t waste the time of Telkom engineers 😉

I will be making a concerted effort to get contact numbers of the relevant engineers to try to short-circuit the process in future. I have no problem logging a ticket – but the helldesk drones often simply don’t get the urgency or understand the issue well enough.

</Rant off>
