The Structured Approach to Network Troubleshooting

Have you ever been on the receiving end of a call informing you that there was an “impacting event”, “service-affecting outage”, or “mission-critical failure”? If so, it probably came at the least opportune time: sitting down at a restaurant, just as you’re getting off work, or even in the middle of the night. A haze of panic sets in as everything you thought you would be doing gets pushed aside for the emergency at hand. When the fog clears, though, taking a logical approach can work wonders in getting the problem identified and rectified quickly.

This is where the OSI model comes into play. The Open Systems Interconnection model logically separates network communication into seven layers that build upon one another.

The Open Systems Interconnection (OSI) model

Users and consumers are keenly aware of the Application layer at the top. It is possible to troubleshoot from the top down until you locate the issue, but validating from the bottom up is more straightforward in most cases. We’re going to focus on the Physical layer because, when no deliberate changes have been made, it is the layer most subject to outside influences.

What to Do When There Are Errors

When a link drops or is taking errors, the transceivers and platforms involved are your first hints in figuring out what’s gone wrong. Reviewing your logs to see what was going on at the time of the incident is a great first step. Most transceivers support what is called Digital Diagnostics Monitoring (DDM), which lets you inspect the device under investigation. DDM can report the transmit power, the power the receiver is seeing, and other potentially helpful information such as temperature and supply voltage. The values the transceiver reads are not always a perfect reference, but they can give you an idea of what’s going on. If the transceiver indicates a warning or an alarm, that’s a smoking gun right there.
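
As a concrete illustration, DDM power readings are often reported in milliwatts while thresholds are discussed in dBm. Here is a minimal Python sketch of that conversion and check; the readings and thresholds are made-up examples, since real values come from your transceiver and platform:

```python
import math

# Hypothetical DDM readings pulled from a transceiver; the numbers here
# are made up for illustration -- real values come from your platform's
# optics/interface commands or monitoring system.
ddm = {
    "tx_power_mw": 0.56,   # transmit power in milliwatts
    "rx_power_mw": 0.012,  # receive power in milliwatts
    "temperature_c": 41.3,
}

# Example warning/alarm thresholds (assumed); real thresholds are
# programmed into the transceiver and vary by optic type.
RX_LOW_WARNING_DBM = -14.0
RX_LOW_ALARM_DBM = -18.0

def mw_to_dbm(milliwatts: float) -> float:
    """Convert optical power from milliwatts to dBm."""
    return 10 * math.log10(milliwatts)

tx_dbm = mw_to_dbm(ddm["tx_power_mw"])
rx_dbm = mw_to_dbm(ddm["rx_power_mw"])
print(f"Tx: {tx_dbm:.2f} dBm, Rx: {rx_dbm:.2f} dBm")

if rx_dbm <= RX_LOW_ALARM_DBM:
    print("Rx power is below the alarm threshold -- there's your smoking gun.")
elif rx_dbm <= RX_LOW_WARNING_DBM:
    print("Rx power is below the warning threshold -- worth a closer look.")
```

A reading that trips a warning or alarm here is exactly the kind of smoking gun mentioned above.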

Now that you have the transceiver’s opinion on the values, it’s time to verify! Is the transceiver actually producing as strong a signal as it believes? Do the values for Tx and Rx line up with its internal accounting? Different light meters are used for different transceiver types, but DWDM, CWDM, and Graywave all benefit from employing the right meter to measure their Tx and Rx signals accurately. If the transceiver thinks there’s no light and the meter thinks there is no light, it could well be that there’s no light. Sometimes, though, the transceiver sees a problem with the incoming signal, or thinks its transmit is stronger than the meter reads it. And what does that mean?
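
When the two disagree, it helps to quantify by how much. Below is a minimal sketch of that comparison; the tolerance and the sample readings are assumptions, not vendor specifications:

```python
# A minimal sketch comparing the transceiver's DDM readings against an
# external light meter. The tolerance and sample readings are assumptions.
TOLERANCE_DB = 1.5  # assumed acceptable disagreement between DDM and meter

def compare(label: str, ddm_dbm: float, meter_dbm: float) -> None:
    delta = ddm_dbm - meter_dbm
    if abs(delta) <= TOLERANCE_DB:
        print(f"{label}: DDM {ddm_dbm:.2f} dBm vs meter {meter_dbm:.2f} dBm -- agree")
    else:
        print(f"{label}: DDM {ddm_dbm:.2f} dBm vs meter {meter_dbm:.2f} dBm -- "
              f"off by {delta:+.2f} dB; suspect the optic or a dirty connection")

compare("Tx", ddm_dbm=-2.5, meter_dbm=-2.9)    # within tolerance
compare("Rx", ddm_dbm=-19.2, meter_dbm=-12.4)  # transceiver sees much less than the meter
```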

When issues arise, especially when the meter and the transceiver seem to disagree on levels, a common culprit is surface contamination. Even on optics that have been functioning for a long time, a fleck outside of the operational area can migrate and spread due to ambient vibration until (at the worst possible time) it becomes an issue. It is best practice to clean all mating surfaces any time connections are made or examined. You would be surprised at how little it takes to contaminate an endface.

Using a fiber scope, you can put any contamination doubts to rest by directly inspecting fibers and optics. Good cleaning practice is important, but some stains are stubborn. And once a dirty fiber is inserted into an optic, you have to assume that you now have a dirty fiber AND a dirty optic. Fiber scopes like the Integra SmartProbe Wireless 2 can even automatically qualify that the endface has been cleaned appropriately.

In the event that the meter and the transceiver agree that levels are not where they are supposed to be, and you know that dirt is not the obvious cause, it’s time to validate the fiber itself. On duplex transceivers, proper function requires the integrity of both fibers from the A end to the Z end. Before checking the full transit, start by validating your fiber locally. A very neat tool for doing that is the Visual Fault Locator (VFL). The VFL shoots a visible light down the fiber that is so bright it causes any breaks or bend-damaged areas to glow, even through the fiber’s jacket!

It should be noted that even in a protected environment, fiber is vulnerable to drooping over time and slowly adding attenuation to a circuit. Or someone might have pulled or leaned on it. These things do happen. In these cases, having a good assortment of replacement fibers, either carried by each technician or stored on-site in a fiber library, is crucial. Running a temporary jumper that is the wrong length for a tidy installation just turns one outage into several, either from a second fiber issue or when it comes time to replace the temporary run with a permanent one.
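
When the levels are low but not completely dark, a quick sanity check is to compare the measured end-to-end loss against a rough loss budget for the span. The sketch below uses common ballpark figures for fiber, connector, and splice loss; substitute the specifications for your own plant:

```python
# Rough loss-budget sanity check for a fiber span. The loss figures
# below are ballpark assumptions, not measured values -- use your own
# fiber and connector specifications.
FIBER_LOSS_DB_PER_KM = 0.35   # typical single-mode loss at 1310 nm
CONNECTOR_LOSS_DB = 0.5       # assumed loss per mated connector pair
SPLICE_LOSS_DB = 0.1          # assumed loss per fusion splice

def expected_span_loss(km: float, connectors: int, splices: int) -> float:
    """Estimate the expected loss (dB) across a span."""
    return (km * FIBER_LOSS_DB_PER_KM
            + connectors * CONNECTOR_LOSS_DB
            + splices * SPLICE_LOSS_DB)

measured_loss_db = 9.8               # Tx at the A end minus Rx at the Z end
budget = expected_span_loss(km=12, connectors=4, splices=2)

print(f"Expected roughly {budget:.1f} dB, measured {measured_loss_db:.1f} dB")
if measured_loss_db > budget + 3.0:  # assumed acceptable margin
    print("Loss is well above budget -- suspect a bad connector, bend, or break.")
```

If the measured loss blows past the budget by several dB, the span has a problem worth hunting down rather than a transceiver quirk.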

When everything seems good on-site but you know there’s a signal issue, it’s time to break out the big guns: the Optical Time Domain Reflectometer (OTDR). It’s got an imposing name, but modern OTDRs are easy to use. Basically, they shoot a laser down a fiber and record the reflections and their characteristics. There are two things to be aware of when using an OTDR. First, it has to “shoot” on a fiber that doesn’t have a signal on it, which means you have to either remotely shut down or disconnect the far side. Second, the start of a fiber shot has a blind area (a dead zone) where accuracy is sacrificed. That is where a “launch reel” comes in: a small box or pouch with a coil of fiber that you put in-line with the fiber you’re testing. Once you run the test, it shows you a lot of information about your fiber span in an easy-to-read graph: the attenuation on the span, splices, and span distance. In the real world, you won’t always have historical OTDR reports to compare against the current reading during an outage, but fiber damage on a path is often not subtle at all.
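
For intuition on what the OTDR is actually doing, the distance to an event falls out of the round-trip time of the reflected pulse and the fiber’s group index of refraction. A minimal sketch, using a typical single-mode index value as an assumption:

```python
# How an OTDR turns reflection timing into distance: light travels down
# the fiber and back, so the one-way distance to an event is half the
# round trip. The group index here (~1.468) is a typical single-mode
# value; your fiber's spec sheet may differ slightly.
C_M_PER_S = 299_792_458   # speed of light in a vacuum
GROUP_INDEX = 1.468       # assumed group index of refraction for the fiber

def event_distance_m(round_trip_seconds: float) -> float:
    """One-way distance to a reflective event from its round-trip time."""
    return (C_M_PER_S / GROUP_INDEX) * round_trip_seconds / 2

# e.g., a reflection arriving 50 microseconds after the pulse was launched
print(f"Event at roughly {event_distance_m(50e-6) / 1000:.2f} km")
```

The OTDR does this math for you and plots the result as the familiar trace, but it’s useful to know where the numbers on the graph come from.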

Breaking the diagnosis into verifiable pieces effectively narrows down the cause of the issue. Confronting an abstract problem can feel overwhelming at first, but by using the right tools and maintaining a structured approach, even the weirdest network events yield to logic and reason.

May your network be trouble-free and your on-call periods be quiet! Reach out to your Integra sales or engineering team for additional support or to discuss how we can help troubleshoot your network with the right approach and tools!