In a recent series of events at my workplace, we discovered we were having problems receiving server-initiated traffic from one of our connectivity partners. This was quite unusual, since all of our other connections were stable despite the huge traffic we receive and generate from time to time. Digging further, we found that volume wasn't the main contributor to the problem: our PDU parser, written in Erlang, was failing to recognize packets that had been fragmented on their way through the network.
The issue was fairly trivial in nature, and could've been prevented with certain measures, but that's not what this post is about.
Verifying the Fix
Testing the fix was rather tricky. A younger version of myself would have tested it in production, where the problem can easily be replicated. But thanks to a handful of experiences (failures included), that is out of the question now.
One of my weird little joys while debugging live (network-oriented) production issues is to look at the network layer and watch the packets come in. This of course adds load to the system and puts the interface in promiscuous mode, so it's not something that should be left on permanently.
My favorite tool for this task is ngrep, which works like a normal grep on the network layer and can save the captured packets in PCAP format. Wireshark then helps me navigate and play with the PCAP file and see the problem in a beautifully parsed presentation.
Quick Tip: You can turn off TCP re-assembly in Wireshark to see fragmented packets individually. Preferences > Protocols > [Protocol]
Re-using Production Traffic
When we evaluated how to validate the fix, the general idea we settled on was to reproduce the offending packets. The packet capture step was very helpful here, since it gave us the packets in PCAP format, but we needed a way to re-send them to the application without connecting to the actual server in production. Developing or extending a simulator (we do have one) was one option, but given the urgency of the problem, a much quicker solution was preferable.
Without the need to re-implement the protocol on the server side, the most promising way is to read the PCAP file and send it over the wire (to whoever is connected to the dummy socket). With Go, this has been very easy with the help of
Loading PCAP File
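To give a rough idea of what loading a capture involves, here is a minimal sketch that uses only the Go standard library. It is not the code from the tool described in this post, and it assumes the simplest case: a classic (non-pcapng) file that is little-endian with microsecond timestamps. The names parsePCAP, Packet and samplePCAP are made up for this illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// Packet is one captured frame: its capture timestamp and raw link-layer bytes.
type Packet struct {
	Time time.Time
	Data []byte
}

// parsePCAP decodes a classic little-endian, microsecond-resolution pcap file
// held in memory. Other variants (big-endian, nanosecond, pcapng) are rejected.
func parsePCAP(b []byte) ([]Packet, error) {
	if len(b) < 24 {
		return nil, errors.New("pcap: short global header")
	}
	if binary.LittleEndian.Uint32(b[0:4]) != 0xa1b2c3d4 {
		return nil, errors.New("pcap: unsupported magic number")
	}
	var pkts []Packet
	for off := 24; off < len(b); {
		if len(b)-off < 16 {
			return nil, errors.New("pcap: truncated record header")
		}
		sec := binary.LittleEndian.Uint32(b[off:])
		usec := binary.LittleEndian.Uint32(b[off+4:])
		incl := int(binary.LittleEndian.Uint32(b[off+8:]))
		off += 16
		if incl > len(b)-off {
			return nil, errors.New("pcap: truncated packet data")
		}
		pkts = append(pkts, Packet{
			Time: time.Unix(int64(sec), int64(usec)*1000),
			Data: b[off : off+incl],
		})
		off += incl
	}
	return pkts, nil
}

// samplePCAP builds a tiny two-packet capture so the sketch runs without a file.
// The "packets" are placeholder bytes, not real Ethernet frames.
func samplePCAP() []byte {
	le := binary.LittleEndian
	b := make([]byte, 24)
	le.PutUint32(b[0:], 0xa1b2c3d4) // magic number
	le.PutUint16(b[4:], 2)          // version major
	le.PutUint16(b[6:], 4)          // version minor
	le.PutUint32(b[16:], 65535)     // snaplen
	le.PutUint32(b[20:], 1)         // link type: Ethernet
	for i, payload := range [][]byte{[]byte("first"), []byte("second")} {
		rec := make([]byte, 16)
		le.PutUint32(rec[0:], 1700000000)            // ts_sec
		le.PutUint32(rec[4:], uint32(i)*250000)      // ts_usec
		le.PutUint32(rec[8:], uint32(len(payload)))  // incl_len
		le.PutUint32(rec[12:], uint32(len(payload))) // orig_len
		b = append(b, rec...)
		b = append(b, payload...)
	}
	return b
}

func main() {
	pkts, err := parsePCAP(samplePCAP())
	if err != nil {
		panic(err)
	}
	for i, p := range pkts {
		fmt.Printf("packet %d: %d bytes, +%v\n", i, len(p.Data), p.Time.Sub(pkts[0].Time))
	}
}
```

Keeping the timestamps around is deliberate: the gap between two records is what a replay would need if it wanted to imitate the original pacing.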
BPF is very handy for filtering network traffic. It is also used in ngrep and other packet capture tools. PCAP files used in debugging usually have broad capture filters, as the other captured information can be as useful as the main target protocol. When replaying, that extra information has little use and is just noise. The same can be said about the unneeded side of bi-directional traffic.
More information about the filtering syntax can be found at this link: http://biot.com/capstats/bpf.html
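To make the "one side of the traffic" idea concrete, here is a hand-rolled sketch of what a filter along the lines of "tcp and src port N" selects. It is not BPF and not how the tool does it; it only walks the headers with the standard library, and it assumes plain Ethernet frames carrying unfragmented IPv4 with no VLAN tags. The function names and the port 2775 are placeholders for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// tcpPayloadFrom returns the TCP payload of an Ethernet/IPv4 frame when the
// segment's source port equals srcPort. ok is false for anything else.
func tcpPayloadFrom(frame []byte, srcPort uint16) (payload []byte, ok bool) {
	const ethLen = 14
	if len(frame) < ethLen+20 {
		return nil, false
	}
	if binary.BigEndian.Uint16(frame[12:14]) != 0x0800 { // EtherType: IPv4
		return nil, false
	}
	ip := frame[ethLen:]
	if ip[0]>>4 != 4 {
		return nil, false
	}
	ihl := int(ip[0]&0x0f) * 4                      // IPv4 header length in bytes
	if ihl < 20 || ip[9] != 6 || len(ip) < ihl+20 { // protocol 6 is TCP
		return nil, false
	}
	if binary.BigEndian.Uint16(ip[6:8])&0x1fff != 0 { // non-first IP fragment
		return nil, false
	}
	total := int(binary.BigEndian.Uint16(ip[2:4])) // excludes Ethernet padding
	if total < ihl+20 || total > len(ip) {
		return nil, false
	}
	tcp := ip[ihl:total]
	if binary.BigEndian.Uint16(tcp[0:2]) != srcPort {
		return nil, false
	}
	dataOff := int(tcp[12]>>4) * 4 // TCP header length in bytes
	if dataOff < 20 || dataOff > len(tcp) {
		return nil, false
	}
	return tcp[dataOff:], true
}

// sampleFrame builds a minimal Ethernet/IPv4/TCP frame for the demo.
// Addresses and checksums are left as zero because the filter ignores them.
func sampleFrame(srcPort, dstPort uint16, payload string) []byte {
	f := make([]byte, 14+20+20, 14+20+20+len(payload))
	binary.BigEndian.PutUint16(f[12:], 0x0800)
	ip := f[14:]
	ip[0] = 0x45 // version 4, header length 5 words
	binary.BigEndian.PutUint16(ip[2:], uint16(20+20+len(payload)))
	ip[9] = 6 // TCP
	tcp := ip[20:]
	binary.BigEndian.PutUint16(tcp[0:], srcPort)
	binary.BigEndian.PutUint16(tcp[2:], dstPort)
	tcp[12] = 5 << 4 // data offset: 5 words, no options
	return append(f, payload...)
}

func main() {
	frames := [][]byte{
		sampleFrame(2775, 40000, "server-initiated PDU"),
		sampleFrame(40000, 2775, "client request"),
	}
	for _, f := range frames {
		if p, ok := tcpPayloadFrom(f, 2775); ok {
			fmt.Printf("keep %q\n", p)
		}
	}
}
```

Only the frame sent from the server's port survives, which is the side a replay towards the client application would care about.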
The delay is very important if you need to recreate the fragmentation. Even if the original packets (layer 3) are fragmented, since the layer 4 responsibility was delegated to net/http, no matter how many individual conn.Write() calls we make, the data will be treated as a stream. When the application receives that stream in very quick succession, its underlying TCP/IP stack may hand it the data already re-assembled.
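The effect of the delay can be tried out with a small loopback experiment. This is a sketch using only the standard library net package, not the tool's actual code; replay and demo are hypothetical names, and the 20 ms pause is an arbitrary value. How the chunks are split across reads depends on timing and on the TCP stack, so the sketch prints what each Read returned rather than promising a particular result.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// replay writes each chunk to conn, sleeping for delay between writes so the
// chunks are more likely to reach the peer as separate reads.
func replay(conn net.Conn, chunks [][]byte, delay time.Duration) error {
	for i, c := range chunks {
		if i > 0 {
			time.Sleep(delay)
		}
		if _, err := conn.Write(c); err != nil {
			return err
		}
	}
	return nil
}

// demo serves chunks over a loopback socket and returns what each Read saw.
func demo(chunks []string, delayMillis int) ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	errc := make(chan error, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			errc <- err
			return
		}
		defer conn.Close() // closing signals EOF to the reader below
		raw := make([][]byte, len(chunks))
		for i, c := range chunks {
			raw[i] = []byte(c)
		}
		errc <- replay(conn, raw, time.Duration(delayMillis)*time.Millisecond)
	}()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer client.Close()

	var reads []string
	buf := make([]byte, 4096)
	for {
		n, err := client.Read(buf)
		if n > 0 {
			reads = append(reads, string(buf[:n]))
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return reads, err
		}
	}
	return reads, <-errc
}

func main() {
	for _, ms := range []int{0, 20} {
		reads, err := demo([]string{"PDU-part-1|", "PDU-part-2|", "PDU-part-3"}, ms)
		if err != nil {
			panic(err)
		}
		fmt.Printf("delay %2dms: %d read(s): %q\n", ms, len(reads), reads)
	}
}
```

Whatever the number of reads, the bytes arrive in order and complete, which is why a parser on the receiving end has to cope with a PDU split at any point.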
The tool is freely available under the MIT license in this repository: https://github.com/ruel/pcaplay