Strange Routing Problem

SantaClaws

Author

144

May 17, 2004 11:15 AM

I've run into a strange problem with our servers at work, and I can't for the life of me figure out what would cause it. Luckily, I've found a hacky way to work around it, but I'm still left wondering why it happens.

We have two rack-mounted servers behind a hardware load balancer at a remote location... I'll call them WWW1 and WWW2. What happens is that WWW2 will inexplicably lose connectivity to WWW1 at random times. I'll try to ping WWW1 and it spits back a "No route to host" error. It would not fix itself unless I rebooted both servers, at which point it would usually work for 5 or 10 minutes and then die again. This infuriated me to no end, as I use an rsync cron job to update shared data between WWW2 and WWW1, and I was unable to keep WWW2 up to date during this time.

Anyway, here's the really strange part... I found that when this occurred, if I logged into WWW1 and pinged WWW2, the ping would pause for a second and then start getting responses - at which point I could ping WWW1 again from WWW2, and all was back to normal. What the hell?

I fixed it by setting up a cron job on WWW1 that pings WWW2 once a minute, but I don't like that solution and I would rather figure out WHY it's happening. If anyone has run into this before, I'd appreciate some commentary.

[edited by - SantaClaws on May 17, 2004 12:17:28 PM]

Sneftel

1,788

May 17, 2004 12:20 PM

I'm guessing that some piece of equipment (including, possibly, one of the routers themselves) is feeding WWW2 bogus routing info, causing it to forget where WWW1 can be found. When it receives a ping from WWW1, it uses the link that the ping came in on as a new route in its table. Just conjecture, but it makes sense to me.

"Sneftel is correct, if rather vulgar." --Flarelocke

Interim

122

May 17, 2004 01:23 PM

Watch your ARP tables as well. When it doesn't work, see if you can find out if the ARP entry for your servers is correct in the device that can't ping the other one.

If you have access to the switch, check there as well. Sounds like a classic duplicate IP in a switch table. A second device might be claiming the IP address of the other machine, which the switch will use to update its table and send packets the wrong way. It could be a sign of an ARP poisoning attack, but that sounds unlikely.

The ping solution sounds to me like it just keeps the ARP entry from expiring, so you don't get the "bogus" machine IP-ARP resolution.

If you want a more solid solution and you don't control the network, hardcode your ARPs for each box. Just keep this in mind if you ever swap out network cards though.

It may also be a behavior of your load balancer. Any updates to it lately? Or network changes? Load balancers typically work on MAC addresses or IP addresses; perhaps its configuration was altered?

Int.

SantaClaws

Author

144

May 17, 2004 03:33 PM

Thanks for the help. It seems you were right. I disabled the ping and it happened again, and I did a comparison on the ARP. Lo and behold...

Before:

quote:
# /sbin/arp -vn -e
Address                  HWtype  HWaddress           Flags Mask            Iface
1.2.3.4                  ether   00:0D:60:19:AD:AA   C                     eth0

After:

quote:
# /sbin/arp -vn -e
Address                  HWtype  HWaddress           Flags Mask            Iface
1.2.3.4                          (incomplete)                              eth0

I'm not experienced with this tool (I'm a programmer who got roped into doing server administration several years ago and I'm learning as I go), but I did notice that this status matched the results of my manual removal of an address from the table.
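For anyone who wants to spot this state without eyeballing the table, here is a minimal Python sketch that scans arp-style output for entries stuck at (incomplete). The function name and the sample text are made up for illustration (1.2.3.5 is just another placeholder address), and it assumes the column layout matches the listings quoted above.

```python
def find_incomplete(arp_output):
    """Return the IP addresses whose ARP entry is marked (incomplete)."""
    stuck = []
    for line in arp_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Address HWtype ..." header row.
        if not fields or fields[0] == "Address":
            continue
        if "(incomplete)" in fields:
            stuck.append(fields[0])
    return stuck


# Sample text modelled on the two listings above (not live output).
sample = """Address HWtype HWaddress Flags Mask Iface
1.2.3.4 ether 00:0D:60:19:AD:AA C eth0
1.2.3.5 (incomplete) eth0
"""
print(find_incomplete(sample))
```

In practice the text would presumably come from running /sbin/arp -n and capturing its output, which this sketch leaves out.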

Unfortunately I don't have access to the switch or the load balancer. All of the hardware is set up and maintained by the provider, and I just have power cycling and remote access capabilities.

What would cause the address to drop out of the ARP cache and not renew itself? As expected, pinging the server restored it again. I will more than likely just add it to the table manually, but I'd like to point this out to the support staff at the provider.

Thanks again.

Interim

122

May 17, 2004 04:14 PM

Well, incomplete is a vague case. Generally it means your machine sent an ARP request but didn't get a reply (hence incomplete). This usually happens when you try to reach a machine that doesn't exist. (Make a few pings to nonexistent IPs and you'll see incompletes in your ARP table.)

If I had to make a guess about your problem from the little bit of info available, I would suspect the load balancer was reconfigured or updated lately, or has some hardware or software issue. Most load balancers have their own switch ports and essentially front your machines with a single IP address, but allow inter-communication between nodes (and allow your own non-balanced IPs through). Wouldn't surprise me if your bug is there.

How to fix that, who knows.

Check the vendor bug list, maybe put in a ticket. Since a regular ping will fix your issue, it sounds like a bug in your load balancer... or it could be a faulty network (do you have any other issues? Slow connections? Lots of packet errors on your interface?). Still, if the only other thing that fixes it is a reboot, that would indicate it isn't just bad network performance.

Small chance it's a bug in your OS of choice. Maybe take a quick look over the current known bugs for your OS.

You can probably avoid the reboot by just doing an "arp -d 1.2.3.4" (substituting your actual IP for 1.2.3.4, of course) and then making some connection to your other server (ping, web, ssh, whatnot). Still, if it expires again, you're back in the same boat.

I'd set a static ARP entry. How to do this might vary depending on your OS (on Linux it's probably arp -s ip hardware-address). It'll never ARP again for that IP address. Just keep this in mind in case you swap out network cards. You still have to be careful with this: an arp -d -a will clear your static entry until you enter it anew (startup script, cron job, etc.).
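As a rough illustration of the startup-script idea, the Python sketch below builds one static-entry command per peer. It assumes the Linux net-tools syntax (arp -s followed by the IP, then the hardware address); the PEERS table and the function name are hypothetical, and the address pair is just the placeholder entry from the "Before" listing earlier in the thread. The commands are printed rather than run, since setting entries needs root.

```python
import shlex

# Hypothetical peer table: IP -> MAC. The pair below is the placeholder
# entry from the "Before" listing earlier in the thread.
PEERS = {"1.2.3.4": "00:0D:60:19:AD:AA"}


def static_arp_commands(peers, arp_path="/sbin/arp"):
    """Build one `arp -s <ip> <mac>` argv list per peer (net-tools syntax assumed)."""
    return [[arp_path, "-s", ip, mac] for ip, mac in sorted(peers.items())]


for argv in static_arp_commands(PEERS):
    # Printed, not executed: a boot script running as root could instead
    # call subprocess.run(argv, check=True) for each command.
    print(shlex.join(argv))
```

Keeping the table in one place also makes the network card caveat easier to live with: swapping a card means changing a single MAC in PEERS.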

Interim.

C-Junkie

1,099

May 17, 2004 04:30 PM

Do you have static IPs on the WWW servers, or is it DHCP?

SantaClaws

Author

144

May 17, 2004 07:00 PM

Just posting to update with a bit more information...

I did a little more poking around with tcpdump and whatnot, and apparently we're getting a flood of ARP requests hitting our server, for IPs outside our range of control (we do hold a large block of IPs, but not these). Somewhere on the order of ~50 queries per second, such as the following:

quote:
19:47:30.102787 2:e0:52:14:67:6a Broadcast arp 60: arp who-has 64.225.154.138 tell 64.225.152.6

Most of them are from the 02:E0:52:*:*:* range. I looked up 02:E0:52, and it looks like you were right, Interim - that MAC address prefix is associated with Foundry ServerIron load balancers.
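To put numbers on where the flood comes from, something like the following Python sketch could tally the who-has lines by source MAC prefix. It assumes every line has the same shape as the tcpdump line quoted above (timestamp first, source MAC second, with leading zeros dropped as tcpdump printed them); the function names are made up for illustration.

```python
from collections import Counter


def oui(mac):
    """Normalise a MAC such as '2:e0:52:14:67:6a' and return its first three octets."""
    octets = [part.zfill(2).upper() for part in mac.split(":")]
    return ":".join(octets[:3])


def tally_arp_sources(lines):
    """Count 'arp who-has' lines per source MAC prefix."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Expected shape: <time> <src mac> <dst> arp <len>: arp who-has <ip> tell <ip>
        if len(fields) >= 2 and "who-has" in fields:
            counts[oui(fields[1])] += 1
    return counts


# The single captured line quoted above, used as sample input.
captured = [
    "19:47:30.102787 2:e0:52:14:67:6a Broadcast arp 60: arp who-has 64.225.154.138 tell 64.225.152.6",
]
print(tally_arp_sources(captured).most_common())
```

Feeding it a few seconds of saved capture would show whether one prefix really accounts for nearly all of the requests.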

Correct me if I'm wrong, but could this flood of requests be overflowing my ARP table and causing it to drop the entries? It has happened on occasion with our other database server's ARP entry (on the same VLAN), too.

I'm going to open up a trouble ticket now that I have more of an idea of what's going on. Thanks for the information.

[edited by - SantaClaws on May 17, 2004 8:01:36 PM]

C-Junkie

1,099

May 17, 2004 07:06 PM

Nah. I get 70/sec on my cable modem.

Interim

122

May 17, 2004 07:56 PM

Hrms.

My instinct from what you said is to stick with the Load Balancer. My thoughts are that someone misconfigured it, and now you're getting network bleed from another network. If it's your Load Balancer, then you might want to find out why, but I'm guessing it's your ISP's?

Might be as simple as someone changing your VLAN port membership (would explain the sudden heavy ARP traffic).

Or the Load Balancer might be hosting multiple logical networks and sites, and is now forwarding unknown ARPs out all ports. That would explain the heavy ARP traffic from outside your network.

This is definitely an error, I believe. If it isn't, it's sloppy network design.

I've never seen an ARP overflow, and I've run Linux and BSD boxes on larger networks. However, I must confess that I never had large broadcast domains (which it sounds like you might have). Still, I wouldn't expect you to have an incomplete entry, but to have a full table, and you should see errors in your logs. I think Linux has a "neighbor table overflow" or something similar.
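One way to test the overflow theory on a Linux box is to compare the number of cached entries against the kernel's neighbour table limits. The Python sketch below assumes the usual /proc locations (gc_thresh1 through gc_thresh3 under /proc/sys/net/ipv4/neigh/default, and the table itself in /proc/net/arp); whether those files are present and readable depends on the kernel and environment, so it falls back to a message if they are not. The function names are hypothetical.

```python
from pathlib import Path

NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")  # assumed Linux location
ARP_TABLE = Path("/proc/net/arp")                     # assumed Linux location


def read_thresholds(neigh_dir=NEIGH_DIR):
    """Read gc_thresh1..3, the kernel's soft and hard limits on cached neighbours."""
    return [int((neigh_dir / f"gc_thresh{n}").read_text()) for n in (1, 2, 3)]


def count_entries(arp_table=ARP_TABLE):
    """Count rows in the ARP table file, ignoring the header line."""
    rows = arp_table.read_text().splitlines()[1:]
    return sum(1 for row in rows if row.strip())


def report():
    """Summarise table size against the limits, or explain why that failed."""
    try:
        thresholds = read_thresholds()
        entries = count_entries()
    except (OSError, ValueError):
        return "could not read the neighbour table settings on this machine"
    verdict = "at the hard limit" if entries >= thresholds[2] else "below the hard limit"
    return f"{entries} entries, gc_thresh1/2/3 = {thresholds}: {verdict}"


print(report())
```

A table that sits far below gc_thresh3 while entries still go incomplete would point away from an overflow and back toward the network.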

Still, you might want to look up some information on how your particular *NIX of choice handles that case, and see if any of the symptoms are in play.

I'm guessing this is an LB that doesn't use alias addresses on your server? Just re-wires the address on the fly? Could be that after a period of inactivity (which is why the ping keeps it alive) the LB drops the packets, thinking they're for or from a different network. I think there's an RFC about dropping ARPs where the reply is from a different network. It could be that a reboot works since the LB sees the interface come up and bypasses this algorithm. It might also explain why you see ARPs from outside your logical range.

I guess one way to try this would be to wait for a period of inactivity (you get your incomplete to your gateway or the other www server), delete the incomplete ARP entry for that address, then try to ping. Get a tcpdump. I'm guessing you won't get a reply and you won't see an ARP reply (get a new incomplete). Then bring down and bring up your interface, see if that fixes it (might still need a reboot).

Very interesting, goofy bug I think. I'm still leaning toward your provider having misconfigured something, but again, I only have the info you provided and am not there myself to poke around. =) But the last bit of info is definitely a good symptom of the issue, I think.

Int

SantaClaws

Author

144

May 17, 2004 09:18 PM

I'm thinking it's the provider, too, mostly because I didn't touch ANYTHING when this started occurring.

I was thinking the same thing about the overflow, and expected to see more data in the table if that was the case. I'm running RedHat 9, by the way. The only reason I thought it might be an overflow was the amount of incoming requests and the fact that I found another post by someone on RedHat that had the same problem, and it was indeed an overflow.

What exactly do you mean by alias address? The way they set it up was with another IP address, which we pointed all of our domains to, and it redirects as needed between the two web servers.

The thing that makes me think it's a routing problem on the load balancer is the fact that the source MAC address of the flood belongs to the load balancer, and the fact that none of the other machines on the VLAN are getting any such traffic.

C-Junkie: Hmm... maybe so, but the other servers aren't getting the same traffic, so I don't think it's a normal situation for these.
