Re: my multipath routing questions... SOLVED!



On Thu, Dec 08, 2005 at 02:14:45PM -0700, andrew fresh wrote:
> On Fri, Dec 02, 2005 at 04:08:13PM -0700, andrew fresh wrote:
> > I am getting 3 different DDB's.  Mostly "kernel: page fault trap,
> > code=0" and "Panic: rtfree 2".  I have also gotten some "Panic: sbdrop",
> > but not since I got the serial console attached.  When I got the sbdrop,
> > trace showed calls to pf_* but I did not write it down as I thought I
> > would see it again with the serial console.
> > 
> > It seems to DDB anywhere from 5 minutes to 90 minutes after a reboot.
> > Once I got 6.5 hours, but mostly closer to 10 minutes.  The only thing
> > that seems to make a difference is disabling pf, I am up 17.5 hours now
> > with pf disabled.
> > 
> > DMESG and the trace/ps from the DDBs are below.
> 
> They are actually available in the archives so as not to waste
> bandwidth.
> http://marc.theaimsgroup.com/?l=openbsd-misc&m=113356535818065&w=2

The whole thread is here:
http://marc.theaimsgroup.com/?t=113333257900001&r=1&w=2

> > > > or something with 'route-to' in pf?
> 
> It appears that it is the route-to that is causing it to crash.  

I believe my router has been crashing because the way I was using
route-to was generating routing loops.

It appears after a route-to, the packet then gets re-evaluated by
additional rules including additional route-to rules (as it probably
should).

If I have this rule:
pass out on { san0, san1, san2, san3 } route-to { 
  (san0, 10.0.0.1), (san1, 10.1.1.1), 
  (san2, 10.2.2.1), (san3, 10.3.3.1) 
} round-robin

If san0 is the default route that the kernel picks (no kernel
multipath), I think it does something like this:

First packet hits san0 and gets routed out san0.

Second packet hits san0 and gets routed to san1, then san0, then san2,
then san0, then san3, then san0, and out san0.

Third packet hits san0 and gets routed to san1, and out san1.

Fourth packet hits san0 and gets routed to san2, then san1, then san2,
and out san2.

Fifth packet hits san0 and gets routed to san3, then san2, then san3,
and out san3.

Sixth packet hits san0 and gets routed out san0.

Seventh packet hits san0 and gets routed to san1, then san2, then san1,
then san3, then san0, then san2, and out san2.

At some point, the loop becomes long enough to cause a DDB.  With
multiple packets in flight at once, the round-robin may be able to make
the loops even longer.

I don't know what the proper fix for this would be, if any, but an
error message along the lines of "Rule X has already rerouted this
packet, there may be a loop somewhere" would be nicer than a page fault
or an rtfree 2 DDB.

I could also be completely wrong as to the cause of the crashes, but
this seems to be a fairly good guess.

I resolved the crashing by adding some tagging smarts to the rule:
pass out on { san0, san1, san2, san3 } route-to { 
  (san0, 10.0.0.1), (san1, 10.1.1.1), 
  (san2, 10.2.2.1), (san3, 10.3.3.1) 
} round-robin tag ROUTED ! tagged ROUTED

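With the tag in place, a packet that has already been rerouted once
carries the ROUTED tag and no longer matches the rule, so it should not
be rerouted a second time.  This has so far made the load balancing work
very well, and it has run for over 48 hours without a DDB.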
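
To make the guess above a bit more concrete, here is a toy model in
Python.  It is only a sketch of my mental picture, not of what pf
actually does internally: the RoundRobin class, the send() function and
the MAX_HOPS cap are all made up for illustration, and I am assuming
that a packet leaves once the pool happens to pick the interface it is
already on.  Without the tag the rule keeps matching on every hop; with
it, the rule matches at most once per packet.

```python
POOL = ["san0", "san1", "san2", "san3"]
MAX_HOPS = 16  # arbitrary cap so the toy loop always terminates


class RoundRobin:
    """Hypothetical stand-in for the route-to pool: hands out interfaces in order."""

    def __init__(self, pool):
        self.pool = pool
        self.next_index = 0

    def pick(self):
        iface = self.pool[self.next_index % len(self.pool)]
        self.next_index += 1
        return iface


def send(rr, start="san0", use_tag=False):
    """Return the list of interfaces one packet visits in this toy model."""
    path = [start]
    tagged = False
    while len(path) <= MAX_HOPS:
        if use_tag and tagged:
            break  # "! tagged ROUTED": the rule no longer matches this packet
        target = rr.pick()
        tagged = True  # "tag ROUTED": mark the packet as already rerouted
        if target == path[-1]:
            break  # assumption: already on the chosen interface, so it leaves
        path.append(target)  # rerouted, and evaluated again on the new interface
    return path


if __name__ == "__main__":
    untagged = RoundRobin(POOL)
    for n in range(1, 4):
        print("untagged packet", n, send(untagged))
    tagged_pool = RoundRobin(POOL)
    for n in range(1, 6):
        print("tagged packet", n, send(tagged_pool, use_tag=True))
```

In this model an untagged packet can bounce between interfaces until it
hits the cap, while a tagged packet is moved at most once.  The real
sequence on my router was messier than this, so take it as an
illustration of the idea only.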

l8rZ,
-- 
andrew - ICQ# 253198 - JID: afresh1_(_at_)_jabber_(_dot_)_org
     Proud member: http://www.mad-techies.org

BOFH excuse of the day: Dyslexics retyping hosts file on servers