Network Event Status

If this information has spelling and grammar errors, please remember that network engineers post here, live, without any executive review.

2/17/2012 1:38PM We just disabled a BGP session with one major transit provider in San Jose. This one has been tricky: the session resets suddenly at random times, sometimes remaining up for 8 hours, other times 5 hours or 3 hours. We believe it is due to an incompatibility between router brands. The configuration needs to be manually tuned on that provider's end so that their router will stop resetting the BGP session.

2/16/2012 about 11:50. Around 15-20 mins ago, a large transit provider dropped its circuit to us. Over 300,000 prefixes were automatically removed from the San Jose core router tables and replaced by other prefixes, meaning a BGP flap occurred. The prefixes have been restored and we are investigating.

2/13/2012 Errors logged on San Jose Market Post Tower core router at 5:17PM. CPU failed over to the backup CPU, with blips during failover and some routes lost. Tech arrived, saw everything running, and at 7PM decided to remove the bad CPU and put in a spare CPU. This required a reboot of the core. Down time was 5 mins total; however, the BGP session needed to be restored and the expected BGP spottiness was unavoidable. Some Internet sites went away and came back during this time. All appears stable. Core again has dual CPUs.

1/10/2012 2:39 PM San Jose - core router reboot. Emergency. We have noticed some odd routing issues; to properly clear and reset the tables we must reboot. This could result in some BGP flapping for up to 20 minutes. Some sites work and others do not, then it settles down and all work.

10/21/2011 7 AM San Jose - All Clear - As you know, we have been monitoring carefully, with techs and spares standing by day and night. It has been more than 16 hours since an incident occurred. It looks like the tech team may have resolved this very intermittent hardware problem yesterday afternoon.

10/20/2011 3:45 PM San Jose - We have had a tech with spares on site all day, waiting for the intermittent issue to arise. It happened again. These are hard to track down when they don't remain a hard failure. But we have multiple spares and the tech in the datacenter all day. Rest assured that if this persists, we will begin moving C.O. paths and rerouting customers to another core router.

10/20/2011 San Jose core router, 5 min outage plus BGP flap on some neighbors during repair. We may have replaced the wrong processor card yesterday (there are 2, master and slave, on the same bus). Some customers experienced slowness and dropped packets. Our continued monitoring and testing revealed we should replace the second processor too, before a hard failure occurred. Tech is standing by while we monitor.

10/19/2011 Slow repair (up/down during 1 hour) to minimize impact. Event is over - Memory issue reappeared in San Jose. Impact was multiple momentary up-downs and BGP flaps in San Jose due to the attempt to narrow down the memory issue using the spares. Default routes to other core router paths were used to help minimize impact during the transitions. Due to yesterday's event we had spares prepared with the correct code images and config files for this core router, and San Jose is now operating on a new processor card with the original slave card in place.

10/18/2011 5:25 PM - Event is over - minor impact - Less than 3 mins total. San Jose core router had a memory issue due to an older code version and the fact that it had not been rebooted in more than 2 years. This required an emergency reboot. With a tech standing in front of the unit, we rebooted it. Down time 5 mins. Customers routing through the San Jose Data Center were affected.

6/15/2011 9:30PM-10:30PM Los Angeles Area - Intermittent controller module problem in main core router. This router has dual CPU blades. The master had flapping issues and had to be pulled. It would go bad and the watchdog would flip the slave to master; then the watchdog would find the original master to be good again and flip back; then that blade would fail again, looping. Pulled the bad CPU module. Problem resolved. A new module will be installed this week and should not cause any issue when inserted. We now suspect this caused the 6/1/2011 incident and it took this long to fail harder.

6/1/2011 10:30PM-11:30PM Los Angeles Area - Event is over. Power surge in one of our racks at One Wilshire caused a memory leak in the core router. Most customers in Southern California were affected. No events logged in any routers. Very unusual. All blades pass diagnostics.

Fiber Internet Center LLC          SF 650-330-0428      LA 213-342-2878      OC 714-619-0146      SD 619-398-0268       copyright 2011