IBM/Lenovo BNT G8000 – firmware upgrade gone wrong – part 1


Hello reader!

You most likely found your way here because Google didn’t show that many posts on this subject – so let’s get straight to my most recent experience with the IBM/Lenovo BNT G8000 switch and the IBM/Lenovo System-X support organisation.

I mean, when the IBM/Lenovo BNT G8000 is working, it’s a beast. It handles network traffic smoothly and extremely fast, and as a network administrator, the feature set certainly kept me happy. In combination with the selectable CLI (either ISCLI, which is “Cisco compatible”, or “Menu”, which is basically the Nortel/Alteon switch interface <3 ), it kept me happy – and the system administrators, with their iSCSI-dependent systems, very happy as well.

We all slept like babies, knowing that our IBM/Lenovo BNT G8000 switches were performing extremely well, 24/7/365.

We did regular firmware upgrades on the units, following the entire v6 series of the firmware, and decided to hold back for a while when they ramped up to v7 and started including MCP Linux kernel updates in the firmware.

Two days ago, it was finally time to take the step: go from v6 firmware up to the latest available v7 release. Obviously, a lot of stuff has changed, but I went through the release notes several times and found nothing that seemed alarming or cause for major concern. Some bug fixes, some security fixes – the usual stuff.

I started the process by prepping the switches with the new firmware as image2, and uploaded the latest boot code as well. Smooth and straightforward, as expected based on previous experiences.
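
For reference, prepping a switch looked roughly like this in the Menu CLI – note that the TFTP server address and image file names below are placeholders, and the exact syntax may differ slightly between firmware versions:

/boot/gtimg image2 192.168.1.10 G8000-7.x.x.x_OS.img
/boot/gtimg boot 192.168.1.10 G8000-7.x.x.x_Boot.img

The first line fetches the new OS image into the image2 slot, the second fetches the matching boot code.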

I leaned back, waiting for the service window at this particular site to start, talking to the system administrator who wanted to be kept in the loop, joking and chatting about general things.

Service window time. Woop!

Rebooted the first switch. The magical seconds that feel like minutes passed, and the switch came up again. Slightly faster response in the command line interface, some new options in the menu. Seemed good.

Checked in with the systems guy, and as expected, he noticed no downtime – the specific site where this pair of IBM/Lenovo BNT G8000 units is located is built with redundancy and resilience in mind, with every host server connected to the two IBM/Lenovo BNT G8000 units with two (or more) Ethernet cables.

Not one single component shall be able to have a noticeable impact on anything if it goes down.

Sounds pretty normal these days, right?

I know.

I take a short break and leave the switch in place with the new firmware running, just in case something starts acting up. iSCSI traffic is flowing as expected, nothing weird on interface counters anywhere. Everything seems perfectly normal.

After a while, I decided it was time to get moving with the second IBM/Lenovo BNT G8000 switch. Since everything is prepped, it’s just a matter of selecting which image to use on boot – and then resetting the switch.
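
In Menu CLI terms, that boils down to roughly the following – again from memory, so double-check against the documentation for your exact firmware version:

/boot/image image2
/boot/reset

The first command selects image2 as the software image for the next boot, and the second reboots the switch.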

Said and done.

The magical seconds pass. Some more seconds pass. Hm. What’s going on?

Oh, there. It responds to ping again! But only eight replies, then silence. WTF?

At this point, our system administrator is losing connectivity to the servers, and shortly after, I am booted out of switch #1, which is no longer responding to ping.

Every now and then, maybe once every other minute, I get access to it for 15–20 seconds – then access is lost again. Switch #2 is completely unreachable.

Okay, this is bad. Really, really bad. This site has iSCSI-based storage for all guest servers in a virtualized environment, and a routine firmware upgrade just caused it to go down?!

A lot of thoughts are running through my head right now, including what headaches this will cause for the system administrators and the users of the services at this site.

Okay, time to get my priorities straight: get at least one of the switches online ASAFP, and look at what happened second.

I get in touch with our hosting provider and ask them for remote hands. As usual, they respond quickly, go on site to power off both switches, and then power on just switch #1. After all, things were working fine with switch #1 before I updated switch #2, so it seemed like a logical choice for getting back online quickly.

I get access to switch #1, and log in to disable all LACP ports toward switch #2. With that done, I ask the hosting provider to power on switch #2 again, so I can connect to it with a serial console cable and see what’s going on.
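
(For the curious: shutting the inter-switch links down is just a matter of disabling the ports that are members of the LACP trunk toward switch #2 – something along these lines, with made-up port numbers for the example.)

/cfg/port 51/dis
/cfg/port 52/dis
apply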

Upon connecting to the serial port, the following floods the terminal so badly that I have no real idea what is going on in the system interface:

ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=18
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=1
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=4
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=6
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=15
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=16
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=17
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=19
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=18
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=1
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=4
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=6
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=15
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=16
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=17
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=19
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2

OK, this is obviously STP… a switch that can’t send BPDUs out of its own ports? Okay, that’s weird… but hey, let’s get rid of the logging first.

/cfg/sys/syslog/log all dis
/cfg/sys/syslog/console dis
apply

I pasted the above once I figured out that I was at a password prompt and managed to log in. No more logging should be presented on the console now – and yet, I am still flooded with the error messages above. Guess they are handled outside the ordinary logging routines. That alone made me very curious, and worried – errors that bypass the logging routines are not common at all, and are usually just debug code left behind by the developers.

But okay. It’s STP. If I kill off STP completely, this should no longer happen.

/cfg/l2/stp off
apply

Yep. The error messages stop. At this point, I’m slowly starting to realize that this switch is broken in a way that is really catastrophic. If STP has gone b0nkers on this switch, it has likely sent invalid STP data to the firewalls and to the neighbouring switches – which could explain the lost connectivity to neighbouring switches and the weird behaviour in the entire environment.

I go through some of the logs and realize that STP has been killing off every single port and enabling them again – but what really caught my eye made me start suspecting an actual hardware failure:

?????? 7:21:27 COMPANY-SW2 NOTICE server: link up on port 3

This should be compared with the log entries on switch #1:

Mar 25 7:22:01 COMPANY-SW1 NOTICE link: link up on port 43

wtf?

Yeah, something has really gone wrong with the firmware upgrade, and it sort of smells like a hardware failure. As I’m sitting here thinking about what I have just seen, another error message presents itself – again, like the BPDU messages, completely outside the scope of the normal logging routines:

UNIT 0 ERROR interrupt: 4 PCI Fatal Errors on Memory read for TX

Right. Fatal memory read error.

Now, these units are not under a support agreement or hardware warranty, so in a normal case it would be a matter of “let me just order a new one” – but the effects of this error are so severe that I decided to report it to IBM/Lenovo System-X either way.

I got in touch with an awesome support guy named Erik, who completely understood my situation and realized that I wasn’t really looking to get “support”.

I wanted to report a severe software issue – both the fact that a switch can be brought down by an erroring neighbour, and the way the firmware checks for and handles a severe hardware failure.

Now, I got a ticket number just today, and Erik recently received my massive reply to his questions, which I guess he’ll be looking into tomorrow – but I am expecting a call from somebody who sees the severity of this situation and wants to accept my error report.

Heck, I’ll even send them this broken IBM/Lenovo BNT G8000 switch so they can investigate this properly – NO environment should ever be taken offline like this because of ONE component failure… and still, this happened.

And yes, this means that firmware updates on the units at the other sites are currently halted and will not proceed until this has been sorted out by IBM/Lenovo.

Oh, right, almost forgot – the automatic IBM/Lenovo support system has sent me an e-mail with a price quote for support on equipment without a support agreement: 553 EUR per hour, with a minimum billable time of 2 hours.

Funny. There will be no 553 EUR per hour paid to be allowed to help IBM/Lenovo System-X fix a severe flaw in their BNT product line. It might be considered if those two hours resulted in a replacement IBM/Lenovo BNT G8000 ending up in the datacenter – but it’s more tempting to go look at another brand and replace all the IBM/Lenovo BNT G8000 units at the other sites if this report isn’t handled properly by IBM/Lenovo.

To be continued …

Additional comment: The System-X support team has informed me that they updated the terms for downloading and using firmware less than a year ago, to require a support contract to be “allowed” to use it. Oops! OK, that’s a separate issue to look into. However, it’s important to note that I am not looking for support here. I do not want to open a support case at all. I want to report a fatal flaw in their software – and this flaw exists whether or not there is a support agreement in place…

2015-03-28 – I’ve gotten a response. Read the follow-up post here.

 
