IBM/Lenovo BNT G8000 – the official support response

This is a follow-up to my previous post about a firmware upgrade of an IBM/Lenovo BNT G8000 that went bad.

I was hoping I would keep talking to Erik, because he seemed to understand what I was looking for. Unfortunately, he was off on Friday, and I ended up trying to explain this to a new person (whom I will not name).

So, I’ve again been mailing back and forth with System-X support for a day, trying to explain that I am not looking for support – but rather looking to report a severe error in their firmware for the BNT G8000, one that can bring down entire networks if a certain hardware error occurs.

As a sidenote, thinking about this for a while and talking with some colleagues made me suspect there might be a serious security flaw involved as well. What if someone were to create something that mimics the behaviour of the broken switch – would they be able to take any BNT-based network offline?

Food for thought, but it is concerning.

Anyway, I know you want to know the official response from IBM System-X support, and it is as follows:


I have talked to my managers about this issue, and it is as I informed in my previous e-mail.

Since the machine is not covered by a support agreement, you are using a firmware that you are not entitled to use. If you want further help with this, you must approve the cost suggestion for support, then we can move forward.

Please let us know how you want to proceed.


So yes, the support organisation of IBM/Lenovo System-X wants me to pay 553 EUR per hour to help them sort out what seems to be a severe flaw in their BNT product line.

It seems IBM might have made the decision to move to another brand very easy, but I have not given up completely just yet. It could be that I am simply stuck with a support organisation following its instructions to the letter, with no room to handle special cases.

I have some other approaches to look into on Monday to get this report handled, but the frustration of banging my head against the wall here is .. big.

Addition: Just to clarify, since some people seem to misunderstand this post for some reason: I am not trying to “open a support case”. I am not looking for support at all. The switch is broken – a new one will be bought. Shit happens. What I have tried to do is report a severe system flaw and potential security issue in their product, and the existence of a support agreement or a “right to use the firmware” (which was a “quiet change” in IBM’s terms less than a year ago – shame on me for missing it) is not relevant here at all.

Please add your comments, thoughts and questions in the comment section below – and don’t forget, you can always reach me at blog@engren.se if you want a personal contact.


IBM/Lenovo BNT G8000 – firmware upgrade gone wrong – part 1


Hello reader!

You most likely found your way here because Google didn’t show that many posts on this subject – so let’s get straight to my latest experience with the IBM/Lenovo BNT G8000 switch and the IBM/Lenovo System-X support organisation.

I mean, when the IBM/Lenovo BNT G8000 is working, it’s a beast. It handles the network smoothly and extremely fast – and as a network administrator, the features for sure kept me happy. In combination with the selectable CLI (either ISCLI, which is “Cisco compatible”, or “Menu”, which is basically the Nortel/Alteon switch interface <3 ) it kept me happy, and the system administrators with their iSCSI-dependent systems very happy as well.

We all slept like babies, knowing that our IBM/Lenovo BNT G8000 switches were performing extremely well, 24/7/365.

We did regular firmware upgrades on the units, keeping up with the entire v6 series of the firmware, and decided to hold back for a while when they ramped up to v7 and included MCP Linux kernel updates in the firmware.

Two days ago, it was finally time to take the step: go from the v6 firmware up to the latest available v7 release. Obviously, a lot of stuff has changed, but I went through the release notes several times and found nothing that seemed alarming or any cause for major concern. Some bug fixes, some security fixes – the usual stuff.

I started the process by prepping the switches with the new firmware as image2, and uploaded the latest boot code as well. Smooth and straightforward, as expected based on previous experience.
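For reference, the prep step in the menu CLI boils down to something like the following. The TFTP server address and file names are placeholders, and I’m going from memory on the exact syntax, so treat this as a sketch rather than a copy-paste recipe:

/boot/gtimg image2 192.168.10.5 G8000-7.x.x.x_OS.img
/boot/gtimg boot 192.168.10.5 G8000-7.x.x.x_Boot.img
/boot/cur

The first command pulls the new OS image into the image2 slot, the second pulls the matching boot code, and /boot/cur lets you verify what ended up where before touching anything else.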

I leaned back, waiting for the service window at this particular site to start, talking to the system administrator who wanted to be kept in the loop and joking about this and that.

Service window time. Woop!

Rebooted the first switch. The magical seconds that feel like minutes passed, and the switch came up again. Slightly faster response in the command line interface, some new options in the menu. Seemed good.

Checked in with the systems guy, and as expected, he noticed no downtime – this specific site, where this set of IBM/Lenovo BNT G8000 units is located, is built with redundancy and resilience in mind, with every host server connected to the two IBM/Lenovo BNT G8000 units with two (or more) Ethernet cables.

Not one single component should be able to have a noticeable impact on anything if it goes down.

Sounds pretty normal these days, right?

I know.

I take a short break and leave the switch in place with the new firmware running, just in case something starts acting up. iSCSI traffic is flowing as expected, nothing weird on the interface counters anywhere. Everything seems perfectly normal.

After a while, I decided it was time to get moving with the second IBM/Lenovo BNT G8000 switch. Since everything is prepped, it’s just a matter of setting which image to use on boot – and then resetting the switch.
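In the menu CLI, that amounts to roughly this (again from memory, so the exact paths may differ slightly between releases):

/boot/image image2
/boot/reset

/boot/image selects which software image to load on the next boot, and /boot/reset reboots the switch.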

Said and done.

The magical seconds pass. Some more seconds pass. Hm. What’s going on?

Oh, there. It responds to ping again! But only eight replies. WTF?

At this point, our system administrator is losing connectivity to the servers, and shortly after I am booted out of switch #1, which is no longer responding to ping.

Every now and then, roughly once every other minute, I get access to it for 15–20 seconds – then access is lost again. Switch #2 is completely unreachable.

Okay, this is bad. Really, really bad. This site has iSCSI-based storage for all guest servers in a virtualized environment, and a routine firmware upgrade just caused it to go down?!

Many thoughts are running through my head right now, including what headaches this will cause for the system administrators and the users of the services at this site.

Okay, time to get my priorities straight: get at least one of the switches online ASAFP, and figure out what happened second.

I get in touch with our hosting provider and ask them for remote hands. As usual, they respond quickly, go on site to power off both switches, and then power on just switch #1. After all, things were working fine with switch #1 before I updated switch #2, so it seemed like a logical way to move forward quickly.

I get access to switch #1 and log in to disable all LACP ports towards switch #2. With that done, I ask the hosting provider to turn switch #2 on again, so I can connect to it with a serial console cable and see what’s going on.
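For the curious: “disable all LACP ports towards switch #2” is nothing fancier than shutting down the physical ports of the inter-switch trunk on switch #1. The port numbers below are made up for the example – use whatever ports your trunk actually sits on:

/cfg/port 43/dis
/cfg/port 44/dis
apply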

Upon connecting to the serial port, the following output is flooding the serial terminal so badly that I have no real idea what is going on in the system interface:

ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=18
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=1
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=4
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=6
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=15
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=16
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=17
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=19
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=18
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=1
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=4
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=6
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=15
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=16
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=17
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=19
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=23
ERROR: mp_bpdu_send_ucast failed to send packet to u=1 p=2

OK, this is obviously STP … which can’t send packets out to the machines connected to the switch itself? Okay, that’s weird … but hey, let’s get rid of the logging.

/cfg/sys/syslog/log all dis
/cfg/sys/syslog/console dis
apply

I pasted the above once I figured out that I was sitting at a password prompt and managed to log in. No more logging should be presented on the screen now – and yet, I am still flooded with the error messages above. I guess they are handled outside the ordinary logging routines. That alone made me very curious, and worried – errors that bypass the logging routines are not common at all and are usually debug code left behind by the developers.

But, okay. It’s STP. If I kill off STP completely, this should no longer happen.

/cfg/l2/stp off
apply

Yep. The error messages stop. At this point, I’m slowly starting to realize that this switch is broken in a way that is truly catastrophic. If STP has gone b0nkers on this switch, it has likely sent invalid STP data to the firewalls and the neighbouring switches – which could explain the lost connectivity to neighbouring switches and the weird behaviour across the entire environment.

I go through some of the logs and realize that STP has been killing off every single port and enabling them again – but what really caught my eye made me start suspecting an actual hardware failure:

?????? 7:21:27 COMPANY-SW2 NOTICE server: link up on port 3

This should be compared with the log entries on switch #1:

Mar 25 7:22:01 COMPANY-SW1 NOTICE link: link up on port 43


Yeah, something has really gone wrong with the firmware upgrade, and it sort of smells like a hardware failure. As I’m sitting here thinking about what I have just seen happen, another error message presents itself – again, like the BPDU messages, completely outside the scope of the normal logging routines:

UNIT 0 ERROR interrupt: 4 PCI Fatal Errors on Memory read for TX

Right. Fatal memory read error.

Now, these units are not under a support agreement or hardware warranty, so normally it would just be a matter of “let me order a new one” – but the effects of this error are so extremely severe that I decide to report this to IBM/Lenovo System-X anyway.

I got in touch with an awesome support guy named Erik, who completely understood my situation and realized that I wasn’t really looking to get “support”.

I wanted to report a severe software issue – both with regard to a switch being able to be brought down by a failing neighbour, and with regard to hardware checks and how the firmware handles a severe hardware failure.

Now, I got a ticket number just today, and Erik only recently received my massive reply to his questions, which I guess he’ll be looking into tomorrow – but I am expecting a call from somebody who sees the severity of this situation and wants to accept my error report.

Heck, I’ll even send them this broken IBM/Lenovo BNT G8000 switch so they can investigate this properly – NO environment should ever be able to be taken offline like this because of ONE component failure .. and still, this happened.

And yes, this means that firmware updates on the other units on other sites are currently halted and will not proceed until this has been sorted by IBM/Lenovo.

Oh, right, I almost forgot – the automatic IBM/Lenovo support system has sent me a mail with a price suggestion for support on equipment without a support agreement: 553 EUR per hour, with a minimum billable time of 2 hours.

Funny. There will be no 553 EUR per hour paid for the privilege of helping IBM/Lenovo System-X fix a severe system flaw in their BNT product line. It might be considered if the 2 hours resulted in a replacement IBM/Lenovo BNT G8000 ending up in the datacenter – but it’s more tempting to go look at another brand and replace all the IBM/Lenovo BNT G8000 units at the other sites if this report isn’t handled properly by IBM/Lenovo.

To be continued …

Additional comment: System-X support has informed me that they updated the terms for downloading and using firmware less than a year ago, so that a support contract is now required to be “allowed” to use it. Oops! OK, that’s a separate issue to look into. However, it’s important to note that I am not looking for support here. I do not want to open a support case at all. I want to report a fatal flaw in their software – and this flaw exists whether there is a support agreement in place or not…

2015-03-28 – I’ve gotten a response. Read the follow up post here.


Some pfSense commands to keep handy!



pfctl -d – Deactivate the pf packet filter (disables all firewall functions)
pfctl -e – Activate the pf packet filter (enables all firewall functions)
pfctl -sn – Show the current NAT rules
pfctl -sr – Show the current filter rules
pfctl -ss – Show the current state table
pfctl -sa – Show as much as possible
viconfig – Manually edit the configuration in /conf/config.xml. Once the file has been saved and the editor exited, /tmp/config.cache is removed, so the next config reload event loads config.xml rather than the cached version. You can run the next command to trigger an instant reload.
/etc/rc.reload_all – Reload the firewall with the full configuration. This also restarts the webgui and sshd, but keeps the current ssh sessions active, just like a regular sshd restart.
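As an example of how the last two fit together when making a change from the shell – the grep on em0 below is just a hypothetical sanity check on whichever interface you touched:

viconfig
/etc/rc.reload_all
pfctl -sr | grep em0
pfctl -ss

Edit /conf/config.xml by hand and save, reload the full configuration (which also restarts the webgui and sshd), and then verify that the active filter rules and the state table look the way you expect.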

Strong two-factor authentication

The YubiKey® is the leading one time password token for simple, open online identity protection. It can be configured for several modes, including OATH, and is also available with an integrated RFID chip for combined digital and physical access.

Yubico has shipped a million YubiKeys to more than 18,000 customers in 100 countries and in all continents. YubiKey users range from independent developers to e-governments, universities and global enterprises.

It’s ideal for any organisation that is looking to increase its security on the digital side. The one time password (OTP) configuration of these keys is brilliant: at the end of the life span (99999 OTPs generated) you have to reprogram the key if you want to keep getting into your systems once the OTPs have been exhausted – or better yet, trash the key and program a brand new one. By then the physical key has likely seen plenty of situations that have shortened its physical lifespan anyway, and the cost of these tokens is not really that high.

Security policy often dictates that one should not leave these tokens behind, but if you are not using it for anything other than authentication, you’re not really going to bring it anywhere. I know I wouldn’t – I’d leave it hooked up to my laptop even when I’m not there, basically leaving the entire office network, administrative systems and so on wide open for unauthorized use.

By using the RFID version of the Yubikey for physical entrance to the office, server rooms and other areas of your company, you also ensure that your employees get into the healthy habit of always removing the Yubikey from the computer and bringing it with them.

I’ve been involved in a project to automatically program these units with a static password for use with, for example, Truecrypt-encrypted laptops. Type your own password, directly followed by a keypress on the Yubikey, and you have two-factor auth even for booting into the company laptop.
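If you want to play with this yourself, a key can be put into static mode with the ykpersonalize tool from the yubikey-personalization package. The invocation below is just a minimal sketch of one way to do it, not necessarily how our project configures the keys:

ykpersonalize -2 -ostatic-ticket -a$(openssl rand -hex 16)

Slot 2 will then emit a fixed string when triggered, so the user types their own password first and finishes with a touch on the Yubikey.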

The goal of this project is to automate as much of the key configuration as possible. I’ve written the actual system script that does the logic for the key: gathering personal details, generating a unique AES key and a unique UID used to set up the OTP in combination with LDAP/Radius/Kerberos (for VPN and Windows Samba/Linux PAM logins), and fetching the necessary details from the key using a homebrewed C program that talks to the Yubikey, so the key can be uniquely identified in case it is ever dropped and you want to identify it.
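Stripped of all the LDAP/Radius plumbing, the core of that provisioning step looks roughly like the sketch below. The public identity, output path and slot choice are hypothetical – the point is simply that the AES key and private UID are generated per key and then have to end up in the validation backend as well:

aeskey=$(openssl rand -hex 16)   # 128-bit AES key, unique per key
uid=$(openssl rand -hex 6)       # 6-byte private UID embedded in every OTP
ykpersonalize -1 -ofixed=cccccclkjhbt -ouid=$uid -a$aeskey -y
echo "$aeskey $uid" >> /secure/path/otp-secrets   # hypothetical hand-off to the LDAP/Radius side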