Is it possible that when the control board stops normal operation (or has a kernel panic) and then reboots itself, the power supply to the Blade 3 nodes is also cut? I’ve noticed that every time the control board crashes (you can hear the fans stop whirring, restart after a couple of seconds, stop again, and then start for good), the uptime command on the control board reports only a couple of minutes, and the nodes not only go offline but also report only a couple of minutes of uptime themselves. Once I even saw the power lights on the nodes go out for several seconds.
And the power for the nodes runs through the DC/DC converter on the control board.
So: does an unwanted shutdown or crash of the control board cause a power outage on the nodes? If so, it’s a design flaw you should really address, because such a cold restart can lead to data loss. It has already destroyed my Ceph cluster once.
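
For anyone who wants to check the same correlation on their own setup, here is a minimal Python sketch of the uptime comparison I described. It assumes every machine runs Linux and that you have collected the contents of /proc/uptime from each one at roughly the same moment (for example over SSH). The host names, the sample values, and the 120-second tolerance are made up for illustration, not taken from my logs.

```python
from datetime import datetime, timedelta, timezone


def parse_proc_uptime(text):
    """Return the uptime in seconds from the contents of /proc/uptime.

    The file holds two numbers; the first one is the uptime in seconds.
    """
    return float(text.split()[0])


def boot_time(uptime_seconds, now):
    """Return the moment a machine booted, given its uptime at time `now`."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_together(uptimes, tolerance_s=120.0):
    """Return True if all uptimes lie within `tolerance_s` of each other.

    `uptimes` maps a host name to its uptime in seconds. The values are
    assumed to have been sampled at about the same time, so a small
    spread means the machines booted at about the same time.
    """
    values = list(uptimes.values())
    return max(values) - min(values) <= tolerance_s


if __name__ == "__main__":
    # Hypothetical samples, e.g. the output of `cat /proc/uptime` per machine.
    raw = {
        "control-board": "187.42 301.10",
        "node1": "163.05 590.77",
        "node2": "161.90 588.12",
    }
    now = datetime.now(timezone.utc)
    uptimes = {host: parse_proc_uptime(text) for host, text in raw.items()}
    for host, seconds in uptimes.items():
        print(host, "booted at", boot_time(seconds, now).isoformat())
    print("rebooted together:", rebooted_together(uptimes))
```

If the nodes keep showing boot times within a minute or two of the control board's after every crash, that points to them losing power together rather than just losing the network.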