Power cut when control board crashes?

Is it possible that when the control board stops normal operation (or has a kernel panic) and then reboots itself, the power supply to the Blade 3 nodes is also cut? I’ve found that every time the control board crashes (you can hear the fans stop whirring, restart after a couple of seconds, stop, and then start again for good) and the uptime command on the control board shows an uptime of only a couple of minutes, the nodes not only go offline but also report only a couple of minutes of uptime. Once I even saw the power lights of the nodes go out for several seconds.

And the power for the nodes runs through the DC/DC converter on the control board.

So: Does an unwanted shutdown or crash of the control board lead to a power outage affecting the nodes? If this is true, it’s a design flaw you should really address because such a cold restart can lead to data loss. Once it even destroyed my Ceph cluster.

Yes, when the control board executes poweroff, it cuts the power to the Blade 3 nodes. Even using nodectl results in a power shutdown. This is a known issue, and it will be fixed in the next version.

As a workaround, you can shut down the Blade 3 nodes via SSH from the control board, and replace or modify the poweroff command to avoid cutting power abruptly.

For example, the poweroff script can be written as:

ssh mixtile@10.20.0.2 sudo poweroff
ssh mixtile@10.20.0.3 sudo poweroff
ssh mixtile@10.20.0.4 sudo poweroff
ssh mixtile@10.20.0.5 sudo poweroff
sleep 5
sudo poweroff

This ensures that each Blade 3 node is properly shut down before the control board powers itself off.
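A fixed sleep of 5 seconds may be too short if a node takes longer to stop its services. The sketch below is one possible variant, not an official script: it loops over the same node addresses and waits until each node stops answering ping before continuing. The names NODES, SSH_USER, RUN and PING are made up for illustration, and "no ping reply" is only a rough signal that a node has gone down.

```shell
#!/bin/sh
# Sketch: ask each Blade 3 node to power off over SSH, then wait until it
# stops answering ping. NODES, SSH_USER, RUN and PING are illustrative
# names (assumptions), not part of nodectl or the control board firmware.

NODES="${NODES:-10.20.0.2 10.20.0.3 10.20.0.4 10.20.0.5}"
SSH_USER="${SSH_USER:-mixtile}"
RUN="${RUN:-}"          # set RUN=echo to print the commands instead of running them
PING="${PING:-ping}"

# Ask every node to power off; keep going if one is unreachable.
shutdown_nodes() {
    for ip in $NODES; do
        $RUN ssh -o ConnectTimeout=5 "$SSH_USER@$ip" sudo poweroff \
            || echo "warning: could not reach $ip" >&2
    done
}

# Wait up to $1 seconds for node $2 to stop answering ping.
# Returns 0 once the node is silent, 1 if it still answers after the timeout.
wait_until_down() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        "$PING" -c 1 -W 1 "$2" >/dev/null 2>&1 || return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

To use it, call shutdown_nodes, then wait_until_down 60 for each address, and only then run sudo poweroff on the control board. Running it once with RUN=echo shows which commands would be sent without shutting anything down.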

You misunderstood me: the regular shutdown process was not what I was talking about! I mean an unwanted poweroff or a crash of the control board, which apparently always leads to an immediate power cut on the electric rail leading to the nodes, switching them off the “hard” way. On my cluster, this has already led to a corrupted Ceph installation (with 1.3 TB of valuable data). This is a serious fault you should address when developing the next version of the control board and its firmware.

This is a hardware-level design issue. Power to the Blade 3 nodes depends on the control board, so when the control board crashes, all nodes go offline at the same time.
For Ceph users, we recommend running a script that syncs data to disk every few minutes or during system idle periods.
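As a rough illustration of such a script, the sketch below calls the standard sync command at a fixed interval on the node where it runs. INTERVAL and the function names are assumptions made for this example. Note that sync only flushes the kernel's dirty buffers to disk; it shortens the window in which a power cut can lose data, but it does not guarantee that a Ceph cluster survives a hard power-off intact.

```shell
#!/bin/sh
# Sketch: flush dirty filesystem buffers to disk at a fixed interval.
# INTERVAL and the function names are illustrative (assumptions),
# not something provided by Mixtile or Ceph.

INTERVAL="${INTERVAL:-300}"   # seconds between flushes (assumed: 5 minutes)

# Flush once and log when it happened.
flush_once() {
    sync && echo "flushed at $(date '+%Y-%m-%d %H:%M:%S')"
}

# Flush $1 times, sleeping $INTERVAL seconds between flushes.
periodic_flush() {
    n=$1
    while [ "$n" -gt 0 ]; do
        flush_once
        n=$((n - 1))
        [ "$n" -gt 0 ] && sleep "$INTERVAL"
    done
    return 0
}
```

For example, periodic_flush 12 with the default interval would flush roughly once every five minutes for an hour; the same effect could also be had by calling flush_once from a cron job on each node.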

You inserted another single point of failure into the system by letting power delivery get cut every time the control board crashes. Not good. :frowning: Please address this issue when developing a new version.