Like several other users, I have the problem that the control board of the Cluster Box sometimes crashes. When I’m logged in via ssh, I get an error message like this:
client_loop: send disconnect: Broken pipe
On the control board itself, I see a whole series of the messages I’ve already written to you about several times:
[ 292.012003] miop 0000:03:00.0: DMA timeout, restart DMA controller.
I also always see this message on the Blade 3 nodes if I am connected to them when the control board crashes. Once I can log in to the control board and the cluster nodes again a couple of minutes later, uptime reports only a few minutes on the control board and on all four nodes:
mixtile@ClusterBox:~$ uptime
17:11:05 up 7 min, load average: 5.91, 7.30, 3.59
This suggests that a crashing control board takes the nodes down with it, probably by cutting their power. Is that true? If so, please tell me what I can do to make sure that such a power cut won’t occur. The last time it happened, I was building a Ceph cluster, and it left me with an unfinished installation. One more question: is data exchanged between the nodes (which can amount to several terabytes) forwarded through the control board?
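In case it helps to reproduce the observation, here is a minimal sketch of how the simultaneous reboot could be confirmed from uptime readings. It assumes the content of /proc/uptime is collected from each board at roughly the same moment (for example over ssh); the host names, the sample values, and the 60-second tolerance are only illustrative assumptions, not something taken from the Cluster Box itself.

```python
from datetime import datetime, timedelta
from typing import Dict


def parse_proc_uptime(text: str) -> float:
    """Return the uptime in seconds from the content of /proc/uptime."""
    # /proc/uptime holds two numbers: the uptime and the summed idle time.
    return float(text.split()[0])


def boot_time(now: datetime, uptime_seconds: float) -> datetime:
    """Estimate when a board booted from the current time and its uptime."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_together(uptimes: Dict[str, float], tolerance_s: float = 60.0) -> bool:
    """True if all boards booted within tolerance_s seconds of each other.

    Assumes every uptime was sampled at about the same moment.
    """
    values = list(uptimes.values())
    return max(values) - min(values) <= tolerance_s


if __name__ == "__main__":
    # Hypothetical readings; the host names are placeholders.
    samples = {
        "control-board": parse_proc_uptime("420.53 1603.11\n"),
        "node1": parse_proc_uptime("415.10 1580.02\n"),
        "node2": parse_proc_uptime("416.87 1591.40\n"),
    }
    print("All boards rebooted together:", rebooted_together(samples))
```

If this returns True right after a crash, the control board and the nodes restarted at nearly the same time, which would be consistent with the power being cut rather than with only the ssh connection dropping.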
Thank you.