I’m now having this issue for the second time: when I reboot one of the nodes (this time it was necessary), the internal PCIe network goes down. The only error message I get on the control board is from MIOP, complaining about a full queue:
[ 1635.528174] miop 0000:06:00.0: TX[0]: Queue is full.
The four nodes keep running, but are no longer attached to the network. I can’t even ping them from the control board, and none of the nodes gets a DHCP lease from it. On the node, pci0 is up but has only an IPv6 link-local address:
mixtile@blade3n4:~$ ip addr show
[…]
5: pci0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65202 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a4:f1:25:38:0a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1c52:a8ae:6089:c6a6/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
The only “remedy” (really just a workaround) is to reboot the whole cluster, including all nodes.
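
In case it helps anyone detect this state automatically, here is a minimal sketch that checks whether an interface still has an IPv4 address. It assumes the interface is called pci0 as in the output above and that the `ip` tool from iproute2 is installed. The function names `has_ipv4` and `check_interface` are made up for illustration and are not part of any Mixtile tooling.

```python
import re
import subprocess


def has_ipv4(ip_addr_output: str) -> bool:
    """Return True if `ip addr show` text contains an IPv4 ("inet") entry."""
    # IPv4 entries start with "inet " while IPv6 entries start with "inet6 ",
    # so requiring whitespace right after "inet" excludes the IPv6 lines.
    return any(
        re.match(r"\s*inet\s", line) for line in ip_addr_output.splitlines()
    )


def check_interface(name: str = "pci0") -> bool:
    """Run `ip addr show dev <name>` and report whether it has an IPv4 address.

    The default name "pci0" is an assumption taken from the output in this post.
    """
    result = subprocess.run(
        ["ip", "addr", "show", "dev", name],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the interface does not exist
    )
    return has_ipv4(result.stdout)
```

On a node in the broken state shown above, `check_interface()` should return `False`, since pci0 only carries a link-local IPv6 address. This only detects the symptom; it does nothing about the full TX queue on the control board.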