Internal network breaks upon reboot of a single node

ProfessorFate · November 30, 2025, 7:16pm

I’m now having this issue for the second time: When I reboot one of the nodes (this time it was necessary), the internal PCIe network is gone. The only error message I get on the control board is from MIOP complaining about a full queue:

[ 1635.528174] miop 0000:06:00.0: TX[0]: Queue is full.
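
To catch this condition automatically, the kernel log can be scanned for that message. The following is only a rough sketch: the pattern is derived from the single log line quoted above, so other MIOP messages may be formatted differently, and the function name `find_queue_full` is made up for illustration.

```python
import re
import sys

# Pattern based only on the one log line quoted in this post (assumption:
# other occurrences of the message use the same layout).
QUEUE_FULL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+miop\s+(?P<dev>\S+):\s+"
    r"TX\[(?P<queue>\d+)\]:\s+Queue is full\."
)


def find_queue_full(lines):
    """Return (timestamp, pci_device, tx_queue) for every matching log line."""
    hits = []
    for line in lines:
        m = QUEUE_FULL.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("dev"), int(m.group("queue"))))
    return hits


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for ts, dev, queue in find_queue_full(sys.stdin):
        print(f"{ts:.6f} {dev} TX[{queue}] queue full")
```

Saved under a name of your choice, it could be fed from the kernel log on the control board, for example `dmesg | python3 miop_scan.py` (the filename is hypothetical).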

The four nodes keep running, but they are no longer attached to the network. I can’t even ping them from the control board, and none of the nodes gets a DHCP lease from it:

mixtile@blade3n4:~$ ip addr show
[…]
5: pci0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65202 qdisc fq_codel state UP group default qlen 1000
    link/ether 02:a4:f1:25:38:0a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1c52:a8ae:6089:c6a6/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
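
The symptom in the output above is that pci0 carries only an IPv6 link-local address and no IPv4 address. A small check for that state could look like the sketch below. It relies on the JSON output of iproute2 (`ip -j addr show`); the helper name `ipv4_addresses` is hypothetical, and the interface name `pci0` is taken from the output above.

```python
import json
import subprocess


def ipv4_addresses(ip_json_text, ifname):
    """Return the IPv4 addresses listed for ifname in `ip -j addr show` output."""
    addrs = []
    for iface in json.loads(ip_json_text):
        if iface.get("ifname") != ifname:
            continue
        for info in iface.get("addr_info", []):
            # "inet" is IPv4; "inet6" entries such as the link-local address are skipped.
            if info.get("family") == "inet":
                addrs.append(info["local"])
    return addrs


if __name__ == "__main__":
    out = subprocess.run(
        ["ip", "-j", "addr", "show", "dev", "pci0"],
        check=True, capture_output=True, text=True,
    ).stdout
    found = ipv4_addresses(out, "pci0")
    print("pci0 IPv4:", ", ".join(found) if found else "none (no DHCP lease?)")
```

An empty result on a node that normally gets its address via DHCP would match the broken state described here. It does not say why the lease is missing.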

The only “remedy” (rather: workaround) is to reboot the whole cluster together with all nodes.

Buyuliang · December 8, 2025, 1:48am

Previously, the driver did not support hot-swapping, which is why this issue occurred.

ProfessorFate · December 8, 2025, 2:40pm

Well, it’s not yet “hot swapping” when one node goes off and the whole network breaks down. Please address this bug when developing the next version of MIOP. Thank you.