Network errors when booting up a Mixtile Blade 3 as a cluster node

I’m experiencing severe problems with the internal network of the Cluster Box. Accessing the external network from the control board typically works fine, but when I try to reach the external network from a cluster node (a Blade 3; it doesn’t matter which one), downloads hang as soon as they involve more than 1 MB of data. I have now rebooted one of the nodes via the control board, logged into the node, and got a bunch of error messages that seem to be related to a networking issue:

root@blade3n1:~# dmesg | tail -35
[   57.720852] r8169 0002:24:00.0: Unable to load firmware rtl_nic/rtl8125b-2.fw (-2)
[   57.747464] RTL8226B_RTL8221B 2.5Gbps PHY r8169-2-2400:00: attached PHY driver [RTL8226B_RTL8221B 2.5Gbps PHY] (mii_bus:phy_addr=r8169-2-2400:00, irq=IGNORE)
[   57.867650] r8169 0002:24:00.0 enP2p36s0: Link is Down
[   58.228952] usb_gadget_probe_driver udc_name=fc000000.usb, dev_name=fc000000.usb
[   58.360878] android_work: did not send uevent (0 0 0000000000000000)
[   58.517711] mali fb000000.gpu: Loading Mali firmware 0x1010000
[   58.518162] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   58.518174] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   58.518181] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   60.216618] systemd-journald[281]: File /var/log/journal/24b0e758b78949088e06ee4bfcbcef83/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[   62.425928] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 2
[   65.120067] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120090] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   65.120094] stream_cif_mipi_id0: update sensor info failed -19
[   65.120255] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120268] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   65.120271] stream_cif_mipi_id1: update sensor info failed -19
[   65.120339] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120345] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   65.120347] stream_cif_mipi_id2: update sensor info failed -19
[   65.121224] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.121256] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   65.121267] stream_cif_mipi_id3: update sensor info failed -19
[   65.121915] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.121936] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   65.121943] rkcif_scale_ch0: update sensor info failed -19
[   65.123307] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.123328] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   65.123345] rkcif_scale_ch1: update sensor info failed -19
[   65.123553] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.123570] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   65.123577] rkcif_scale_ch2: update sensor info failed -19
[   65.124184] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.124200] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   65.124209] rkcif_scale_ch3: update sensor info failed -19

What’s wrong here? The MIOP firmware installed on the nodes (miop-control-blade3-arm64-v0.0.3-20240523) is already up to date, and I am using the revised version of the control board (with the Oculink connectors at the back).

UPDATE: On another node, I even get another PCIe-related error message afterwards:

root@blade3n4:~# dmesg | tail -35
[   64.577332] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577337] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   64.577340] stream_cif_mipi_id3: update sensor info failed -19
[   64.577543] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577551] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   64.577554] rkcif_scale_ch0: update sensor info failed -19
[   64.577632] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577638] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   64.577640] rkcif_scale_ch1: update sensor info failed -19
[   64.577735] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.578001] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   64.578005] rkcif_scale_ch2: update sensor info failed -19
[   64.578110] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.578116] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   64.578118] rkcif_scale_ch3: update sensor info failed -19
[   71.867113] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   71.867175] miop-ep fe150000.pcie: TX[1]: Queue is full.
[   88.938474] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   88.938535] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  121.786155] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  121.786217] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  190.392068] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  190.392130] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  319.472388] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  319.472451] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  329.097643] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  329.097704] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  350.139620] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 3
[  350.485478] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 2
[  609.315517] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  609.315578] miop-ep fe150000.pcie: TX[1]: Queue is full.
[ 1153.024487] miop-ep fe150000.pcie: TX[0]: Queue is full.
[ 1153.024548] miop-ep fe150000.pcie: TX[1]: Queue is full.
[ 2233.728447] miop-ep fe150000.pcie: TX[0]: Queue is full.
[ 2233.728508] miop-ep fe150000.pcie: TX[1]: Queue is full.
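
To make the "Queue is full" pattern easier to compare between nodes, the timestamps can be pulled out of the dmesg text and the gaps between consecutive events computed per TX queue. The following is only a minimal sketch: the message format is copied from the log above, while the function names `queue_full_events` and `gaps` are made up for illustration and are not part of any Mixtile tooling. It does not explain the cause; it only shows whether the messages arrive at regular or growing intervals.

```python
import re
from collections import defaultdict

# Matches lines in the format seen in the log above, e.g.
# [   71.867113] miop-ep fe150000.pcie: TX[0]: Queue is full.
QUEUE_FULL = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+miop-ep\s+\S+:\s+TX\[(\d+)\]: Queue is full\."
)


def queue_full_events(dmesg_text):
    """Return {TX queue index: [timestamps in seconds]} (hypothetical helper)."""
    events = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = QUEUE_FULL.search(line)
        if match:
            events[int(match.group(2))].append(float(match.group(1)))
    return dict(events)


def gaps(timestamps):
    """Seconds between consecutive events, rounded to milliseconds."""
    return [round(later - earlier, 3)
            for earlier, later in zip(timestamps, timestamps[1:])]


# A few lines taken verbatim from the blade3n4 log above.
SAMPLE = """\
[   71.867113] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   71.867175] miop-ep fe150000.pcie: TX[1]: Queue is full.
[   88.938474] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   88.938535] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  121.786155] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  121.786217] miop-ep fe150000.pcie: TX[1]: Queue is full.
"""

if __name__ == "__main__":
    for queue, stamps in sorted(queue_full_events(SAMPLE).items()):
        print(f"TX[{queue}]: {len(stamps)} events, gaps (s): {gaps(stamps)}")
```

To run it against a real node, replace `SAMPLE` with the saved output of `dmesg` from that node.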

Dear Jacek,
From your description, I understand that the other devices behave abnormally after you restart one of the Blade 3 nodes. The device does not currently support hot-plugging, so restarting a single node affects the normal operation of the others. The correct approach is to restart the whole device. We are also working on optimizing this part in order to provide the best user experience.

I have the same problem (albeit with fewer syslog messages) after booting the whole cluster, too.