Network errors when booting up a Mixtile Blade 3 as a cluster node

I’m experiencing severe problems with the internal network of the Cluster Box: Accessing the external network from the control board typically works fine, but when I try to reach the external network from a cluster node (any Blade 3, no matter which one), downloads hang after more than roughly 1 MB of data. I rebooted one of the nodes via the control board, logged into it, and found a batch of error messages that seem related to a networking issue:

root@blade3n1:~# dmesg | tail -35
[   57.720852] r8169 0002:24:00.0: Unable to load firmware rtl_nic/rtl8125b-2.fw (-2)
[   57.747464] RTL8226B_RTL8221B 2.5Gbps PHY r8169-2-2400:00: attached PHY driver [RTL8226B_RTL8221B 2.5Gbps PHY] (mii_bus:phy_addr=r8169-2-2400:00, irq=IGNORE)
[   57.867650] r8169 0002:24:00.0 enP2p36s0: Link is Down
[   58.228952] usb_gadget_probe_driver udc_name=fc000000.usb, dev_name=fc000000.usb
[   58.360878] android_work: did not send uevent (0 0 0000000000000000)
[   58.517711] mali fb000000.gpu: Loading Mali firmware 0x1010000
[   58.518162] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   58.518174] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   58.518181] mali fb000000.gpu: Protected memory allocator not found, Firmware protected mode entry will not be supported
[   60.216618] systemd-journald[281]: File /var/log/journal/24b0e758b78949088e06ee4bfcbcef83/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[   62.425928] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 2
[   65.120067] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120090] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   65.120094] stream_cif_mipi_id0: update sensor info failed -19
[   65.120255] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120268] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   65.120271] stream_cif_mipi_id1: update sensor info failed -19
[   65.120339] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.120345] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   65.120347] stream_cif_mipi_id2: update sensor info failed -19
[   65.121224] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.121256] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   65.121267] stream_cif_mipi_id3: update sensor info failed -19
[   65.121915] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.121936] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   65.121943] rkcif_scale_ch0: update sensor info failed -19
[   65.123307] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.123328] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   65.123345] rkcif_scale_ch1: update sensor info failed -19
[   65.123553] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.123570] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   65.123577] rkcif_scale_ch2: update sensor info failed -19
[   65.124184] rockchip-csi2-dphy0: No link between dphy and sensor
[   65.124200] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   65.124209] rkcif_scale_ch3: update sensor info failed -19

What’s wrong here? The MIOP firmware version miop-control-blade3-arm64-v0.0.3-20240523 installed on the nodes is already up to date, and I am using the revised version of the control board (with the OCuLink connectors at the back).
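One thing I can check myself: if I read the first r8169 line correctly, the -2 is -ENOENT, i.e. the firmware blob simply isn’t on disk. A quick sketch of that check (the package name is the standard Debian one):

```shell
#!/bin/sh
# The "-2" in the r8169 message is -ENOENT: the kernel could not find
# the blob rtl_nic/rtl8125b-2.fw under /lib/firmware.
fw=/lib/firmware/rtl_nic/rtl8125b-2.fw
if [ -e "$fw" ]; then
    echo "firmware present: $fw"
else
    # On Debian, Realtek NIC firmware ships in the firmware-realtek package:
    echo "firmware missing -- try: apt install firmware-realtek"
fi
```

(The missing firmware only degrades the 2.5GbE NIC, though; it shouldn’t affect the PCIe-internal network.)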

UPDATE: On another node, I additionally get a PCIe-related error message afterwards:

root@blade3n4:~# dmesg | tail -35
[   64.577332] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577337] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   64.577340] stream_cif_mipi_id3: update sensor info failed -19
[   64.577543] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577551] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[0] get remote terminal sensor failed!
[   64.577554] rkcif_scale_ch0: update sensor info failed -19
[   64.577632] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.577638] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[1] get remote terminal sensor failed!
[   64.577640] rkcif_scale_ch1: update sensor info failed -19
[   64.577735] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.578001] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[2] get remote terminal sensor failed!
[   64.578005] rkcif_scale_ch2: update sensor info failed -19
[   64.578110] rockchip-csi2-dphy0: No link between dphy and sensor
[   64.578116] rkcif-mipi-lvds2: rkcif_update_sensor_info: stream[3] get remote terminal sensor failed!
[   64.578118] rkcif_scale_ch3: update sensor info failed -19
[   71.867113] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   71.867175] miop-ep fe150000.pcie: TX[1]: Queue is full.
[   88.938474] miop-ep fe150000.pcie: TX[0]: Queue is full.
[   88.938535] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  121.786155] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  121.786217] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  190.392068] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  190.392130] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  319.472388] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  319.472451] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  329.097643] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  329.097704] miop-ep fe150000.pcie: TX[1]: Queue is full.
[  350.139620] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 3
[  350.485478] ttyFIQ ttyFIQ0: tty_port_close_start: tty->count = 1 port count = 2
[  609.315517] miop-ep fe150000.pcie: TX[0]: Queue is full.
[  609.315578] miop-ep fe150000.pcie: TX[1]: Queue is full.
[ 1153.024487] miop-ep fe150000.pcie: TX[0]: Queue is full.
[ 1153.024548] miop-ep fe150000.pcie: TX[1]: Queue is full.
[ 2233.728447] miop-ep fe150000.pcie: TX[0]: Queue is full.
[ 2233.728508] miop-ep fe150000.pcie: TX[1]: Queue is full.

Dear Jacek,
From your description, when you restart one of the Blade 3 nodes, the other devices misbehave. Since the device does not currently support hot-plugging, restarting a single blade will affect the normal operation of the others. The correct procedure is to restart the whole device; we are also working on optimizing this behavior to provide the best possible user experience.

I get the same trouble (albeit without so many syslog messages) after booting the whole cluster, too.

When I run apt update on one of my nodes (a Blade 3), I experience very slow speeds, and the OpenWRT-based BMC reboots afterward. As a result, I can’t perform any network-related actions; when attempting to install packages, the network speed drops significantly.

root@blade3-n2:~# apt update
Get:1 http://mirrors.ustc.edu.cn/debian bullseye InRelease [116 kB]
Get:2 http://mirrors.ustc.edu.cn/debian-security bullseye-security InRelease [27.2 kB]
Get:3 http://mirrors.ustc.edu.cn/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://mirrors.ustc.edu.cn/debian bullseye/main Sources [8500 kB]
15% [4 Sources 1123 kB/8500 kB 13%]


Log from the OpenWRT BMC:

[  263.118215] hrtimer: interrupt took 79997 ns
[  427.098241] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  428.118282] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  429.128200] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  430.138259] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  431.148238] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  432.168203] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  433.178243] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  434.188238] miop 0000:05:00.0: DMA timeout, restart DMA controller.
[  435.198196] miop 0000:05:00.0: DMA timeout, restart DMA controller.

@Buyuliang Is it true that there are still issues with the miop driver, which (also) controls the PCIe switch the blades are attached to?

Yes, there are still some issues with the miop driver. In some scenarios, DMA timeouts occur; we are trying to reproduce and resolve the problem.

January is now almost over. Have you managed to resolve the driver issue?

I have done some more testing. Using iperf3 between the nodes, I get speeds of up to 5.6 Gbit/s, but when I increase the number of parallel connections I get errors and even sudden reboots of the Cluster Box. Connections to the control card via iperf3 reach 250 Mbit/s; I get errors like this:

[ 558.999618] miop-ep fe150000.pcie: miop_queue_put() failed for reuse buffer.
[ 559.215913] miop-ep fe150000.pcie: miop_queue_put() failed for reuse buffer.
[ 559.216478] miop-ep fe150000.pcie: miop_queue_put() failed for reuse buffer.

These appear as the number of in-flight connections increases.
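To narrow down the threshold, I ramp up the number of parallel streams. This dry run just prints the invocations I use (192.168.2.13 is a placeholder for the peer blade running iperf3 -s):

```shell
#!/bin/sh
# Dry run: print the iperf3 invocations used to ramp up the number of
# parallel streams (-P) until the miop queue errors appear.
# 192.168.2.13 is a placeholder address, not one from this cluster.
for streams in 1 2 4 8; do
    echo "iperf3 -c 192.168.2.13 -P $streams -t 10"
done
```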

This error occurs during home-made routing between blades n2 and n3 (central to the cluster). Could this be exhaustion of the memory reserved for the buffers? In fact, the situation does not improve even with:

sudo ip link set dev pci0 txqueuelen 100000 (on each node)

It seems that once this error occurs, the condition worsens (new dmesg messages after each routed transfer, unstable SSH connections) until the slowdown makes connections impossible. Is something accumulating in the buffer, or similar? Why is this situation not reversible?

Do you have any advice?
From a blade node:

sudo ethtool pci0
No data available

On the Blade 3 side, with the miop-ep driver, how is it possible to manipulate the values of:

the TX ring size
the sizes of the TX socket buffers?
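For reference, the only knobs I can reach from userspace seem to be the kernel-side socket-buffer limits; the values below are illustrative, not tuned, and the driver may ignore or cap them:

```shell
# Hypothetical /etc/sysctl.d/90-miop.conf -- illustrative, untuned values.
# The miop-ep TX ring size appears to be fixed inside the driver and is
# not exposed via ethtool (hence "No data available"); only the kernel's
# socket-buffer limits can be raised from userspace. Apply with
# "sysctl --system":
#
#   net.core.wmem_max = 8388608              # max TX socket buffer, bytes
#   net.core.rmem_max = 8388608              # max RX socket buffer, bytes
#   net.ipv4.tcp_wmem = 4096 262144 8388608  # min / default / max
```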

Thanks a lot.

This is the output of:

sudo tcpdump -i pci0

09:33:11.323700 IP blade3-N1.41122 > blade3-N4.ssh: Flags [.], ack 45113, win 512, options [nop,nop,TS val 1305005630 ecr 2533616080], length 0
09:33:11.324439 IP blade3-N4.ssh > blade3-N1.41122: Flags [P.], seq 45113:53377, ack 4269, win 509, options [nop,nop,TS val 2533616082 ecr 1305005630], length 8264
09:33:11.324763 IP blade3-N4.ssh > blade3-N1.41122: Flags [.], seq 53377:86017, ack 4269, win 509, options [nop,nop,TS val 2533616082 ecr 1305005630], length 32640

Can anyone tell me whether these values are abnormal?
This is routing between blades, i.e. over the PCIe connection. It seems that excessive traffic triggers this kind of error.

Well, I haven’t reached the point where I could test the speed of the network connection between nodes, but my experience is that when pulling an OS update via apt-get upgrade (Internet → control board → node), the connection already stalls after some 5 MB of data, so it’s not really “excessive traffic”.

Good news.

Prof.,
I hadn’t noticed this; however, I was already thinking of creating a route to the gateway via the Blade 3’s 2.5 Gbit/s network card.

I was saying:
I have partially solved the problem of the connection interruptions that follow an excessive number of errors like the one described; it is a palliative, but for the moment it seems to guarantee a minimum of stability when forwarding connections.

As suggested by our good boss ChatGPT: the trick is to activate RPS (Receive Packet Steering) to balance the receive-processing load across different CPUs.

Solution: (on every Blade as root, no sudo)

echo 2 > /sys/class/net//queues/rx-0/rps_cpus
reboot
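For reference, the value written to rps_cpus is a hexadecimal CPU bitmask: bit N steers receive processing to CPU N, so the “2” above (binary 10) selects CPU 1 only. A small helper to compute single-CPU masks:

```shell
#!/bin/sh
# rps_cpus takes a hex CPU bitmask: bit N enables receive-packet
# steering to CPU N, so "2" (binary 10) means CPU 1 only.
cpu_mask() {
    # hex mask for a single CPU index
    printf '%x\n' $((1 << $1))
}

cpu_mask 0   # -> 1 (CPU 0)
cpu_mask 1   # -> 2 (CPU 1)
cpu_mask 3   # -> 8 (CPU 3)
# A combined mask for CPUs 1-3 would be 2+4+8 = e, e.g. (as root):
#   echo e > /sys/class/net/pci0/queues/rx-0/rps_cpus
```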

By the way, it could also be useful on the control card, but I don’t feel like touching that.

Thank you so far.

Your command (or the one you got from ChatGPT) does not work on my nodes, as the directory /sys/class/net/queues/ is missing. BTW, the double slash // in the path you stated is obviously wrong.

UPDATE: The correct path reads as follows: /sys/class/net/pci0/queues/rx-0/rps_cpus. And no, I still have network issues, with error messages like these two on the control board side:

  • [ 660.751012] miop-ep fe150000.pcie: TX[0]: Queue is full.
  • [ 505.970795] miop 0000:06:00.0: DMA timeout, restart DMA controller.

Hi, There is an update for the TCP/IP over PCIe (MIOP) driver of the ClusterBox product, which resolves the following two issues:

  1. Fixed random driver crashes.
  2. Resolved the issue of a full PCIe queue.

Please refer to the following link for the update details and usage tutorial:
ClusterBox MIOP Driver Update Instructions | Mixtile

With ClusterBox MT7620A, do you mean the control board?

Yes, it indeed refers to the control board. I sincerely apologize for the confusion caused. We will promptly revise the instructions and ensure they are more accurate in the future.

OK, I’ve now updated the MIOP driver on the control board following your instructions. About the new Debian image, I’ve found a mismatch between the version you recommend in your firmware upgrade guide: image-release-blade3-debian11-20230505.img

…and the version I’ve already got on my blades: Linux blade3n1 5.10.66 #127 SMP Mon Oct 30 14:11:23 CST 2023 aarch64

Additionally, I get many lines of an obscure error message in the syslog of the control board, which look like this:

[ 504.173322] miop 0000:06:00.0: DMA timeout, restart DMA controller.

So the image you let me download is actually slightly older (2023-05-05) than the one I’ve already got (2023-10-30). Or do you generally recommend installing the Ubuntu image?

We highly recommend installing the Ubuntu image via the link provided.
Blade3 Firmware Download Link
For the Debian image, we are currently resolving some issues. The updated version, which will include the new MIOP driver, is scheduled for release by this Friday.

OK, thank you. But: The last error message (the DMA issue) I sent you came from the control board, not from any of the nodes. Does it mean that the current version of the MIOP driver for the control board also has an issue?

When Blade 3 accesses the internet via the Control Board, it may briefly encounter DMA timeout errors. It is recommended to directly use the 2.5GbE port on each Blade 3 for internet access, as this provides faster and more stable connectivity.
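A minimal sketch of that recommendation — interface name (eth0) and gateway address (192.168.1.1) are placeholders, not values from this thread:

```shell
#!/bin/sh
# Route each blade's default traffic out of its own 2.5GbE port
# instead of the PCIe link to the control board. Substitute the real
# interface and gateway for the placeholders below.
GW=192.168.1.1
IF=eth0
# Printed for review -- drop the "echo" and run as root to apply:
echo ip route replace default via "$GW" dev "$IF"
```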

No, this is not the case: in fact, the nodes don’t even see each other, nor do they see the control board, and DHCP requests never arrive at the control board. Despite that, I get the DMA error every three seconds, even when the nodes aren’t trying to access the network at all.

I can try to install the Ubuntu image on the nodes, but I am skeptical whether this will solve my issue.