Yeah. I have already done the firmware update on the nodes. Please note that the error message I mentioned appeared on the control board, not on the nodes.
Please confirm whether the PCIe device is recognized correctly and whether the negotiated link rate is normal, for example:
root@ClusterBox:~# lspci -s 03:00.0 -vvv|grep LnkSta
LnkSta: Speed 8GT/s, Width x4
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
Do you mean only on the control board, or also on the nodes?
This command is executed on the control board to check the link status of buses 3, 4, 5, and 6. For example, to check bus 3:
lspci -s 03:00.0 -vvv|grep LnkSta
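To check all four buses in one pass, the per-bus command above can be wrapped in a small loop. This is only a sketch: the function names parse_lnksta and check_buses are made up for illustration, and it assumes the nodes sit at 03:00.0 through 06:00.0 as described above and that the script runs as root so lspci can read the link status.

```shell
#!/bin/sh
# Sketch only: parse_lnksta and check_buses are illustrative names.

# Read "lspci -vvv" output on stdin and print "<speed> <width>"
# taken from the LnkSta line (the LnkSta2 line is ignored).
parse_lnksta() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\)[^,]*, Width \(x[0-9]*\).*/\1 \2/p'
}

# Query buses 03..06 (assumed node addresses) and print one line per bus.
check_buses() {
    for bus in 03 04 05 06; do
        result=$(lspci -s "$bus:00.0" -vvv 2>/dev/null | parse_lnksta)
        printf '%s:00.0 %s\n' "$bus" "${result:-no LnkSta found}"
    done
}
```

Calling check_buses on the control board prints the negotiated speed and width for each bus, so a link that came up narrower than the others stands out.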
Ah, buses 4 to 6 as well! I’ve just run the command as you described, but I got some errors:
mixtile@ClusterBox:~$ sudo lspci -s 03:00.0 -vvv|grep LnkSta
LnkSta: Speed 8GT/s, Width x2 (downgraded)
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
lspci: Unable to load libkmod resources: error -2
mixtile@ClusterBox:~$ sudo lspci -s 04:00.0 -vvv|grep LnkSta
LnkSta: Speed 8GT/s, Width x2 (downgraded)
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
lspci: Unable to load libkmod resources: error -2
mixtile@ClusterBox:~$ sudo lspci -s 05:00.0 -vvv|grep LnkSta
LnkSta: Speed 8GT/s, Width x2 (downgraded)
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
lspci: Unable to load libkmod resources: error -2
mixtile@ClusterBox:~$ sudo lspci -s 06:00.0 -vvv|grep LnkSta
LnkSta: Speed 8GT/s, Width x2 (downgraded)
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
lspci: Unable to load libkmod resources: error -2
When I run the command under strace, I get even more errors:
mixtile@ClusterBox:~$ sudo strace -e file lspci -s 03:00.0 -vvv |& grep ENOENT
open("/etc/ld-musl-mipsel-sf.path", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libpci.so.3", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libpci.so.3", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libkmod.so.2", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libkmod.so.2", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libz.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libz.so.1", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/root/.pciids-cache", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/sys/bus/pci/devices/0000:03:00.0/label", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/sys/bus/pci/devices/0000:03:00.0/numa_node", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
readlink("/sys/bus/pci/devices/0000:03:00.0/iommu_group", 0x7fb2b39c, 1024) = -1 ENOENT (No such file or directory)
readlink("/sys/bus/pci/devices/0000:03:00.0/of_node", 0x7fb2b39c, 1024) = -1 ENOENT (No such file or directory)
open("/sys/module/compression", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/etc/modprobe.d", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb2a580) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/run/modprobe.d", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb2a580) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/usr/local/lib/modprobe.d", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb2a580) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/lib/modprobe.d", AT_STATX_SYNC_AS_STAT, STATX_BASIC_STATS, 0x7fb2a580) = -1 ENOENT (No such file or directory)
open("/lib/modules/5.15.150/modules.softdep", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/modules/5.15.150/modules.dep.bin", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
The lspci output shows Speed 8GT/s and Width x2, so the link has negotiated only 2 lanes. Is an M.2 SSD connected to each Blade 3? Also, the libkmod error message from lspci can be ignored for now; it does not affect the result.
No, there are (still) no M.2 SSDs installed in the cluster box, but I’m planning to add them.
BTW, the speed of the control board’s LAN port has suddenly dropped to only 100 Mbit/s.
Judging from the logs and descriptions you provided, your four Blade 3s have been flashed with https://downloads.mixtile.com/cluster-box/blade3-ubuntu-images/ubuntu-24.04-preinstalled-desktop-arm64-mixtile-blade3-rockchip-format.zip and no SSD is connected. The PCIe link should normally negotiate Speed 8GT/s and Width x4, but it now shows Speed 8GT/s and Width x2, which is a problem. My suggestion is to check the Blade 3 boards for contact issues, or to re-flash the firmware and test again.
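The same comparison can be made without lspci by reading the link attributes the kernel exposes in sysfs, which also avoids the libkmod message. A rough sketch, assuming the control board's kernel provides current_link_width and max_link_width for these devices (not confirmed in this thread); link_width_status is a hypothetical helper name. Note that max_link_width is the width the device itself advertises, so it only matches the expected x4 if the Blade 3 endpoint advertises four lanes.

```shell
#!/bin/sh
# Sketch only: link_width_status is an illustrative name.

# Compare the negotiated link width with the maximum the device advertises.
# $1: sysfs device directory, e.g. /sys/bus/pci/devices/0000:03:00.0
# Returns 0 if the link runs at full width, 1 if downgraded,
# 2 if the sysfs attributes cannot be read.
link_width_status() {
    dev=$1
    cur=$(cat "$dev/current_link_width" 2>/dev/null) || return 2
    max=$(cat "$dev/max_link_width" 2>/dev/null) || return 2
    if [ "$cur" -lt "$max" ]; then
        echo "downgraded: x$cur of x$max"
        return 1
    fi
    echo "ok: x$cur"
}
```

For example, link_width_status /sys/bus/pci/devices/0000:03:00.0 checks the first node; the same call can be repeated for 04, 05, and 06.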
Already done that. No problems found.
Nope. In fact, I’ve got a version of Debian Bookworm. Does the version of Ubuntu you mentioned fully support the Cluster Box’s MIOP? Does this also apply to your server version?
root@blade3:/# neofetch
root@blade3
-----------
OS: Debian GNU/Linux 12 (bookworm) aarch64
Host: Mixtile Blade 3 v1.0.1
Kernel: 6.1.99
Uptime: 5 mins
Packages: 1460 (dpkg)
Shell: bash 5.2.15
WM: Xfwm4
Theme: Adwaita [GTK3]
Icons: Adwaita [GTK3]
CPU: (8) @ 1.800GHz
Memory: 486MiB / 15947MiB
Exactly. But: I’d like to do that once the cluster works as expected.
Does the nodectl flash command work now?
Currently, the Ubuntu desktop version offers better support, so we recommend using that firmware. The nodectl flash function is not yet available, so you still need to flash each device individually.
OK, I’ve just flashed all four nodes with the Ubuntu distro you sent me, but:
- Apparently, you must set the node into MaskROM mode before flashing, as otherwise the node won’t be recognised.
- The new distro does not solve the original problem of the connection breaking when trying to download something to a node. Additionally, the control board now becomes very slow in some circumstances.
Node 1:
mixtile@mixtile-ubuntu:~$ sudo apt-get update
Hit:1 http://ports.ubuntu.com noble InRelease
Get:2 http://ports.ubuntu.com noble-updates InRelease [126 kB]
Get:3 http://ports.ubuntu.com noble-backports InRelease [126 kB]
Get:4 https://ppa.launchpadcontent.net/jjriek/panfork-mesa/ubuntu noble InRelease [17.8 kB]
Get:5 http://ports.ubuntu.com noble-security InRelease [126 kB]
0% [5 InRelease 2,572 B/126 kB 2%]client_loop: send disconnect: Broken pipe
Control board:
[ 308.581738] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 309.591778] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 310.601900] miop 0000:06:00.0: DMA timeout, restart DMA controller.
[ 311.611940] miop 0000:06:00.0: DMA timeout, restart DMA controller.
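To see which node the timeouts pile up on, the kernel log can be tallied per PCI address. A small sketch that reads log text on stdin; count_dma_timeouts is a hypothetical helper, and the pattern is taken from the four log lines above.

```shell
#!/bin/sh
# Sketch only: count_dma_timeouts is an illustrative name.

# Read kernel log text on stdin and print "<pci-address> <count>"
# for every device that logged a miop "DMA timeout" message.
count_dma_timeouts() {
    awk '/miop .*DMA timeout/ {
        for (i = 1; i < NF; i++) {
            if ($i == "miop") {
                addr = $(i + 1)
                sub(/:$/, "", addr)   # drop the trailing colon
                count[addr]++
                break
            }
        }
    }
    END { for (addr in count) print addr, count[addr] }' | sort
}
```

It can be fed from the live log, for example with dmesg | count_dma_timeouts, run as root on the control board.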
BTW, this is what I get in the control board’s kernel log directly after booting up, right after I had received the broken pipe:
mixtile@ClusterBox:~$ sudo dmesg | tail -35
[ 33.273584] miop 0000:03:00.0: probing MIOP node on bus:03
[ 34.036673] miop 0000:03:00.0: PCIe bus number 3 mapped to MIOP node id: 2
[ 34.044954] miop 0000:03:00.0: pci_alloc_irq_vectors() only alloc 1 vectors
[ 34.062477] miop 0000:03:00.0: miop irq on tx ready
[ 34.111196] miop 0000:03:00.0: MIOP node[2] on bus:03 is online
[ 34.117678] miop 0000:04:00.0: card - bus=0x4, slot = 0x0 irq=4
[ 34.123844] miop 0000:04:00.0: probing MIOP node on bus:04
[ 34.129460] miop 0000:04:00.0: PCIe bus number 4 mapped to MIOP node id: 3
[ 34.137619] miop 0000:04:00.0: pci_alloc_irq_vectors() only alloc 1 vectors
[ 34.155713] miop 0000:04:00.0: miop irq on tx ready
[ 34.244668] miop 0000:04:00.0: MIOP node[3] on bus:04 is online
[ 34.251110] miop 0000:05:00.0: card - bus=0x5, slot = 0x0 irq=4
[ 34.257278] miop 0000:05:00.0: probing MIOP node on bus:05
[ 34.262901] miop 0000:05:00.0: PCIe bus number 5 mapped to MIOP node id: 1
[ 34.271059] miop 0000:05:00.0: pci_alloc_irq_vectors() only alloc 1 vectors
[ 34.289820] miop 0000:05:00.0: miop irq on tx ready
[ 34.359149] miop 0000:05:00.0: MIOP node[1] on bus:05 is online
[ 34.365618] miop 0000:06:00.0: card - bus=0x6, slot = 0x0 irq=4
[ 34.371761] miop 0000:06:00.0: probing MIOP node on bus:06
[ 34.377377] miop 0000:06:00.0: PCIe bus number 6 mapped to MIOP node id: 0
[ 34.385556] miop 0000:06:00.0: pci_alloc_irq_vectors() only alloc 1 vectors
[ 34.404051] miop 0000:06:00.0: miop irq on tx ready
[ 34.443187] miop 0000:06:00.0: MIOP node[0] on bus:06 is online
[ 35.089768] 8021q: adding VLAN 0 to HW filter on device eth0
[ 35.146707] device eth0 entered promiscuous mode
[ 35.171010] br-lan: port 1(eth0.1) entered blocking state
[ 35.176613] br-lan: port 1(eth0.1) entered disabled state
[ 35.182607] device eth0.1 entered promiscuous mode
[ 36.120473] IPv6: ADDRCONF(NETDEV_CHANGE): pci0: link becomes ready
[ 39.240293] mtk_soc_eth 10100000.ethernet eth0: port 5 link up (100Mbps/Full duplex)
[ 39.265169] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 39.290115] br-lan: port 1(eth0.1) entered blocking state
[ 39.295711] br-lan: port 1(eth0.1) entered forwarding state
[ 39.301857] IPv6: ADDRCONF(NETDEV_CHANGE): eth0.2: link becomes ready
[ 39.393188] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
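The eth0 line in this log shows the LAN link coming up at 100Mbps, matching the slowdown mentioned earlier. To pull just the negotiated link speeds out of a longer log, something like the following could be used. It is a sketch: link_up_speeds is a hypothetical helper, and it assumes the "link up (<speed>/<duplex>)" message format seen above.

```shell
#!/bin/sh
# Sketch only: link_up_speeds is an illustrative name.

# Read kernel log text on stdin and print "<interface> <speed>"
# for every "link up (<speed>/<duplex>)" message.
link_up_speeds() {
    sed -n 's/.* \([^ ]*\): .*link up (\([0-9.]*[MG]bps\)\/.*/\1 \2/p'
}
```

Running dmesg | link_up_speeds after each reboot or cable change makes it easy to see whether the port negotiates a different speed.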
I have already done that, @Buyuliang! As stated in another thread, I’m now at least able to work around this issue by running an HTTP proxy on the control board and making all blades use it when contacting external web servers. It’s not the best solution, but at least it works.