Error message related to MIOP in syslog after Ceph crash

I had a mysterious crash today on one of my Blade 3 nodes (blade3n4): the Ceph OSD service, which manages the mass storage on a cluster node (osd0 runs on blade3n4), suddenly disappeared. This happened right after miop-ep complained about an unknown IRQ:

[ 5609.295485] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 100
[ 5620.279945] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 200
[ 5638.261852] Process 24925(apport) has RLIMIT_CORE set to 1
[ 5638.261871] Aborting core
[ 5638.529445] libceph: osd0 (1)10.20.0.14:6801 socket closed (con state OPEN)
[ 5638.840150] libceph: osd0 (1)10.20.0.14:6801 socket closed (con state V1_BANNER)
[ 5639.097072] libceph: osd0 (1)10.20.0.14:6801 socket error on write
[ 5639.177906] libceph (8aad3073-39a1-11f1-bf6e-f2704a1efa9b e8153): osd0 down

A closer look at the syslog turned up a fairly long kernel warning with a call trace:

[ 5646.164863] WARNING: CPU: 4 PID: 24883 at kernel/time/timer.c:1425 del_timer_sync+0x34/0x5c
[ 5646.164873] Modules linked in: ceph netfs nf_conntrack_netlink xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_conntrack xt_MASQUERADE br_netfilter bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables overlay qrtr miop_ep(O) miop_ep_net(PO) pcie_ep_rk35(PO) miop_reg(O) binfmt_misc pwm_fan rockchip_canfd can_dev panfrost drm_shmem_helper gpu_sched squashfs sch_fq_codel nvme_fabrics fuse nfnetlink ip_tables ipv6 r8169 dm_mod uio_pdrv_genirq uio
[ 5646.164938] CPU: 4 PID: 24883 Comm: queue44:src Tainted: P        W  O       6.1.0-1027-rockchip #27
[ 5646.164942] Hardware name: Mixtile Blade 3 (DT)
[ 5646.164945] pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5646.164949] pc : del_timer_sync+0x34/0x5c
[ 5646.164952] lr : miop_on_tx_ready+0xb8/0xd8 [miop_ep_net]
[ 5646.164960] sp : ffff80000aa3bdd0
[ 5646.164962] x29: ffff80000aa3bdd0 x28: ffff00015479bf00 x27: 0000000000000000
[ 5646.164967] x26: ffff0001060d0000 x25: ffff8000013131d0 x24: 0000000000000000
[ 5646.164971] x23: ffff0001031bab80 x22: ffff000101373010 x21: 00000000000020b8
[ 5646.164976] x20: ffff0001031b8a00 x19: ffff0001031bb3c0 x18: 0000000000000000
[ 5646.164981] x17: ffff8004f3e60000 x16: ffff80000aa38000 x15: 0000000000000000
[ 5646.164985] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5646.164990] x11: ffff0003fc508488 x10: 0000000000000127 x9 : ffff80000130fb90
[ 5646.164994] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff000100607d98
[ 5646.164999] x5 : ffff00010945e2a8 x4 : 0000000000000000 x3 : ffff00010ad3e908
[ 5646.165003] x2 : 0000000000000004 x1 : ffff0001032f9000 x0 : 0000000002400005
[ 5646.165008] Call trace:
[ 5646.165010]  del_timer_sync+0x34/0x5c
[ 5646.165014]  miop_on_tx_ready+0xb8/0xd8 [miop_ep_net]
[ 5646.165019]  rk35_ep_interrupt+0xb0/0x74c [pcie_ep_rk35]
[ 5646.165027]  __handle_irq_event_percpu+0xc0/0x1cc
[ 5646.165032]  handle_irq_event_percpu+0x20/0x54
[ 5646.165035]  handle_irq_event+0x50/0x94
[ 5646.165038]  handle_fasteoi_irq+0xac/0x134
[ 5646.165041]  handle_irq_desc+0x28/0x40
[ 5646.165044]  generic_handle_domain_irq+0x20/0x2c
[ 5646.165047]  __gic_handle_irq_from_irqson.isra.0+0x174/0x1c8
[ 5646.165052]  gic_handle_irq+0x88/0x90
[ 5646.165056]  call_on_irq_stack+0x24/0x4c
[ 5646.165059]  do_interrupt_handler+0x88/0xa8
[ 5646.165063]  el0_interrupt+0x78/0xac
[ 5646.165068]  __el0_irq_handler_common+0x18/0x24
[ 5646.165071]  el0t_64_irq_handler+0x10/0x1c
[ 5646.165075]  el0t_64_irq+0x19c/0x1a0
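
In case anyone wants to check their own kernel log for the same sequence, here is a minimal sketch of a filter that keeps only the miop-ep, libceph and WARNING lines together with their timestamps. The function name and the marker list are my own choices, not part of MIOP or Ceph, and it assumes every line starts with the bracketed timestamp shown in the excerpts above.

```python
import re

# Matches the "[ seconds.micros] message" prefix of the kernel log lines above.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Substrings of interest, taken from the excerpts in this post (an assumption:
# other setups may need different markers).
MARKERS = ("miop-ep", "libceph", "WARNING:")


def interesting_events(lines):
    """Return (timestamp, message) pairs for lines containing any marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the expected "[ time] message" format
        timestamp, message = float(match.group(1)), match.group(2)
        if any(marker in message for marker in MARKERS):
            events.append((timestamp, message))
    return events


# Small demo using lines copied from the log above.
SAMPLE = [
    "[ 5609.295485] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 100",
    "[ 5638.261871] Aborting core",
    "[ 5639.097072] libceph: osd0 (1)10.20.0.14:6801 socket error on write",
]

for timestamp, message in interesting_events(SAMPLE):
    print(f"{timestamp:12.6f}  {message}")
```

To run it against a real log, pass the lines of saved `dmesg` output to `interesting_events` instead of `SAMPLE`.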

Is this a flaw in MIOP? Could one of your developers please look into this? Ceph can recover by itself when a single OSD crashes, but I really don’t want to wait until the whole Ceph cluster fails. Thank you.

This may be a MIOP issue, but it should not occur here. I have changed the del_timer_sync call to del_timer. However, the most likely cause of this issue is a loose hardware connection. You may want to perform an update.

Will this update also work on Ubuntu 24.04 LTS?

Do you mean the issue could be caused by the node working loose from the backplane, and that I should reseat it?