I had some mysterious application crashes today on one of my Blade 3 nodes (blade3n4): the Ceph OSD service (the daemon that manages mass storage on a cluster node; osd0 runs on blade3n4) suddenly disappeared. This happened shortly after miop-ep complained about an unknown IRQ:
[ 5609.295485] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 100
[ 5620.279945] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 200
[ 5638.261852] Process 24925(apport) has RLIMIT_CORE set to 1
[ 5638.261871] Aborting core
[ 5638.529445] libceph: osd0 (1)10.20.0.14:6801 socket closed (con state OPEN)
[ 5638.840150] libceph: osd0 (1)10.20.0.14:6801 socket closed (con state V1_BANNER)
[ 5639.097072] libceph: osd0 (1)10.20.0.14:6801 socket error on write
[ 5639.177906] libceph (8aad3073-39a1-11f1-bf6e-f2704a1efa9b e8153): osd0 down
A closer look at the syslog turned up a long kernel warning with a call trace:
[ 5646.164863] WARNING: CPU: 4 PID: 24883 at kernel/time/timer.c:1425 del_timer_sync+0x34/0x5c
[ 5646.164873] Modules linked in: ceph netfs nf_conntrack_netlink xt_nat veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_conntrack xt_MASQUERADE br_netfilter bridge stp llc xt_set ip_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables overlay qrtr miop_ep(O) miop_ep_net(PO) pcie_ep_rk35(PO) miop_reg(O) binfmt_misc pwm_fan rockchip_canfd can_dev panfrost drm_shmem_helper gpu_sched squashfs sch_fq_codel nvme_fabrics fuse nfnetlink ip_tables ipv6 r8169 dm_mod uio_pdrv_genirq uio
[ 5646.164938] CPU: 4 PID: 24883 Comm: queue44:src Tainted: P W O 6.1.0-1027-rockchip #27
[ 5646.164942] Hardware name: Mixtile Blade 3 (DT)
[ 5646.164945] pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 5646.164949] pc : del_timer_sync+0x34/0x5c
[ 5646.164952] lr : miop_on_tx_ready+0xb8/0xd8 [miop_ep_net]
[ 5646.164960] sp : ffff80000aa3bdd0
[ 5646.164962] x29: ffff80000aa3bdd0 x28: ffff00015479bf00 x27: 0000000000000000
[ 5646.164967] x26: ffff0001060d0000 x25: ffff8000013131d0 x24: 0000000000000000
[ 5646.164971] x23: ffff0001031bab80 x22: ffff000101373010 x21: 00000000000020b8
[ 5646.164976] x20: ffff0001031b8a00 x19: ffff0001031bb3c0 x18: 0000000000000000
[ 5646.164981] x17: ffff8004f3e60000 x16: ffff80000aa38000 x15: 0000000000000000
[ 5646.164985] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 5646.164990] x11: ffff0003fc508488 x10: 0000000000000127 x9 : ffff80000130fb90
[ 5646.164994] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff000100607d98
[ 5646.164999] x5 : ffff00010945e2a8 x4 : 0000000000000000 x3 : ffff00010ad3e908
[ 5646.165003] x2 : 0000000000000004 x1 : ffff0001032f9000 x0 : 0000000002400005
[ 5646.165008] Call trace:
[ 5646.165010] del_timer_sync+0x34/0x5c
[ 5646.165014] miop_on_tx_ready+0xb8/0xd8 [miop_ep_net]
[ 5646.165019] rk35_ep_interrupt+0xb0/0x74c [pcie_ep_rk35]
[ 5646.165027] __handle_irq_event_percpu+0xc0/0x1cc
[ 5646.165032] handle_irq_event_percpu+0x20/0x54
[ 5646.165035] handle_irq_event+0x50/0x94
[ 5646.165038] handle_fasteoi_irq+0xac/0x134
[ 5646.165041] handle_irq_desc+0x28/0x40
[ 5646.165044] generic_handle_domain_irq+0x20/0x2c
[ 5646.165047] __gic_handle_irq_from_irqson.isra.0+0x174/0x1c8
[ 5646.165052] gic_handle_irq+0x88/0x90
[ 5646.165056] call_on_irq_stack+0x24/0x4c
[ 5646.165059] do_interrupt_handler+0x88/0xa8
[ 5646.165063] el0_interrupt+0x78/0xac
[ 5646.165068] __el0_irq_handler_common+0x18/0x24
[ 5646.165071] el0t_64_irq_handler+0x10/0x1c
[ 5646.165075] el0t_64_irq+0x19c/0x1a0
Is this a bug in MIOP? Could one of your developers please look into this? Ceph can recover on its own when a single OSD crashes, but I really don’t want to wait for the whole Ceph cluster to fail. Thank you.
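To check whether the unknown-IRQ messages and the OSD dropouts line up across more than one occurrence, the kernel log timestamps can be compared with a small script. The following is only a minimal sketch: it assumes the log has been saved as plain text in the usual `[ seconds.micros] message` dmesg format, and the function names `parse_kernel_log` and `irq_to_osd_gaps` are made up for illustration. It shows correlation in time only and says nothing about the actual cause.

```python
import re

# Matches the "[ 5609.295485] message" prefix used by dmesg-style output.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_kernel_log(text):
    """Return (timestamp_seconds, message) pairs for every dmesg-style line."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def irq_to_osd_gaps(events, osd="osd0"):
    """For each libceph '<osd> down' line, return the seconds elapsed
    since the most recent miop-ep 'unknown irq' line (hypothetical helper)."""
    gaps = []
    last_irq = None
    for ts, msg in events:
        if "miop-ep" in msg and "unknown irq" in msg:
            last_irq = ts
        elif "libceph" in msg and msg.rstrip().endswith(f"{osd} down"):
            if last_irq is not None:
                gaps.append(ts - last_irq)
    return gaps


# Three lines taken from the log above, used as sample input.
sample = """\
[ 5609.295485] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 100
[ 5620.279945] miop-ep fe150000.pcie: miop_ep_elbi_int() unknown irq: 200
[ 5639.177906] libceph (8aad3073-39a1-11f1-bf6e-f2704a1efa9b e8153): osd0 down
"""
gaps = irq_to_osd_gaps(parse_kernel_log(sample))
print(gaps)
```

Running this over a full log from several incidents would show whether every `osd0 down` is preceded by an unknown IRQ within a similar time window, which might help narrow things down.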