    sparc64: Convert BUG_ON to warning · 2bf7c3ef
    David Ahern authored
    Pagefault handling has a BUG_ON path that panics the system. Convert it to
    a warning instead. There is no need to bring down the system for this kind
    of failure.
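
    The idea can be illustrated with a minimal userspace sketch. It is not the
    actual change to fault_64.c: SKETCH_WARN, struct sketch_regs and
    check_fault are hypothetical names invented for illustration, and only the
    message format is taken from the warning output quoted further down.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for a kernel-style warning macro:
 * report the failed check on stderr and evaluate to 1, instead of
 * halting the way a BUG_ON-style assertion would.
 */
#define SKETCH_WARN(cond, ...) \
    ((cond) ? (fprintf(stderr, "WARNING: " __VA_ARGS__), 1) : 0)

/* Hypothetical, reduced register snapshot; only the trap PC is modelled. */
struct sketch_regs {
    unsigned long long tpc;
};

/*
 * Check that the faulting address matches the trap PC.
 * Returns 1 if a warning was emitted, 0 otherwise. The caller keeps
 * running either way and can fail just the offending task.
 */
static int check_fault(unsigned long long address,
                       const struct sketch_regs *regs)
{
    return SKETCH_WARN(address != regs->tpc,
                       "address (%llx) != regs->tpc (%llx)\n",
                       address, regs->tpc);
}
```

    Calling check_fault() with a mismatched address prints one warning line
    and returns 1, so the caller can go on to deliver a signal to the task
    rather than take the whole system down. In the kernel itself the
    equivalent macro is WARN(condition, format...), which additionally prints
    a call trace.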
    
    The following was hit while running:
        perf sched record -g -- make -j 16
    
    [3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
    [3609412.782833]               \|/ ____ \|/
    [3609412.782833]               "@'/ .. \`@"
    [3609412.782833]               /_| \__/ |_\
    [3609412.782833]                  \__U_/
    [3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
    [3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
    [3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
    [3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
    [3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
    [3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
    [3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
    [3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
    [3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
    [3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
    [3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
    [3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
    [3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
    [3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
    [3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
    [3609412.783308] Call Trace:
    [3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
    [3609412.783353] Disabling lock debugging due to kernel taint
    [3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
    [3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
    [3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d02005> 82086100  02f87f7b  808a2873  81cfe008  01000000
    [3609412.783542] Kernel panic - not syncing: Fatal exception
    [3609412.784605] Press Stop-A (L1-A) to return to the boot prom
    [3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception
    
    With this patch, instead of a panic I occasionally get a warning like the one below when running:
        perf sched record -g -m 1024  -- make -j N
    
    where N is based on the number of cpus (128 to 1024 for a T7-4 and 8 for an
    8-cpu VM on a T5-2).
    
    WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
    address (7feffcd6000) != regs->tpc (fff80001004873c0)
    Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
    CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
    Call Trace:
     [000000000045ce30] warn_slowpath_common+0x7c/0xa0
     [000000000045ceec] warn_slowpath_fmt+0x30/0x40
     [000000000098ad64] do_sparc64_fault+0x340/0x70c
     [0000000000407c2c] sparc64_realfault_common+0x10/0x20
    ---[ end trace 62ee02065a01a049 ]---
    ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]
    
    The segfault is horrible, but better than a system panic.
    
    An 8-cpu VM on a T5-2 also showed the above traces from time to time,
    so it is a general problem and not specific to the T7 or baremetal.
    Signed-off-by: David Ahern <david.ahern@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>