1. 05 Jul, 2017 40 commits
    • Guillaume Nault's avatar
      l2tp: ensure session can't get removed during pppol2tp_session_ioctl() · 806e9883
      Guillaume Nault authored
      commit 57377d63 upstream.
      
      Holding a reference on session is required before calling
      pppol2tp_session_ioctl(). The session could get freed while processing the
      ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
      we also need to take a reference on it in l2tp_session_get().
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Signed-off-by: default avatarGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarAmit Pundir <amit.pundir@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      806e9883
    • Guillaume Nault's avatar
      l2tp: fix race in l2tp_recv_common() · 6539c4f9
      Guillaume Nault authored
      commit 61b9a047 upstream.
      
      Taking a reference on sessions in l2tp_recv_common() is racy; this
      has to be done by the callers.
      
      To this end, a new function is required (l2tp_session_get()) to
      atomically lookup a session and take a reference on it. Callers then
      have to manually drop this reference.
      
      Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
      Signed-off-by: default avatarGuillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarAmit Pundir <amit.pundir@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6539c4f9
    • Baolin Wang's avatar
      usb: gadget: f_fs: Fix possibe deadlock · d2da8d39
      Baolin Wang authored
      commit b3ce3ce0 upstream.
      
      When system try to close /dev/usb-ffs/adb/ep0 on one core, at the same
      time another core try to attach new UDC, which will cause deadlock as
      below scenario. Thus we should release ffs lock before issuing
      unregister_gadget_item().
      
      [   52.642225] c1 ======================================================
      [   52.642228] c1 [ INFO: possible circular locking dependency detected ]
      [   52.642236] c1 4.4.6+ #1 Tainted: G        W  O
      [   52.642241] c1 -------------------------------------------------------
      [   52.642245] c1 usb ffs open/2808 is trying to acquire lock:
      [   52.642270] c0  (udc_lock){+.+.+.}, at: [<ffffffc00065aeec>]
      		usb_gadget_unregister_driver+0x3c/0xc8
      [   52.642272] c1  but task is already holding lock:
      [   52.642283] c0  (ffs_lock){+.+.+.}, at: [<ffffffc00066b244>]
      		ffs_data_clear+0x30/0x140
      [   52.642285] c1 which lock already depends on the new lock.
      [   52.642287] c1
                     the existing dependency chain (in reverse order) is:
      [   52.642295] c0
      	       -> #1 (ffs_lock){+.+.+.}:
      [   52.642307] c0        [<ffffffc00012340c>] __lock_acquire+0x20f0/0x2238
      [   52.642314] c0        [<ffffffc000123b54>] lock_acquire+0xe4/0x298
      [   52.642322] c0        [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
      [   52.642328] c0        [<ffffffc00066f7bc>] ffs_func_bind+0x504/0x6e8
      [   52.642334] c0        [<ffffffc000654004>] usb_add_function+0x84/0x184
      [   52.642340] c0        [<ffffffc000658ca4>] configfs_composite_bind+0x264/0x39c
      [   52.642346] c0        [<ffffffc00065b348>] udc_bind_to_driver+0x58/0x11c
      [   52.642352] c0        [<ffffffc00065b49c>] usb_udc_attach_driver+0x90/0xc8
      [   52.642358] c0        [<ffffffc0006598e0>] gadget_dev_desc_UDC_store+0xd4/0x128
      [   52.642369] c0        [<ffffffc0002c14e8>] configfs_write_file+0xd0/0x13c
      [   52.642376] c0        [<ffffffc00023c054>] vfs_write+0xb8/0x214
      [   52.642381] c0        [<ffffffc00023cad4>] SyS_write+0x54/0xb0
      [   52.642388] c0        [<ffffffc000085ff0>] el0_svc_naked+0x24/0x28
      [   52.642395] c0
                    -> #0 (udc_lock){+.+.+.}:
      [   52.642401] c0        [<ffffffc00011e3d0>] print_circular_bug+0x84/0x2e4
      [   52.642407] c0        [<ffffffc000123454>] __lock_acquire+0x2138/0x2238
      [   52.642412] c0        [<ffffffc000123b54>] lock_acquire+0xe4/0x298
      [   52.642420] c0        [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
      [   52.642427] c0        [<ffffffc00065aeec>] usb_gadget_unregister_driver+0x3c/0xc8
      [   52.642432] c0        [<ffffffc00065995c>] unregister_gadget_item+0x28/0x44
      [   52.642439] c0        [<ffffffc00066b34c>] ffs_data_clear+0x138/0x140
      [   52.642444] c0        [<ffffffc00066b374>] ffs_data_reset+0x20/0x6c
      [   52.642450] c0        [<ffffffc00066efd0>] ffs_data_closed+0xac/0x12c
      [   52.642454] c0        [<ffffffc00066f070>] ffs_ep0_release+0x20/0x2c
      [   52.642460] c0        [<ffffffc00023dbe4>] __fput+0xb0/0x1f4
      [   52.642466] c0        [<ffffffc00023dd9c>] ____fput+0x20/0x2c
      [   52.642473] c0        [<ffffffc0000ee944>] task_work_run+0xb4/0xe8
      [   52.642482] c0        [<ffffffc0000cd45c>] do_exit+0x360/0xb9c
      [   52.642487] c0        [<ffffffc0000cf228>] do_group_exit+0x4c/0xb0
      [   52.642494] c0        [<ffffffc0000dd3c8>] get_signal+0x380/0x89c
      [   52.642501] c0        [<ffffffc00008a8f0>] do_signal+0x154/0x518
      [   52.642507] c0        [<ffffffc00008af00>] do_notify_resume+0x70/0x78
      [   52.642512] c0        [<ffffffc000085ee8>] work_pending+0x1c/0x20
      [   52.642514] c1
                    other info that might help us debug this:
      [   52.642517] c1  Possible unsafe locking scenario:
      [   52.642518] c1        CPU0                    CPU1
      [   52.642520] c1        ----                    ----
      [   52.642525] c0   lock(ffs_lock);
      [   52.642529] c0                                lock(udc_lock);
      [   52.642533] c0                                lock(ffs_lock);
      [   52.642537] c0   lock(udc_lock);
      [   52.642539] c1
                            *** DEADLOCK ***
      [   52.642543] c1 1 lock held by usb ffs open/2808:
      [   52.642555] c0  #0:  (ffs_lock){+.+.+.}, at: [<ffffffc00066b244>]
      		ffs_data_clear+0x30/0x140
      [   52.642557] c1 stack backtrace:
      [   52.642563] c1 CPU: 1 PID: 2808 Comm: usb ffs open Tainted: G
      [   52.642565] c1 Hardware name: Spreadtrum SP9860g Board (DT)
      [   52.642568] c1 Call trace:
      [   52.642573] c1 [<ffffffc00008b430>] dump_backtrace+0x0/0x170
      [   52.642577] c1 [<ffffffc00008b5c0>] show_stack+0x20/0x28
      [   52.642583] c1 [<ffffffc000422694>] dump_stack+0xa8/0xe0
      [   52.642587] c1 [<ffffffc00011e548>] print_circular_bug+0x1fc/0x2e4
      [   52.642591] c1 [<ffffffc000123454>] __lock_acquire+0x2138/0x2238
      [   52.642595] c1 [<ffffffc000123b54>] lock_acquire+0xe4/0x298
      [   52.642599] c1 [<ffffffc000aaf6e8>] mutex_lock_nested+0x7c/0x3cc
      [   52.642604] c1 [<ffffffc00065aeec>] usb_gadget_unregister_driver+0x3c/0xc8
      [   52.642608] c1 [<ffffffc00065995c>] unregister_gadget_item+0x28/0x44
      [   52.642613] c1 [<ffffffc00066b34c>] ffs_data_clear+0x138/0x140
      [   52.642618] c1 [<ffffffc00066b374>] ffs_data_reset+0x20/0x6c
      [   52.642621] c1 [<ffffffc00066efd0>] ffs_data_closed+0xac/0x12c
      [   52.642625] c1 [<ffffffc00066f070>] ffs_ep0_release+0x20/0x2c
      [   52.642629] c1 [<ffffffc00023dbe4>] __fput+0xb0/0x1f4
      [   52.642633] c1 [<ffffffc00023dd9c>] ____fput+0x20/0x2c
      [   52.642636] c1 [<ffffffc0000ee944>] task_work_run+0xb4/0xe8
      [   52.642640] c1 [<ffffffc0000cd45c>] do_exit+0x360/0xb9c
      [   52.642644] c1 [<ffffffc0000cf228>] do_group_exit+0x4c/0xb0
      [   52.642647] c1 [<ffffffc0000dd3c8>] get_signal+0x380/0x89c
      [   52.642651] c1 [<ffffffc00008a8f0>] do_signal+0x154/0x518
      [   52.642656] c1 [<ffffffc00008af00>] do_notify_resume+0x70/0x78
      [   52.642659] c1 [<ffffffc000085ee8>] work_pending+0x1c/0x20
      Acked-by: default avatarMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: default avatarBaolin Wang <baolin.wang@linaro.org>
      Signed-off-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      Cc: Jerry Zhang <zhangjerry@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d2da8d39
    • Baoquan He's avatar
      x86/mm: Fix boot crash caused by incorrect loop count calculation in sync_global_pgds() · ed96148d
      Baoquan He authored
      commit fc5f9d5f upstream.
      
      Jeff Moyer reported that on his system with two memory regions 0~64G and
      1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling KASLR
      will make the system hang intermittently during boot. While adding 'nokaslr'
      won't.
      
      The back trace is:
      
       Oops: 0000 [#1] SMP
      
       RIP: memcpy_erms()
       [ .... ]
       Call Trace:
        pmem_rw_page()
        bdev_read_page()
        do_mpage_readpage()
        mpage_readpages()
        blkdev_readpages()
        __do_page_cache_readahead()
        force_page_cache_readahead()
        page_cache_sync_readahead()
        generic_file_read_iter()
        blkdev_read_iter()
        __vfs_read()
        vfs_read()
        SyS_read()
        entry_SYSCALL_64_fastpath()
      
      This crash happens because the for loop count calculation in sync_global_pgds()
      is not correct. When a mapping area crosses PGD entries, we should
      calculate the starting address of region which next PGD covers and assign
      it to next for loop count, but not add PGDIR_SIZE directly. The old
      code works right only if the mapping area is an exact multiple of PGDIR_SIZE,
      otherwize the end region could be skipped so that it can't be synchronized
      to all other processes from kernel PGD init_mm.pgd.
      
      In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than
      PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it
      makes this area be mapped inside one PGD entry. With KASLR enabled,
      this area could cross two PGD entries, then the next PGD entry won't
      be synced to all other processes. That is why we saw empty PGD.
      
      Fix it.
      Reported-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarBaoquan He <bhe@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed96148d
    • Vallish Vaidyeshwara's avatar
      dm thin: do not queue freed thin mapping for next stage processing · 1c0fa383
      Vallish Vaidyeshwara authored
      commit 00a0ea33 upstream.
      
      process_prepared_discard_passdown_pt1() should cleanup
      dm_thin_new_mapping in cases of error.
      
      dm_pool_inc_data_range() can fail trying to get a block reference:
      
      metadata operation 'dm_pool_inc_data_range' failed: error = -61
      
      When dm_pool_inc_data_range() fails, dm thin aborts current metadata
      transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
      is released as well. However, current thin mapping will be queued
      onto next stage as part of queue_passdown_pt2() or passdown_endio().
      This dangling thin mapping memory when processed and accessed in
      next stage will lead to device mapper crashing.
      
      Code flow without fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
         -> dm_pool_inc_data_range() fails, frees memory m
                  but does not remove it from next stage queue
      
      -> process_prepared_discard_passdown_pt2(m)
         -> processes freed memory m and crashes
      
      One such stack:
      
      Call Trace:
      [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
      [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
      [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
      [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
      [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
      [<ffffffff8152bf54>] ? __schedule+0x244/0x680
      [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
      [<ffffffff81089f53>] process_one_work+0x153/0x3f0
      [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
      [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
      [<ffffffff8108fd6a>] kthread+0xca/0xe0
      [<ffffffff8108fca0>] ? kthread_park+0x60/0x60
      [<ffffffff81530b45>] ret_from_fork+0x25/0x30
      
      The fix is to first take the block ref count for discarded block and
      then do a passdown discard of this block. If block ref count fails,
      then bail out aborting current metadata transaction, mark pool as
      PM_READ_ONLY and also free current thin mapping memory (existing error
      handling code) without queueing this thin mapping onto next stage of
      processing. If block ref count succeeds, then passdown discard of this
      block. Discard callback of passdown_endio() will queue this thin mapping
      onto next stage of processing.
      
      Code flow with fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> dm_pool_inc_data_range()
            --> if fails, free memory m and bail out
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
      Reviewed-by: default avatarEduardo Valentin <eduval@amazon.com>
      Reviewed-by: default avatarCristian Gafton <gafton@amazon.com>
      Reviewed-by: default avatarAnchal Agarwal <anchalag@amazon.com>
      Signed-off-by: default avatarVallish Vaidyeshwara <vallish@amazon.com>
      Reviewed-by: default avatarJoe Thornber <ejt@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1c0fa383
    • Deepak Rawat's avatar
      drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr · 466877f2
      Deepak Rawat authored
      commit 82fcee52 upstream.
      
      The hash table created during vmw_cmdbuf_res_man_create was
      never freed. This causes memory leak in context creation.
      Added the corresponding drm_ht_remove in vmw_cmdbuf_res_man_destroy.
      
      Tested for memory leak by running piglit overnight and kernel
      memory is not inflated which earlier was.
      Signed-off-by: default avatarDeepak Rawat <drawat@vmware.com>
      Reviewed-by: default avatarSinclair Yeh <syeh@vmware.com>
      Signed-off-by: default avatarThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      466877f2
    • Bartosz Golaszewski's avatar
      gpiolib: fix filtering out unwanted events · 78c4244f
      Bartosz Golaszewski authored
      commit ad537b82 upstream.
      
      GPIOEVENT_REQUEST_BOTH_EDGES is not a single flag, but a binary OR of
      GPIOEVENT_REQUEST_RISING_EDGE and GPIOEVENT_REQUEST_FALLING_EDGE.
      
      The expression 'le->eflags & GPIOEVENT_REQUEST_BOTH_EDGES' we'll get
      evaluated to true even if only one event type was requested.
      
      Fix it by checking both RISING & FALLING flags explicitly.
      
      Fixes: 61f922db ("gpio: userspace ABI for reading GPIO line events")
      Signed-off-by: default avatarBartosz Golaszewski <brgl@bgdev.pl>
      Signed-off-by: default avatarLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      78c4244f
    • Trond Myklebust's avatar
      NFSv4.1: Fix a race in nfs4_proc_layoutget · cb2c6fdf
      Trond Myklebust authored
      commit bd171930 upstream.
      
      If the task calling layoutget is signalled, then it is possible for the
      calls to nfs4_sequence_free_slot() and nfs4_layoutget_prepare() to race,
      in which case we leak a slot.
      The fix is to move the call to nfs4_sequence_free_slot() into the
      nfs4_layoutget_release() so that it gets called at task teardown time.
      
      Fixes: 2e80dbe7 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb2c6fdf
    • Hui Wang's avatar
      ALSA: hda - set input_path bitmap to zero after moving it to new place · 7d0e27fe
      Hui Wang authored
      commit a8f20fd2 upstream.
      
      Recently we met a problem, the codec has valid adcs and input pins,
      and they can form valid input paths, but the driver does not build
      valid controls for them like "Mic boost", "Capture Volume" and
      "Capture Switch".
      
      Through debugging, I found the driver needs to shrink the invalid
      adcs and input paths for this machine, so it will move the whole
      column bitmap value to the previous column, after moving it, the
      driver forgets to set the original column bitmap value to zero, as a
      result, the driver will invalidate the path whose index value is the
      original colume bitmap value. After executing this function, all
      valid input paths are invalidated by a mistake, there are no any
      valid input paths, so the driver won't build controls for them.
      
      Fixes: 3a65bcdc ("ALSA: hda - Fix inconsistent input_paths after ADC reduction")
      Signed-off-by: default avatarHui Wang <hui.wang@canonical.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7d0e27fe
    • Takashi Iwai's avatar
      ALSA: hda - Fix endless loop of codec configure · 093750c3
      Takashi Iwai authored
      commit d94815f9 upstream.
      
      azx_codec_configure() loops over the codecs found on the given
      controller via a linked list.  The code used to work in the past, but
      in the current version, this may lead to an endless loop when a codec
      binding returns an error.
      
      The culprit is that the snd_hda_codec_configure() unregisters the
      device upon error, and this eventually deletes the given codec object
      from the bus.  Since the list is initialized via list_del_init(), the
      next object points to the same device itself.  This behavior change
      was introduced at splitting the HD-audio code code, and forgotten to
      adapt it here.
      
      For fixing this bug, just use a *_safe() version of list iteration.
      
      Fixes: d068ebc2 ("ALSA: hda - Move some codes up to hdac_bus struct")
      Reported-by: default avatarDaniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      093750c3
    • Paul Burton's avatar
      MIPS: Fix IRQ tracing & lockdep when rescheduling · dad3135e
      Paul Burton authored
      commit d8550860 upstream.
      
      When the scheduler sets TIF_NEED_RESCHED & we call into the scheduler
      from arch/mips/kernel/entry.S we disable interrupts. This is true
      regardless of whether we reach work_resched from syscall_exit_work,
      resume_userspace or by looping after calling schedule(). Although we
      disable interrupts in these paths we don't call trace_hardirqs_off()
      before calling into C code which may acquire locks, and we therefore
      leave lockdep with an inconsistent view of whether interrupts are
      disabled or not when CONFIG_PROVE_LOCKING & CONFIG_DEBUG_LOCKDEP are
      both enabled.
      
      Without tracing this interrupt state lockdep will print warnings such
      as the following once a task returns from a syscall via
      syscall_exit_partial with TIF_NEED_RESCHED set:
      
      [   49.927678] ------------[ cut here ]------------
      [   49.934445] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3687 check_flags.part.41+0x1dc/0x1e8
      [   49.946031] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [   49.946355] CPU: 0 PID: 1 Comm: init Not tainted 4.10.0-00439-gc9fd5d362289-dirty #197
      [   49.963505] Stack : 0000000000000000 ffffffff81bb5d6a 0000000000000006 ffffffff801ce9c4
      [   49.974431]         0000000000000000 0000000000000000 0000000000000000 000000000000004a
      [   49.985300]         ffffffff80b7e487 ffffffff80a24498 a8000000ff160000 ffffffff80ede8b8
      [   49.996194]         0000000000000001 0000000000000000 0000000000000000 0000000077c8030c
      [   50.007063]         000000007fd8a510 ffffffff801cd45c 0000000000000000 a8000000ff127c88
      [   50.017945]         0000000000000000 ffffffff801cf928 0000000000000001 ffffffff80a24498
      [   50.028827]         0000000000000000 0000000000000001 0000000000000000 0000000000000000
      [   50.039688]         0000000000000000 a8000000ff127bd0 0000000000000000 ffffffff805509bc
      [   50.050575]         00000000140084e0 0000000000000000 0000000000000000 0000000000040a00
      [   50.061448]         0000000000000000 ffffffff8010e1b0 0000000000000000 ffffffff805509bc
      [   50.072327]         ...
      [   50.076087] Call Trace:
      [   50.079869] [<ffffffff8010e1b0>] show_stack+0x80/0xa8
      [   50.086577] [<ffffffff805509bc>] dump_stack+0x10c/0x190
      [   50.093498] [<ffffffff8015dde0>] __warn+0xf0/0x108
      [   50.099889] [<ffffffff8015de34>] warn_slowpath_fmt+0x3c/0x48
      [   50.107241] [<ffffffff801c15b4>] check_flags.part.41+0x1dc/0x1e8
      [   50.114961] [<ffffffff801c239c>] lock_is_held_type+0x8c/0xb0
      [   50.122291] [<ffffffff809461b8>] __schedule+0x8c0/0x10f8
      [   50.129221] [<ffffffff80946a60>] schedule+0x30/0x98
      [   50.135659] [<ffffffff80106278>] work_resched+0x8/0x34
      [   50.142397] ---[ end trace 0cb4f6ef5b99fe21 ]---
      [   50.148405] possible reason: unannotated irqs-off.
      [   50.154600] irq event stamp: 400463
      [   50.159566] hardirqs last  enabled at (400463): [<ffffffff8094edc8>] _raw_spin_unlock_irqrestore+0x40/0xa8
      [   50.171981] hardirqs last disabled at (400462): [<ffffffff8094eb98>] _raw_spin_lock_irqsave+0x30/0xb0
      [   50.183897] softirqs last  enabled at (400450): [<ffffffff8016580c>] __do_softirq+0x4ac/0x6a8
      [   50.195015] softirqs last disabled at (400425): [<ffffffff80165e78>] irq_exit+0x110/0x128
      
      Fix this by using the TRACE_IRQS_OFF macro to call trace_hardirqs_off()
      when CONFIG_TRACE_IRQFLAGS is enabled. This is done before invoking
      schedule() following the work_resched label because:
      
       1) Interrupts are disabled regardless of the path we take to reach
          work_resched() & schedule().
      
       2) Performing the tracing here avoids the need to do it in paths which
          disable interrupts but don't call out to C code before hitting a
          path which uses the RESTORE_SOME macro that will call
          trace_hardirqs_on() or trace_hardirqs_off() as appropriate.
      
      We call trace_hardirqs_on() using the TRACE_IRQS_ON macro before calling
      syscall_trace_leave() for similar reasons, ensuring that lockdep has a
      consistent view of state after we re-enable interrupts.
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15385/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dad3135e
    • Paul Burton's avatar
      MIPS: pm-cps: Drop manual cache-line alignment of ready_count · e9e24faf
      Paul Burton authored
      commit 161c51cc upstream.
      
      We allocate memory for a ready_count variable per-CPU, which is accessed
      via a cached non-coherent TLB mapping to perform synchronisation between
      threads within the core using LL/SC instructions. In order to ensure
      that the variable is contained within its own data cache line we
      allocate 2 lines worth of memory & align the resulting pointer to a line
      boundary. This is however unnecessary, since kmalloc is guaranteed to
      return memory which is at least cache-line aligned (see
      ARCH_DMA_MINALIGN). Stop the redundant manual alignment.
      
      Besides cleaning up the code & avoiding needless work, this has the side
      effect of avoiding an arithmetic error found by Bryan on 64 bit systems
      due to the 32 bit size of the former dlinesz. This led the ready_count
      variable to have its upper 32b cleared erroneously for MIPS64 kernels,
      causing problems when ready_count was later used on MIPS64 via cpuidle.
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Fixes: 3179d37e ("MIPS: pm-cps: add PM state entry code for CPS systems")
      Reported-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Reviewed-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Tested-by: default avatarBryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/15383/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9e24faf
    • James Hogan's avatar
      MIPS: Avoid accidental raw backtrace · f7d3d40e
      James Hogan authored
      commit 85423636 upstream.
      
      Since commit 81a76d71 ("MIPS: Avoid using unwind_stack() with
      usermode") show_backtrace() invokes the raw backtracer when
      cp0_status & ST0_KSU indicates user mode to fix issues on EVA kernels
      where user and kernel address spaces overlap.
      
      However this is used by show_stack() which creates its own pt_regs on
      the stack and leaves cp0_status uninitialised in most of the code paths.
      This results in the non deterministic use of the raw back tracer
      depending on the previous stack content.
      
      show_stack() deals exclusively with kernel mode stacks anyway, so
      explicitly initialise regs.cp0_status to KSU_KERNEL (i.e. 0) to ensure
      we get a useful backtrace.
      
      Fixes: 81a76d71 ("MIPS: Avoid using unwind_stack() with usermode")
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16656/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f7d3d40e
    • Karl Beldan's avatar
      MIPS: head: Reorder instructions missing a delay slot · 3d4ac49a
      Karl Beldan authored
      commit 25d8b92e upstream.
      
      In this sequence the 'move' is assumed in the delay slot of the 'beq',
      but head.S is in reorder mode and the former gets pushed one 'nop'
      farther by the assembler.
      
      The corrected behavior made booting with an UHI supplied dtb erratic.
      
      Fixes: 15f37e15 ("MIPS: store the appended dtb address in a variable")
      Signed-off-by: default avatarKarl Beldan <karl.beldan+oss@gmail.com>
      Reviewed-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Jonas Gorski <jogo@openwrt.org>
      Cc: linux-mips@linux-mips.org
      Cc: linux-kernel@vger.kernel.org
      Patchwork: https://patchwork.linux-mips.org/patch/16614/Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d4ac49a
    • David Rientjes's avatar
      mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff() · b1355226
      David Rientjes authored
      commit 460bcec8 upstream.
      
      We got need_resched() warnings in swap_cgroup_swapoff() because
      swap_cgroup_ctrl[type].length is particularly large.
      
      Reschedule when needed.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1704061315270.80559@chino.kir.corp.google.comSigned-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1355226
    • Russell Currey's avatar
      drm/ast: Handle configuration without P2A bridge · dbc80836
      Russell Currey authored
      commit 71f677a9 upstream.
      
      The ast driver configures a window to enable access into BMC
      memory space in order to read some configuration registers.
      
      If this window is disabled, which it can be from the BMC side,
      the ast driver can't function.
      
      Closing this window is a necessity for security if a machine's
      host side and BMC side are controlled by different parties;
      i.e. a cloud provider offering machines "bare metal".
      
      A recent patch went in to try to check if that window is open
      but it does so by trying to access the registers in question
      and testing if the result is 0xffffffff.
      
      This method will trigger a PCIe error when the window is closed
      which on some systems will be fatal (it will trigger an EEH
      for example on POWER which will take out the device).
      
      This patch improves this in two ways:
      
       - First, if the firmware has put properties in the device-tree
      containing the relevant configuration information, we use these.
      
       - Otherwise, a bit in one of the SCU scratch registers (which
      are readable via the VGA register space and writeable by the BMC)
      will indicate if the BMC has closed the window. This bit has been
      defined by Y.C Chen from Aspeed.
      
      If the window is closed and the configuration isn't available from
      the device-tree, some sane defaults are used. Those defaults are
      hopefully sufficient for standard video modes used on a server.
      Signed-off-by: default avatarRussell Currey <ruscur@russell.cc>
      Acked-by: default avatarJoel Stanley <joel@jms.id.au>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dbc80836
    • Juergen Gross's avatar
      xen/blkback: don't use xen_blkif_get() in xen-blkback kthread · 8dc9f9de
      Juergen Gross authored
      commit a24fa22c upstream.
      
      There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread
      of xen-blkback. Thread stopping is synchronous and using the blkif
      reference counting in the kthread will avoid to ever let the reference
      count drop to zero at the end of an I/O running concurrent to
      disconnecting and multiple rings.
      
      Setting ring->xenblkd to NULL after stopping the kthread isn't needed
      as the kthread does this already.
      Signed-off-by: default avatarJuergen Gross <jgross@suse.com>
      Tested-by: default avatarSteven Haigh <netwiz@crc.id.au>
      Acked-by: default avatarRoger Pau Monné <roger.pau@citrix.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8dc9f9de
    • Kinglong Mee's avatar
      NFSv4.x/callback: Create the callback service through svc_create_pooled · 4ebe28d2
      Kinglong Mee authored
      commit df807fff upstream.
      
      As the comments for svc_set_num_threads() said,
      " Destroying threads relies on the service threads filling in
      rqstp->rq_task, which only the nfs ones do.  Assumes the serv
      has been created using svc_create_pooled()."
      
      If creating service through svc_create(), the svc_pool_map_put()
      will be called in svc_destroy(), but the pool map isn't used.
      So that, the reference of pool map will be drop, the next using
      of pool map will get a zero npools.
      
      [  137.992130] divide error: 0000 [#1] SMP
      [  137.992148] Modules linked in: nfsd(E) nfsv4 nfs fscache fuse tun bridge stp llc ip_set nfnetlink vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event vmw_balloon coretemp crct10dif_pclmul crc32_pclmul ppdev ghash_clmulni_intel intel_rapl_perf joydev snd_ens1371 gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore parport_pc parport nfit acpi_cpufreq tpm_tis tpm_tis_core tpm vmw_vmci i2c_piix4 shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm crc32c_intel drm e1000 mptspi scsi_transport_spi serio_raw mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
      [  137.992336] CPU: 0 PID: 4514 Comm: rpc.nfsd Tainted: G            E   4.11.0-rc8+ #536
      [  137.992777] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  137.993757] task: ffff955984101d00 task.stack: ffff9873c2604000
      [  137.994231] RIP: 0010:svc_pool_for_cpu+0x2b/0x80 [sunrpc]
      [  137.994768] RSP: 0018:ffff9873c2607c18 EFLAGS: 00010246
      [  137.995227] RAX: 0000000000000000 RBX: ffff95598376f000 RCX: 0000000000000002
      [  137.995673] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9559944aec00
      [  137.996156] RBP: ffff9873c2607c18 R08: ffff9559944aec28 R09: 0000000000000000
      [  137.996609] R10: 0000000001080002 R11: 0000000000000000 R12: ffff95598376f010
      [  137.997063] R13: ffff95598376f018 R14: ffff9559944aec28 R15: ffff9559944aec00
      [  137.997584] FS:  00007f755529eb40(0000) GS:ffff9559bb600000(0000) knlGS:0000000000000000
      [  137.998048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  137.998548] CR2: 000055f3aecd9660 CR3: 0000000084290000 CR4: 00000000001406f0
      [  137.999052] Call Trace:
      [  137.999517]  svc_xprt_do_enqueue+0xef/0x260 [sunrpc]
      [  138.000028]  svc_xprt_received+0x47/0x90 [sunrpc]
      [  138.000487]  svc_add_new_perm_xprt+0x76/0x90 [sunrpc]
      [  138.000981]  svc_addsock+0x14b/0x200 [sunrpc]
      [  138.001424]  ? recalc_sigpending+0x1b/0x50
      [  138.001860]  ? __getnstimeofday64+0x41/0xd0
      [  138.002346]  ? do_gettimeofday+0x29/0x90
      [  138.002779]  write_ports+0x255/0x2c0 [nfsd]
      [  138.003202]  ? _copy_from_user+0x4e/0x80
      [  138.003676]  ? write_recoverydir+0x100/0x100 [nfsd]
      [  138.004098]  nfsctl_transaction_write+0x48/0x80 [nfsd]
      [  138.004544]  __vfs_write+0x37/0x160
      [  138.004982]  ? selinux_file_permission+0xd7/0x110
      [  138.005401]  ? security_file_permission+0x3b/0xc0
      [  138.005865]  vfs_write+0xb5/0x1a0
      [  138.006267]  SyS_write+0x55/0xc0
      [  138.006654]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [  138.007071] RIP: 0033:0x7f7554b9dc30
      [  138.007437] RSP: 002b:00007ffc9f92c788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  138.007807] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f7554b9dc30
      [  138.008168] RDX: 0000000000000002 RSI: 00005640cd536640 RDI: 0000000000000003
      [  138.008573] RBP: 00007ffc9f92c780 R08: 0000000000000001 R09: 0000000000000002
      [  138.008918] R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000004
      [  138.009254] R13: 00005640cdbf77a0 R14: 00005640cdbf7720 R15: 00007ffc9f92c238
      [  138.009610] Code: 0f 1f 44 00 00 48 8b 87 98 00 00 00 55 48 89 e5 48 83 78 08 00 74 10 8b 05 07 42 02 00 83 f8 01 74 40 83 f8 02 74 19 31 c0 31 d2 <f7> b7 88 00 00 00 5d 89 d0 48 c1 e0 07 48 03 87 90 00 00 00 c3
      [  138.010664] RIP: svc_pool_for_cpu+0x2b/0x80 [sunrpc] RSP: ffff9873c2607c18
      [  138.011061] ---[ end trace b3468224cafa7d11 ]---
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ebe28d2
    • Kinglong Mee's avatar
      NFSv4: fix a reference leak caused WARNING messages · 955f270b
      Kinglong Mee authored
      commit 366a1569 upstream.
      
      Because nfs4_opendata_access() has close the state when access is denied,
      so the state isn't leak.
      Rather than revert the commit a974deee, I'd like clean the strange state close.
      
      [ 1615.094218] ------------[ cut here ]------------
      [ 1615.094607] WARNING: CPU: 0 PID: 23702 at lib/list_debug.c:31 __list_add_valid+0x8e/0xa0
      [ 1615.094913] list_add double add: new=ffff9d7901d9f608, prev=ffff9d7901d9f608, next=ffff9d7901ee8dd0.
      [ 1615.095458] Modules linked in: nfsv4(E) nfs(E) nfsd(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock f2fs snd_seq_midi snd_seq_midi_event fscrypto coretemp ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf vmw_balloon snd_ens1371 joydev gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs]
      [ 1615.097663] CPU: 0 PID: 23702 Comm: fstest Tainted: G        W   E   4.11.0-rc1+ #517
      [ 1615.098015] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [ 1615.098807] Call Trace:
      [ 1615.099183]  dump_stack+0x63/0x86
      [ 1615.099578]  __warn+0xcb/0xf0
      [ 1615.099967]  warn_slowpath_fmt+0x5f/0x80
      [ 1615.100370]  __list_add_valid+0x8e/0xa0
      [ 1615.100760]  nfs4_put_state_owner+0x75/0xc0 [nfsv4]
      [ 1615.101136]  __nfs4_close+0x109/0x140 [nfsv4]
      [ 1615.101524]  nfs4_close_state+0x15/0x20 [nfsv4]
      [ 1615.101949]  nfs4_close_context+0x21/0x30 [nfsv4]
      [ 1615.102691]  __put_nfs_open_context+0xb8/0x110 [nfs]
      [ 1615.103155]  put_nfs_open_context+0x10/0x20 [nfs]
      [ 1615.103586]  nfs4_file_open+0x13b/0x260 [nfsv4]
      [ 1615.103978]  do_dentry_open+0x20a/0x2f0
      [ 1615.104369]  ? nfs4_copy_file_range+0x30/0x30 [nfsv4]
      [ 1615.104739]  vfs_open+0x4c/0x70
      [ 1615.105106]  ? may_open+0x5a/0x100
      [ 1615.105469]  path_openat+0x623/0x1420
      [ 1615.105823]  do_filp_open+0x91/0x100
      [ 1615.106174]  ? __alloc_fd+0x3f/0x170
      [ 1615.106568]  do_sys_open+0x130/0x220
      [ 1615.106920]  ? __put_cred+0x3d/0x50
      [ 1615.107256]  SyS_open+0x1e/0x20
      [ 1615.107588]  entry_SYSCALL_64_fastpath+0x1a/0xa9
      [ 1615.107922] RIP: 0033:0x7fab599069b0
      [ 1615.108247] RSP: 002b:00007ffcf0600d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
      [ 1615.108575] RAX: ffffffffffffffda RBX: 00007fab59bcfae0 RCX: 00007fab599069b0
      [ 1615.108896] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 00007ffcf060255e
      [ 1615.109211] RBP: 0000000000040010 R08: 0000000000000000 R09: 0000000000000016
      [ 1615.109515] R10: 00000000000006a1 R11: 0000000000000246 R12: 0000000000041000
      [ 1615.109806] R13: 0000000000040010 R14: 0000000000001000 R15: 0000000000002710
      [ 1615.110152] ---[ end trace 96ed63b1306bf2f3 ]---
      
      Fixes: a974deee ("NFSv4: Fix memory and state leak in...")
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      955f270b
    • Eric Leblond's avatar
      netfilter: synproxy: fix conntrackd interaction · b89bd0c7
      Eric Leblond authored
      commit 87e94dbc upstream.
      
      This patch fixes the creation of connection tracking entry from
      netlink when synproxy is used. It was missing the addition of
      the synproxy extension.
      
      This was causing kernel crashes when a conntrack entry created by
      conntrackd was used after the switch of traffic from active node
      to the passive node.
      Signed-off-by: default avatarEric Leblond <eric@regit.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b89bd0c7
    • Eric Dumazet's avatar
      netfilter: xt_TCPMSS: add more sanity tests on tcph->doff · ced7689b
      Eric Dumazet authored
      commit 2638fd0f upstream.
      
      Denys provided an awesome KASAN report pointing to an use
      after free in xt_TCPMSS
      
      I have provided three patches to fix this issue, either in xt_TCPMSS or
      in xt_tcpudp.c. It seems xt_TCPMSS patch has the smallest possible
      impact.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDenys Fedoryshchenko <nuclearcat@nuclearcat.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ced7689b
    • Serhey Popovych's avatar
      rtnetlink: add IFLA_GROUP to ifla_policy · 8e231639
      Serhey Popovych authored
      
      [ Upstream commit db833d40 ]
      
      Network interface groups support added while ago, however
      there is no IFLA_GROUP attribute description in policy
      and netlink message size calculations until now.
      
      Add IFLA_GROUP attribute to the policy.
      
      Fixes: cbda10fa ("net_device: add support for network device groups")
      Signed-off-by: default avatarSerhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8e231639
    • Serhey Popovych's avatar
      ipv6: Do not leak throw route references · b9ca9b0f
      Serhey Popovych authored
      
      [ Upstream commit 07f61557 ]
      
      While commit 73ba57bf ("ipv6: fix backtracking for throw routes")
      does good job on error propagation to the fib_rules_lookup()
      in fib rules core framework that also corrects throw routes
      handling, it does not solve route reference leakage problem
      happened when we return -EAGAIN to the fib_rules_lookup()
      and leave routing table entry referenced in arg->result.
      
      If rule with matched throw route isn't last matched in the
      list we overwrite arg->result losing reference on throw
      route stored previously forever.
      
      We also partially revert commit ab997ad4 ("ipv6: fix the
      incorrect return value of throw route") since we never return
      routing table entry with dst.error == -EAGAIN when
      CONFIG_IPV6_MULTIPLE_TABLES is on. Also there is no point
      to check for RTF_REJECT flag since it is always set throw
      route.
      
      Fixes: 73ba57bf ("ipv6: fix backtracking for throw routes")
      Signed-off-by: default avatarSerhey Popovych <serhe.popovych@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9ca9b0f
    • Bert Kenward's avatar
      sfc: provide dummy definitions of vswitch functions · e4089baa
      Bert Kenward authored
      
      efx_probe_all() calls efx->type->vswitching_probe during probe. For
      SFC4000 (Falcon) NICs this function is not defined, leading to a BUG
      with the top of the call stack similar to:
        ? efx_pci_probe_main+0x29a/0x830
        efx_pci_probe+0x7d3/0xe70
      
      vswitching_restore and vswitching_remove also need to be defined.
      
      Fixed in mainline by:
      commit 5a6681e2 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver")
      
      Fixes: 6d8aaaf6 ("sfc: create VEB vswitch and vport above default firmware setup")
      Signed-off-by: default avatarBert Kenward <bkenward@solarflare.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e4089baa
    • Gao Feng's avatar
      net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev · 08058c25
      Gao Feng authored
      
      [ Upstream commit 9745e362 ]
      
      The register_vlan_device would invoke free_netdev directly, when
      register_vlan_dev failed. It would trigger the BUG_ON in free_netdev
      if the dev was already registered. In this case, the netdev would be
      freed in netdev_run_todo later.
      
      So add one condition check now. Only when dev is not registered, then
      free it directly.
      
      The following is the part coredump when netdev_upper_dev_link failed
      in register_vlan_dev. I removed the lines which are too long.
      
      [  411.237457] ------------[ cut here ]------------
      [  411.237458] kernel BUG at net/core/dev.c:7998!
      [  411.237484] invalid opcode: 0000 [#1] SMP
      [  411.237705]  [last unloaded: 8021q]
      [  411.237718] CPU: 1 PID: 12845 Comm: vconfig Tainted: G            E   4.12.0-rc5+ #6
      [  411.237737] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
      [  411.237764] task: ffff9cbeb6685580 task.stack: ffffa7d2807d8000
      [  411.237782] RIP: 0010:free_netdev+0x116/0x120
      [  411.237794] RSP: 0018:ffffa7d2807dbdb0 EFLAGS: 00010297
      [  411.237808] RAX: 0000000000000002 RBX: ffff9cbeb6ba8fd8 RCX: 0000000000001878
      [  411.237826] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 0000000000000000
      [  411.237844] RBP: ffffa7d2807dbdc8 R08: 0002986100029841 R09: 0002982100029801
      [  411.237861] R10: 0004000100029980 R11: 0004000100029980 R12: ffff9cbeb6ba9000
      [  411.238761] R13: ffff9cbeb6ba9060 R14: ffff9cbe60f1a000 R15: ffff9cbeb6ba9000
      [  411.239518] FS:  00007fb690d81700(0000) GS:ffff9cbebb640000(0000) knlGS:0000000000000000
      [  411.239949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  411.240454] CR2: 00007f7115624000 CR3: 0000000077cdf000 CR4: 00000000003406e0
      [  411.240936] Call Trace:
      [  411.241462]  vlan_ioctl_handler+0x3f1/0x400 [8021q]
      [  411.241910]  sock_ioctl+0x18b/0x2c0
      [  411.242394]  do_vfs_ioctl+0xa1/0x5d0
      [  411.242853]  ? sock_alloc_file+0xa6/0x130
      [  411.243465]  SyS_ioctl+0x79/0x90
      [  411.243900]  entry_SYSCALL_64_fastpath+0x1e/0xa9
      [  411.244425] RIP: 0033:0x7fb69089a357
      [  411.244863] RSP: 002b:00007ffcd04e0fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  411.245445] RAX: ffffffffffffffda RBX: 00007ffcd04e2884 RCX: 00007fb69089a357
      [  411.245903] RDX: 00007ffcd04e0fd0 RSI: 0000000000008983 RDI: 0000000000000003
      [  411.246527] RBP: 00007ffcd04e0fd0 R08: 0000000000000000 R09: 1999999999999999
      [  411.246976] R10: 000000000000053f R11: 0000000000000202 R12: 0000000000000004
      [  411.247414] R13: 00007ffcd04e1128 R14: 00007ffcd04e2888 R15: 0000000000000001
      [  411.249129] RIP: free_netdev+0x116/0x120 RSP: ffffa7d2807dbdb0
      Signed-off-by: default avatarGao Feng <gfree.wind@vip.163.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      08058c25
    • Wei Wang's avatar
      decnet: always not take dst->__refcnt when inserting dst into hash table · f1a0e7d1
      Wei Wang authored
      
      [ Upstream commit 76371d2e ]
      
      In the existing dn_route.c code, dn_route_output_slow() takes
      dst->__refcnt before calling dn_insert_route() while dn_route_input_slow()
      does not take dst->__refcnt before calling dn_insert_route().
      This makes the whole routing code very buggy.
      In dn_dst_check_expire(), dnrt_free() is called when rt expires. This
      makes the routes inserted by dn_route_output_slow() not able to be
      freed as the refcnt is not released.
      In dn_dst_gc(), dnrt_drop() is called to release rt which could
      potentially cause the dst->__refcnt to be dropped to -1.
      In dn_run_flush(), dst_free() is called to release all the dst. Again,
      it makes the dst inserted by dn_route_output_slow() not able to be
      released and also, it does not wait on the rcu and could potentially
      cause crash in the path where other users still refer to this dst.
      
      This patch makes sure both input and output path do not take
      dst->__refcnt before calling dn_insert_route() and also makes sure
      dnrt_free()/dst_free() is called when removing dst from the hash table.
      The only difference between those 2 calls is that dnrt_free() waits on
      the rcu while dst_free() does not.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1a0e7d1
    • Maor Dickman's avatar
      net/mlx5e: Fix timestamping capabilities reporting · c7d422d6
      Maor Dickman authored
      
      [ Upstream commit f0b38117 ]
      
      Misuse of (BIT) macro caused to report wrong flags for
      "Hardware Transmit Timestamp Modes" and "Hardware Receive
      Filter Modes"
      
      Fixes: ef9814de ('net/mlx5e: Add HW timestamping (TS) support')
      Signed-off-by: default avatarMaor Dickman <maord@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c7d422d6
    • Eli Cohen's avatar
      net/mlx5: Wait for FW readiness before initializing command interface · 25ff3507
      Eli Cohen authored
      
      [ Upstream commit 6c780a02 ]
      
      Before attempting to initialize the command interface we must wait till
      the fw_initializing bit is clear.
      
      If we fail to meet this condition the hardware will drop our
      configuration, specifically the descriptors page address.  This scenario
      can happen when the firmware is still executing an FLR flow and did not
      finish yet so the driver needs to wait for that to finish.
      
      Fixes: e3297246 ('net/mlx5_core: Wait for FW readiness on startup')
      Signed-off-by: default avatarEli Cohen <eli@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25ff3507
    • Or Gerlitz's avatar
      net/mlx5e: Avoid doing a cleanup call if the profile doesn't have it · 176b9874
      Or Gerlitz authored
      
      [ Upstream commit 31ac9338 ]
      
      The error flow of mlx5e_create_netdev calls the cleanup call
      of the given profile without checking if it exists, fix that.
      
      Currently the VF reps don't register that callback and we crash
      if getting into error -- can be reproduced by the user doing ctrl^C
      while attempting to change the sriov mode from legacy to switchdev.
      
      Fixes: 26e59d80 '(net/mlx5e: Implement mlx5e interface attach/detach callbacks')
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Reported-by: default avatarSabrina Dubroca <sdubroca@redhat.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      176b9874
    • Xin Long's avatar
      sctp: return next obj by passing pos + 1 into sctp_transport_get_idx · 4c246863
      Xin Long authored
      
      [ Upstream commit 988c7322 ]
      
      In sctp_for_each_transport, pos is used to save how many objs it has
      dumped. Now it gets the last obj by sctp_transport_get_idx, then gets
      the next obj by sctp_transport_get_next.
      
      The issue is that in the meanwhile if some objs in transport hashtable
      are removed and the objs nums are less than pos, sctp_transport_get_idx
      would return NULL and hti.walker.tbl is NULL as well. At this moment
      it should stop hti, instead of continue getting the next obj. Or it
      would cause a NULL pointer dereference in sctp_transport_get_next.
      
      This patch is to pass pos + 1 into sctp_transport_get_idx to get the
      next obj directly, even if pos > objs nums, it would return NULL and
      stop hti.
      
      Fixes: 626d16f5 ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c246863
    • Xin Long's avatar
      ipv6: fix calling in6_ifa_hold incorrectly for dad work · fded2d74
      Xin Long authored
      
      [ Upstream commit f8a894b2 ]
      
      Now when starting the dad work in addrconf_mod_dad_work, if the dad work
      is idle and queued, it needs to hold ifa.
      
      The problem is there's one gap in [1], during which if the pending dad work
      is removed elsewhere. It will miss to hold ifa, but the dad word is still
      idea and queue.
      
              if (!delayed_work_pending(&ifp->dad_work))
                      in6_ifa_hold(ifp);
                          <--------------[1]
              mod_delayed_work(addrconf_wq, &ifp->dad_work, delay);
      
      An use-after-free issue can be caused by this.
      
      Chen Wei found this issue when WARN_ON(!hlist_unhashed(&ifp->addr_lst)) in
      net6_ifa_finish_destroy was hit because of it.
      
      As Hannes' suggestion, this patch is to fix it by holding ifa first in
      addrconf_mod_dad_work, then calling mod_delayed_work and putting ifa if
      the dad_work is already in queue.
      
      Note that this patch did not choose to fix it with:
      
        if (!mod_delayed_work(delay))
                in6_ifa_hold(ifp);
      
      As with it, when delay == 0, dad_work would be scheduled immediately, all
      addrconf_mod_dad_work(0) callings had to be moved under ifp->lock.
      Reported-by: default avatarWei Chen <weichen@redhat.com>
      Suggested-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fded2d74
    • WANG Cong's avatar
      igmp: add a missing spin_lock_init() · cac2a9bb
      WANG Cong authored
      
      [ Upstream commit b4846fc3 ]
      
      Andrey reported a lockdep warning on non-initialized
      spinlock:
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 1 PID: 4099 Comm: a.out Not tainted 4.12.0-rc6+ #9
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:16
        dump_stack+0x292/0x395 lib/dump_stack.c:52
        register_lock_class+0x717/0x1aa0 kernel/locking/lockdep.c:755
        ? 0xffffffffa0000000
        __lock_acquire+0x269/0x3690 kernel/locking/lockdep.c:3255
        lock_acquire+0x22d/0x560 kernel/locking/lockdep.c:3855
        __raw_spin_lock_bh ./include/linux/spinlock_api_smp.h:135
        _raw_spin_lock_bh+0x36/0x50 kernel/locking/spinlock.c:175
        spin_lock_bh ./include/linux/spinlock.h:304
        ip_mc_clear_src+0x27/0x1e0 net/ipv4/igmp.c:2076
        igmpv3_clear_delrec+0xee/0x4f0 net/ipv4/igmp.c:1194
        ip_mc_destroy_dev+0x4e/0x190 net/ipv4/igmp.c:1736
      
      We miss a spin_lock_init() in igmpv3_add_delrec(), probably
      because previously we never use it on this code path. Since
      we already unlink it from the global mc_tomb list, it is
      probably safe not to acquire this spinlock here. It does not
      harm to have it although, to avoid conditional locking.
      
      Fixes: c38b7d32 ("igmp: acquire pmc lock for ip_mc_clear_src()")
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cac2a9bb
    • WANG Cong's avatar
      igmp: acquire pmc lock for ip_mc_clear_src() · ecd6627f
      WANG Cong authored
      
      [ Upstream commit c38b7d32 ]
      
      Andrey reported a use-after-free in add_grec():
      
              for (psf = *psf_list; psf; psf = psf_next) {
      		...
                      psf_next = psf->sf_next;
      
      where the struct ip_sf_list's were already freed by:
      
       kfree+0xe8/0x2b0 mm/slub.c:3882
       ip_mc_clear_src+0x69/0x1c0 net/ipv4/igmp.c:2078
       ip_mc_dec_group+0x19a/0x470 net/ipv4/igmp.c:1618
       ip_mc_drop_socket+0x145/0x230 net/ipv4/igmp.c:2609
       inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411
       sock_release+0x8d/0x1e0 net/socket.c:597
       sock_close+0x16/0x20 net/socket.c:1072
      
      This happens because we don't hold pmc->lock in ip_mc_clear_src()
      and a parallel mr_ifc_timer timer could jump in and access them.
      
      The RCU lock is there but it is merely for pmc itself, this
      spinlock could actually ensure we don't access them in parallel.
      
      Thanks to Eric and Long for discussion on this bug.
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecd6627f
    • Christian Perle's avatar
      proc: snmp6: Use correct type in memset · 05968675
      Christian Perle authored
      
      [ Upstream commit 3500cd73 ]
      
      Reading /proc/net/snmp6 yields bogus values on 32 bit kernels.
      Use "u64" instead of "unsigned long" in sizeof().
      
      Fixes: 4a4857b1 ("proc: Reduce cache miss in snmp6_seq_show")
      Signed-off-by: default avatarChristian Perle <christian.perle@secunet.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      05968675
    • Tal Gilboa's avatar
      net/mlx5e: Fix wrong indications in DIM due to counter wraparound · 78b24ab6
      Tal Gilboa authored
      
      [ Upstream commit 53acd76c ]
      
      DIM (Dynamically-tuned Interrupt Moderation) is a mechanism designed for
      changing the channel interrupt moderation values in order to reduce CPU
      overhead for all traffic types.
      Each iteration of the algorithm, DIM calculates the difference in
      throughput, packet rate and interrupt rate from last iteration in order
      to make a decision. DIM relies on counters for each metric. When these
      counters get to their type's max value they wraparound. In this case
      the delta between 'end' and 'start' samples is negative and when
      translated to unsigned integers - very high. This results in a false
      indication to the algorithm and might result in a wrong decision.
      
      The fix calculates the 'distance' between 'end' and 'start' samples in a
      cyclic way around the relevant type's max value. It can also be viewed as
      an absolute value around the type's max value instead of around 0.
      
      Testing show higher stability in DIM profile selection and no wraparound
      issues.
      
      Fixes: cb3c7fd4 ("net/mlx5e: Support adaptive RX coalescing")
      Signed-off-by: default avatarTal Gilboa <talgi@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      78b24ab6
    • Tal Gilboa's avatar
      net/mlx5e: Added BW check for DIM decision mechanism · 9854e586
      Tal Gilboa authored
      
      [ Upstream commit c3164d2f ]
      
      DIM (Dynamically-tuned Interrupt Moderation) is a mechanism designed for
      changing the channel interrupt moderation values in order to reduce CPU
      overhead for all traffic types.
      Until now only interrupt and packet rate were sampled.
      We found a scenario on which we get a false indication since a change in
      DIM caused more aggregation and reduced packet rate while increasing BW.
      
      We now regard a change as succesfull iff:
      current_BW > (prev_BW + threshold) or
      current_BW ~= prev_BW and current_PR > (prev_PR + threshold) or
      current_BW ~= prev_BW and current_PR ~= prev_PR and
          current_IR < (prev_IR - threshold)
      Where BW = Bandwidth, PR = Packet rate and IR = Interrupt rate
      
      Improvements (ConnectX-4Lx 25GbE, single RX queue, LRO off)
          --------------------------------------------------
          packet size | before[Mb/s] | after[Mb/s] | gain  |
          2B          | 343.4        | 359.4       |  4.5% |
          16B         | 2739.7       | 2814.8      |  2.7% |
          64B         | 9739         | 10185.3     |  4.5% |
      
      Fixes: cb3c7fd4 ("net/mlx5e: Support adaptive RX coalescing")
      Signed-off-by: default avatarTal Gilboa <talgi@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9854e586
    • Jia-Ju Bai's avatar
      net: tipc: Fix a sleep-in-atomic bug in tipc_msg_reverse · 57360bc3
      Jia-Ju Bai authored
      
      [ Upstream commit 343eba69 ]
      
      The kernel may sleep under a rcu read lock in tipc_msg_reverse, and the
      function call path is:
      tipc_l2_rcv_msg (acquire the lock by rcu_read_lock)
        tipc_rcv
          tipc_sk_rcv
            tipc_msg_reverse
              pskb_expand_head(GFP_KERNEL) --> may sleep
      tipc_node_broadcast
        tipc_node_xmit_skb
          tipc_node_xmit
            tipc_sk_rcv
              tipc_msg_reverse
                pskb_expand_head(GFP_KERNEL) --> may sleep
      
      To fix it, "GFP_KERNEL" is replaced with "GFP_ATOMIC".
      Signed-off-by: default avatarJia-Ju Bai <baijiaju1990@163.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      57360bc3
    • Jia-Ju Bai's avatar
      net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx · bb566ce3
      Jia-Ju Bai authored
      
      [ Upstream commit f146e872 ]
      
      The kernel may sleep under a rcu read lock in cfpkt_create_pfx, and the
      function call path is:
      cfcnfg_linkup_rsp (acquire the lock by rcu_read_lock)
        cfctrl_linkdown_req
          cfpkt_create
            cfpkt_create_pfx
              alloc_skb(GFP_KERNEL) --> may sleep
      cfserl_receive (acquire the lock by rcu_read_lock)
        cfpkt_split
          cfpkt_create_pfx
            alloc_skb(GFP_KERNEL) --> may sleep
      
      There is "in_interrupt" in cfpkt_create_pfx to decide use "GFP_KERNEL" or
      "GFP_ATOMIC". In this situation, "GFP_KERNEL" is used because the function
      is called under a rcu read lock, instead in interrupt.
      
      To fix it, only "GFP_ATOMIC" is used in cfpkt_create_pfx.
      Signed-off-by: default avatarJia-Ju Bai <baijiaju1990@163.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb566ce3
    • Xin Long's avatar
      sctp: disable BH in sctp_for_each_endpoint · 8cda426a
      Xin Long authored
      
      [ Upstream commit 581409da ]
      
      Now sctp holds read_lock when foreach sctp_ep_hashtable without disabling
      BH. If CPU schedules to another thread A at this moment, the thread A may
      be trying to hold the write_lock with disabling BH.
      
      As BH is disabled and CPU cannot schedule back to the thread holding the
      read_lock, while the thread A keeps waiting for the read_lock. A dead
      lock would be triggered by this.
      
      This patch is to fix this dead lock by calling read_lock_bh instead to
      disable BH when holding the read_lock in sctp_for_each_endpoint.
      
      Fixes: 626d16f5 ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
      Reported-by: default avatarXiumei Mu <xmu@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8cda426a
    • Krister Johansen's avatar
      Fix an intermittent pr_emerg warning about lo becoming free. · c6d4ff85
      Krister Johansen authored
      
      [ Upstream commit f186ce61 ]
      
      It looks like this:
      
      Message from syslogd@flamingo at Apr 26 00:45:00 ...
       kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
      
      They seem to coincide with net namespace teardown.
      
      The message is emitted by netdev_wait_allrefs().
      
      Forced a kdump in netdev_run_todo, but found that the refcount on the lo
      device was already 0 at the time we got to the panic.
      
      Used bcc to check the blocking in netdev_run_todo.  The only places
      where we're off cpu there are in the rcu_barrier() and msleep() calls.
      That behavior is expected.  The msleep time coincides with the amount of
      time we spend waiting for the refcount to reach zero; the rcu_barrier()
      wait times are not excessive.
      
      After looking through the list of callbacks that the netdevice notifiers
      invoke in this path, it appears that the dst_dev_event is the most
      interesting.  The dst_ifdown path places a hold on the loopback_dev as
      part of releasing the dev associated with the original dst cache entry.
      Most of our notifier callbacks are straight-forward, but this one a)
      looks complex, and b) places a hold on the network interface in
      question.
      
      I constructed a new bcc script that watches various events in the
      liftime of a dst cache entry.  Note that dst_ifdown will take a hold on
      the loopback device until the invalidated dst entry gets freed.
      
      [      __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
          __dst_free
          rcu_nocb_kthread
          kthread
          ret_from_fork
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6d4ff85